LLM/Agentic AI Interview Questions - Medium
Medium-level LLM and Agentic AI interview questions covering agent architectures, RAG optimization, and production systems.
Q1: Design and implement a ReAct (Reasoning + Acting) agent.
Answer:
How ReAct Works:
ReAct alternates between reasoning (thinking) and acting (using tools) to solve problems.
Pattern:
Thought: I need to find information about X
Action: search("X")
Observation: [search results]
Thought: Now I need to calculate Y
Action: calculate("Y")
Observation: [result]
Thought: I have enough information to answer
Action: FINISH("answer")
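The pattern above can be sketched without any framework. The following is a minimal illustration, not a production agent: `run_react`, `scripted_llm`, the regex, and the canned replies (including the population figure) are all assumptions made up for this example, and a real model call would replace the scripted stand-in.

```python
import re
from typing import Callable, Dict, List

# Matches lines such as: Action: search("population of France")
ACTION_RE = re.compile(r'Action:\s*(\w+)\("(.*)"\)')

def run_react(llm: Callable[[str], str],
              tools: Dict[str, Callable[[str], str]],
              question: str,
              max_steps: int = 5) -> str:
    """Alternate Thought/Action/Observation until FINISH or the step budget runs out."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)              # model emits a Thought and an Action
        transcript += step + "\n"
        match = ACTION_RE.search(step)
        if match is None:
            # Feed the parse failure back so the model can correct itself
            transcript += "Observation: could not parse an action\n"
            continue
        name, arg = match.group(1), match.group(2)
        if name == "FINISH":
            return arg
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool '{name}'"
        transcript += f"Observation: {observation}\n"
    return "Stopped: step limit reached"

def scripted_llm(replies: List[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a real model: returns canned replies in order."""
    it = iter(replies)
    return lambda _prompt: next(it)

llm = scripted_llm([
    'Thought: I need the population.\nAction: search("population of France")',
    'Thought: Now compute 15%.\nAction: calculate("68000000 * 0.15")',
    'Thought: I have enough information.\nAction: FINISH("about 10.2 million")',
])
tools = {
    # Canned tool outputs for illustration only
    "search": lambda q: "France has roughly 68,000,000 inhabitants",
    "calculate": lambda expr: str(68000000 * 0.15),
}
answer = run_react(llm, tools, "What is 15% of the population of France?")
print(answer)
```

The loop has two exits, the FINISH action and the step limit, which is the same role `max_iterations` plays in a framework-based agent.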
LangChain Implementation:
from datetime import datetime

from langchain.agents import AgentExecutor, AgentType, Tool, create_react_agent, initialize_agent
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.utilities import WikipediaAPIWrapper

# Initialize LLM (temperature=0 keeps tool selection repeatable)
llm = OpenAI(temperature=0)

# Define tools
wikipedia = WikipediaAPIWrapper()

def calculate(expression: str) -> str:
    """Evaluate a mathematical expression.

    Note: eval with empty builtins blocks the obvious abuse but is not a real
    sandbox; parse or validate the expression before using this in production.
    """
    try:
        result = eval(expression, {"__builtins__": {}}, {})
        return str(result)
    except Exception as e:
        return f"Error: {e}"

def get_current_date(_: str = "") -> str:
    """Get current date (the agent always passes a string input, which is ignored)"""
    return datetime.now().strftime("%Y-%m-%d")

# Create tools list
tools = [
    Tool(
        name="Wikipedia",
        func=wikipedia.run,
        description="Search Wikipedia for information about a topic"
    ),
    Tool(
        name="Calculator",
        func=calculate,
        description="Evaluate mathematical expressions. Input should be a valid Python expression."
    ),
    Tool(
        name="CurrentDate",
        func=get_current_date,
        description="Get the current date in YYYY-MM-DD format"
    ),
]

# Initialize ReAct agent
# ZERO_SHOT_REACT_DESCRIPTION accepts arbitrary tools; REACT_DOCSTORE only
# works with exactly two tools named "Search" and "Lookup".
agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
    max_iterations=10,
    handle_parsing_errors=True
)

# Run agent
result = agent.run("What is 15% of the population of France?")
print(result)

# Alternative: Custom ReAct with more control

# Custom ReAct prompt template
react_prompt = PromptTemplate.from_template("""
You are a helpful assistant that can use tools to answer questions.

You have access to the following tools:
{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: {input}
Thought: {agent_scratchpad}
""")

# Create agent
agent = create_react_agent(llm, tools, react_prompt)
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    max_iterations=10
)

# Execute
result = agent_executor.invoke({
    "input": "What is 15% of the population of France?"
})
print(result["output"])
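The Calculator tool above relies on `eval`, which is risky when the expression comes from a model. One possible alternative, sketched here under the assumption that only basic arithmetic is needed, walks the parsed syntax tree and rejects everything except numbers and a few operators. The name `safe_calculate` is hypothetical, and exponentiation is deliberately left out because very large powers can stall the process.

```python
import ast
import operator

# Whitelisted operators; anything else is rejected
_BINARY = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv,
           ast.Mod: operator.mod}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}

def safe_calculate(expression: str) -> str:
    """Evaluate +, -, *, /, % on numeric literals; return an error string otherwise."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported syntax: {type(node).__name__}")

    try:
        return str(_eval(ast.parse(expression, mode="eval")))
    except (SyntaxError, ValueError, ZeroDivisionError, OverflowError) as e:
        return f"Error: {e}"

print(safe_calculate("2 + 3 * 4"))
print(safe_calculate("__import__('os').system('ls')"))
```

Returning the error as a string, rather than raising, lets the agent see the failure as an observation and try a different input.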
Why ReAct Works:
- Reasoning: Helps agent plan and reflect
- Acting: Grounds reasoning in real actions
- Observation: Provides feedback to adjust plan
Key Design Decisions:
- How to parse LLM output (regex, JSON, structured)
- Error handling (invalid actions, tool failures)
- When to stop (max steps, success criteria)
- How to format history (token efficiency)
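For the last decision, one simple option is to keep only the most recent steps that fit a budget. The sketch below is an illustration under stated assumptions: `trim_history` is a hypothetical helper, and whitespace-separated words stand in for tokens, which is only a rough proxy for a real tokenizer.

```python
from typing import List

def trim_history(steps: List[str], budget: int) -> List[str]:
    """Keep the newest steps whose combined word count fits within the budget."""
    kept: List[str] = []
    used = 0
    for step in reversed(steps):            # walk from newest to oldest
        cost = len(step.split())            # crude token estimate
        if used + cost > budget:
            break
        kept.append(step)
        used += cost
    return list(reversed(kept))             # restore chronological order

history = [
    "Thought: find population",
    "Observation: roughly 68 million",
    "Thought: compute fifteen percent of it",
]
print(trim_history(history, budget=10))
```

Dropping the oldest steps first preserves the observations the model most recently reasoned about; summarizing the dropped steps is a common refinement.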
Q2: Implement an advanced RAG system with re-ranking and hybrid search.
Answer:
How Advanced RAG Works:
- Hybrid Search: Combine semantic (vector) + keyword (BM25) search
- Re-ranking: Use cross-encoder to re-score top results
- Context Compression: Remove irrelevant parts
- Citation: Track which chunks were used
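To make the hybrid step concrete, here is a small sketch of weighted reciprocal rank fusion, one common way to merge a vector ranking with a keyword ranking. The function name, the document ids, and the constant `c = 60` are assumptions for illustration; a given retriever library may fuse results differently.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

def weighted_rrf(rankings: Sequence[List[str]],
                 weights: Sequence[float],
                 c: int = 60) -> List[str]:
    """Fuse ranked id lists: each list adds weight / (c + rank) to a document's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

vector_hits = ["d1", "d2", "d3"]   # hypothetical semantic search order
bm25_hits = ["d3", "d1", "d4"]     # hypothetical keyword search order
fused = weighted_rrf([vector_hits, bm25_hits], weights=[0.7, 0.3])
print(fused)  # ['d1', 'd3', 'd2', 'd4']
```

Documents that appear in both lists (d1, d3) rise above those found by only one retriever, which is the recall benefit hybrid search aims for. Because the fusion uses ranks rather than raw scores, the two retrievers do not need comparable score scales.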
LangChain Implementation:
from typing import Dict

from langchain.chains import LLMChain, RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.embeddings import OpenAIEmbeddings, HuggingFaceEmbeddings
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.retrievers import BM25Retriever, ContextualCompressionRetriever, EnsembleRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker, LLMChainExtractor
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Initialize components
llm = OpenAI(temperature=0)
embeddings = OpenAIEmbeddings()  # or HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Load and split documents
loader = TextLoader("documents.txt")
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
texts = text_splitter.split_documents(documents)

# Create vector store for semantic search
vectorstore = FAISS.from_documents(texts, embeddings)

# Create BM25 retriever for keyword search (requires the rank_bm25 package)
bm25_retriever = BM25Retriever.from_documents(texts)
bm25_retriever.k = 10

# Create ensemble retriever (hybrid search)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
ensemble_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.7, 0.3]  # 70% semantic, 30% keyword
)

# Re-ranking with cross-encoder
# CrossEncoderReranker expects a LangChain cross-encoder wrapper, not a raw
# sentence_transformers.CrossEncoder instance
reranker = HuggingFaceCrossEncoder(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2")

compressor = CrossEncoderReranker(
    model=reranker,
    top_n=5
)

# Context compression: keep only the top re-ranked chunks
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=ensemble_retriever
)

# Alternative: LLM-based compression (extracts the relevant passages from each chunk)
llm_compressor = LLMChainExtractor.from_llm(llm)
compression_retriever_llm = ContextualCompressionRetriever(
    base_compressor=llm_compressor,
    base_retriever=ensemble_retriever
)

# Create QA chain with citations
prompt_template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

Context:
{context}

Question: {question}

Answer based on the context above. Include citations [1], [2], etc. for each source used.

Answer:"""

PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)

# Note: the "stuff" chain concatenates chunks without numbering them, so the
# [n] citations are only reliable with the custom class below, which numbers
# each chunk before prompting.
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=compression_retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

# Query with advanced RAG
query = "Who created Python and when?"
result = qa_chain({"query": query})

print(f"Answer: {result['result']}")
print("\nSources:")
for i, doc in enumerate(result['source_documents'], 1):
    print(f"[{i}] {doc.page_content[:100]}...")
    print(f"    Metadata: {doc.metadata}\n")

# Custom advanced RAG class
class AdvancedRAG:
    def __init__(self, llm, embeddings, documents):
        self.llm = llm
        self.embeddings = embeddings

        # Split documents
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200
        )
        self.texts = text_splitter.split_documents(documents)

        # Create retrievers (fetch more candidates than needed; the re-ranker trims them)
        self.vectorstore = FAISS.from_documents(self.texts, embeddings)
        self.vector_retriever = self.vectorstore.as_retriever(search_kwargs={"k": 20})
        self.bm25_retriever = BM25Retriever.from_documents(self.texts)
        self.bm25_retriever.k = 20

        # Ensemble retriever
        self.ensemble_retriever = EnsembleRetriever(
            retrievers=[self.vector_retriever, self.bm25_retriever],
            weights=[0.7, 0.3]
        )

        # Re-ranker
        reranker_model = HuggingFaceCrossEncoder(
            model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"
        )
        self.compressor = CrossEncoderReranker(model=reranker_model, top_n=5)

        # Compression retriever
        self.compression_retriever = ContextualCompressionRetriever(
            base_compressor=self.compressor,
            base_retriever=self.ensemble_retriever
        )

        # Prompt (reuses the citation prompt defined above)
        self.prompt = PROMPT
        self.chain = LLMChain(llm=llm, prompt=self.prompt)

    def query(self, question: str) -> Dict:
        # Retrieve and compress
        docs = self.compression_retriever.get_relevant_documents(question)

        # Format context with citations: number each chunk so the model can cite it
        context = "\n\n".join([
            f"[{i+1}] {doc.page_content}"
            for i, doc in enumerate(docs)
        ])

        # Generate answer
        answer = self.chain.run(context=context, question=question)

        # Prepare sources (indices match the [n] markers in the context)
        sources = [
            {
                "index": i+1,
                "text": doc.page_content,
                "metadata": doc.metadata
            }
            for i, doc in enumerate(docs)
        ]

        return {
            "answer": answer,
            "sources": sources
        }

# Usage
rag = AdvancedRAG(llm, embeddings, documents)
result = rag.query("Who created Python and when?")
print(f"Answer: {result['answer']}")
print(f"\nSources: {len(result['sources'])} documents")
Key Improvements:
- Hybrid Search: Better recall (finds more relevant docs)
- Re-ranking: Better precision (ranks best docs higher)
- Compression: Reduces token usage, focuses on relevant parts
- Citations: Provides source attribution
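Trade-offs:
- More compute (several models and indexes are involved in every query)
- Higher latency (re-ranking adds a model pass over each candidate chunk)
- In exchange: better answer quality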
Q3: Implement agent memory (short-term and long-term).
Answer:
How Agent Memory Works:
- Short-term: Recent conversation context (sliding window)
- Long-term: Persistent storage of important information (vector DB)
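The two tiers can be illustrated without any framework. In this minimal sketch, a bounded deque plays the sliding window and a plain list with word-overlap scoring stands in for the vector database; `TinyMemory` and its methods are hypothetical names, and real systems would use embeddings rather than word overlap.

```python
from collections import deque
from typing import Deque, List, Tuple

class TinyMemory:
    def __init__(self, window: int = 4):
        # Short-term: the deque silently drops the oldest turn once full
        self.short_term: Deque[Tuple[str, str]] = deque(maxlen=window)
        # Long-term: stand-in for a vector store
        self.long_term: List[str] = []

    def add(self, role: str, text: str, keep: bool = False) -> None:
        self.short_term.append((role, text))
        if keep:
            self.long_term.append(text)

    def recall(self, query: str, k: int = 2) -> List[str]:
        """Return up to k stored texts sharing the most words with the query."""
        q = set(query.lower().split())
        scored = [(len(q & set(t.lower().split())), t) for t in self.long_term]
        scored = [s for s in scored if s[0] > 0]
        scored.sort(key=lambda s: s[0], reverse=True)
        return [t for _, t in scored[:k]]

mem = TinyMemory(window=2)
mem.add("user", "My name is Alice", keep=True)
mem.add("assistant", "Nice to meet you")
mem.add("user", "I prefer Python over JavaScript", keep=True)

print(list(mem.short_term))   # only the two most recent turns remain
print(mem.recall("which language do I prefer"))
```

The first message has already left the window, yet it is still retrievable from long-term storage, which is the point of keeping both tiers.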
LangChain Implementation:
from langchain.chains import ConversationChain
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.memory import (
    ConversationBufferWindowMemory,
    ConversationSummaryBufferMemory,
    ConversationSummaryMemory,
    VectorStoreRetrieverMemory,
)
from langchain.schema import Document
from langchain.vectorstores import FAISS

class AgentMemory:
    def __init__(self, llm, embeddings, max_short_term=10):
        self.llm = llm
        self.embeddings = embeddings

        # Short-term memory: conversation buffer with sliding window.
        # return_messages=False yields a plain string, which is what the
        # default ConversationChain prompt and get_full_context() expect.
        self.short_term_memory = ConversationBufferWindowMemory(
            k=max_short_term,
            return_messages=False
        )

        # Alternative: summary memory (compresses old messages)
        self.summary_memory = ConversationSummaryMemory(llm=llm)

        # Long-term memory: vector store for semantic search, created lazily
        # on the first write so that no empty placeholder text gets embedded
        self.long_term_store = None
        self.long_term_memory = None

    def add_message(self, role: str, content: str, metadata: dict = None):
        """Add message to short-term memory"""
        if role == "user":
            self.short_term_memory.chat_memory.add_user_message(content)
        else:
            self.short_term_memory.chat_memory.add_ai_message(content)

        # Check if important for long-term storage
        if self._is_important(content):
            self.add_to_long_term(content, metadata)

    def _is_important(self, content: str) -> bool:
        """Determine if content should be stored long-term (simple keyword heuristic)"""
        important_keywords = ["remember", "important", "always", "never", "prefer", "name is"]
        return any(keyword in content.lower() for keyword in important_keywords)

    def add_to_long_term(self, content: str, metadata: dict = None):
        """Store in long-term memory (vector store)"""
        # Create document with metadata
        doc = Document(
            page_content=content,
            metadata=metadata or {}
        )

        if self.long_term_store is None:
            # First write: build the store and the retriever-backed memory
            self.long_term_store = FAISS.from_documents([doc], self.embeddings)
            self.long_term_memory = VectorStoreRetrieverMemory(
                retriever=self.long_term_store.as_retriever(search_kwargs={"k": 3})
            )
        else:
            # The retriever references the store, so new documents are searchable immediately
            self.long_term_store.add_documents([doc])

    def get_full_context(self, query: str) -> str:
        """Combine short-term and long-term memories"""
        # Get short-term context
        short_term = self.short_term_memory.load_memory_variables({})

        # Combine
        context = ""
        if self.long_term_memory is not None:
            # Get long-term relevant memories
            long_term = self.long_term_memory.load_memory_variables({"prompt": query})
            if long_term.get("history"):
                context += f"Relevant memories:\n{long_term['history']}\n\n"

        if short_term.get("history"):
            context += f"Recent conversation:\n{short_term['history']}"

        return context

    def summarize_old_messages(self):
        """Summarize the conversation so far and store the summary long-term"""
        messages = self.short_term_memory.chat_memory.messages
        if not messages:
            return
        # Fold the messages into the running summary (one LLM call)
        summary = self.summary_memory.predict_new_summary(
            messages, self.summary_memory.buffer
        )
        self.summary_memory.buffer = summary
        self.add_to_long_term(summary, {"type": "summary"})

# Usage with LangChain chains
llm = OpenAI(temperature=0)
embeddings = OpenAIEmbeddings()

memory = AgentMemory(llm, embeddings, max_short_term=10)

# Add messages
memory.add_message("user", "My name is Alice")
memory.add_message("assistant", "Nice to meet you, Alice!")
memory.add_message("user", "I prefer Python over JavaScript")
memory.add_message("assistant", "Noted! I'll remember your preference for Python.")

# Query with full context (relevant long-term memories + recent conversation)
query = "What programming language do I like?"
context = memory.get_full_context(query)
response = llm.invoke(f"{context}\n\nUser: {query}\nAssistant:")
print(response)

# Or: conversation chain backed by the short-term memory only
conversation = ConversationChain(
    llm=llm,
    memory=memory.short_term_memory,
    verbose=True
)
print(conversation.predict(input=query))

# Alternative: Using ConversationSummaryBufferMemory
summary_memory = ConversationSummaryBufferMemory(
    llm=llm,
    max_token_limit=1000
)

# Add messages with save_context: unlike chat_memory.add_*_message, it
# triggers summarization once the token limit is exceeded
summary_memory.save_context(
    {"input": "My name is Alice"},
    {"output": "Nice to meet you, Alice!"}
)
summary_memory.save_context(
    {"input": "I prefer Python over JavaScript"},
    {"output": "Noted!"}
)

# The chain calls save_context after every turn, so the memory keeps
# summarizing automatically when the token limit is reached
conversation_with_summary = ConversationChain(
    llm=llm,
    memory=summary_memory,
    verbose=True
)

# Query
result = conversation_with_summary.predict(
    input="What programming language do I like?"
)
print(result)
Key Features:
- Sliding window: Recent context always available
- Semantic search: Find relevant past information
- Importance filtering: Only store meaningful content
- Summarization: Compress old conversations
- Persistence: Save/load across sessions (for example, by writing the vector store to disk; not shown above)
Q4: Implement tool/function calling with error handling and retry logic.
Answer:
How Tool Calling Works:
LLM decides which tool to call and with what arguments. System executes and returns result.
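As a framework-free illustration of that loop, the sketch below assumes the model emits a JSON object with `name` and `arguments` fields (a common but not universal convention). The registry `TOOLS`, the `tool` decorator, and `dispatch` are hypothetical names for this example.

```python
import json
from typing import Any, Callable, Dict

TOOLS: Dict[str, Callable[..., Any]] = {}

def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function under its own name so the model can call it."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def add(a: float, b: float) -> float:
    return a + b

def dispatch(tool_call_json: str) -> str:
    """Execute the requested tool; return errors as text the model can read."""
    try:
        call = json.loads(tool_call_json)
        fn = TOOLS[call["name"]]
        return str(fn(**call.get("arguments", {})))
    except json.JSONDecodeError as e:
        return f"Error: invalid JSON ({e.msg})"
    except KeyError as e:
        return f"Error: unknown tool or missing field {e}"
    except TypeError as e:
        return f"Error: bad arguments ({e})"

print(dispatch('{"name": "add", "arguments": {"a": 2, "b": 3}}'))  # 5
print(dispatch('{"name": "multiply", "arguments": {}}'))
```

Every failure path returns a string instead of raising, so the result can be fed back to the model as an observation and it gets a chance to correct the call.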
LangChain Implementation with Robust Error Handling:
import json
import time
from typing import Callable

from langchain.agents import AgentExecutor, AgentType, Tool, create_openai_functions_agent, initialize_agent
from langchain.callbacks import get_openai_callback
from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.tools import StructuredTool
# Depending on the installed LangChain release, these may need to come from
# langchain.pydantic_v1 instead
from pydantic import BaseModel, Field

# Define tools with input validation
def search_web(query: str, max_results: int = 5) -> str:
    """Search the web for information.

    Args:
        query: Search query (must be at least 3 characters)
        max_results: Maximum number of results (1-10)

    Returns:
        JSON string with search results
    """
    if not query or len(query) < 3:
        raise ValueError("Query must be at least 3 characters")
    if max_results < 1 or max_results > 10:
        raise ValueError("max_results must be between 1 and 10")

    # Simulated search
    results = [f"Result {i} for '{query}'" for i in range(max_results)]
    return json.dumps({"results": results})

def calculate(expression: str) -> str:
    """Evaluate mathematical expression.

    Args:
        expression: Mathematical expression to evaluate

    Returns:
        Result as string
    """
    try:
        # eval with empty builtins is not a sandbox; in production, parse the
        # expression (e.g. with the ast module) and allow only arithmetic
        result = eval(expression, {"__builtins__": {}}, {})
        return str(float(result))
    except Exception as e:
        raise ValueError(f"Invalid expression: {e}")

def send_email(to: str, subject: str, body: str) -> str:
    """Send an email.

    Args:
        to: Email address (must contain @)
        subject: Email subject (cannot be empty)
        body: Email body

    Returns:
        Status message
    """
    if "@" not in to:
        raise ValueError("Invalid email address")
    if not subject:
        raise ValueError("Subject cannot be empty")

    # Simulated email sending
    return f"Email sent to {to} with subject '{subject}'"

def send_email_from_json(payload: str) -> str:
    """Adapter for single-input agents: Tool passes one string, so expect JSON."""
    try:
        args = json.loads(payload)
        return send_email(args["to"], args["subject"], args.get("body", ""))
    except (json.JSONDecodeError, KeyError, TypeError) as e:
        raise ValueError(f"Expected a JSON object with 'to', 'subject', 'body': {e}")

# Retry wrapper around a tool function
class RetryTool:
    def __init__(self, func: Callable[..., str], max_retries: int = 3):
        self.func = func
        self.max_retries = max_retries

    def run(self, *args, **kwargs) -> str:
        """Execute tool with retry logic; errors are returned as text so the LLM can react"""
        last_error = None

        for attempt in range(self.max_retries):
            try:
                return self.func(*args, **kwargs)
            except ValueError as e:
                # Invalid input will not succeed on retry; report it immediately
                return f"Error: {e}"
            except Exception as e:
                last_error = str(e)
                if attempt < self.max_retries - 1:
                    # Exponential backoff: 1s, 2s, 4s, ...
                    time.sleep(2 ** attempt)

        return f"Error after {self.max_retries} attempts: {last_error}"

# Create tools, each wrapped with retry logic
tools = [
    Tool(
        name="search_web",
        func=RetryTool(search_web, max_retries=3).run,
        description="Search the web for information. Input should be a search query string."
    ),
    Tool(
        name="calculate",
        func=RetryTool(calculate, max_retries=3).run,
        description="Evaluate mathematical expressions. Input should be a valid Python expression."
    ),
    Tool(
        name="send_email",
        func=RetryTool(send_email_from_json, max_retries=3).run,
        description="Send an email. Input should be a JSON object with 'to', 'subject', and 'body' keys."
    ),
]

# Initialize agent with error handling
llm = OpenAI(temperature=0)

agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
    max_iterations=10,
    handle_parsing_errors=True,  # Handle parsing errors gracefully
    return_intermediate_steps=True
)

# Usage with error handling
# With return_intermediate_steps=True the agent has two output keys, so call
# it with a dict instead of agent.run(), which requires a single output key
try:
    result = agent.invoke({"input": "What is 15% of 240?"})
    print(f"Result: {result['output']}")
    print(f"Tool calls: {len(result['intermediate_steps'])}")
except Exception as e:
    print(f"Agent error: {e}")

# Alternative: Using OpenAI Functions (structured tool calling)
chat_llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")

# Create function-calling agent
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant with access to tools."),
    ("user", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad"),
])

agent = create_openai_functions_agent(chat_llm, tools, prompt)
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    handle_parsing_errors=True,
    max_iterations=10
)

# Execute with callback for monitoring
with get_openai_callback() as cb:
    result = agent_executor.invoke({
        "input": "What is 15% of 240? Use the calculator tool."
    })
    print(f"Result: {result['output']}")
    print(f"Tokens used: {cb.total_tokens}")

# Structured tools: arguments are validated against a schema before the call
class SearchInput(BaseModel):
    query: str = Field(description="Search query")
    max_results: int = Field(default=5, description="Maximum results", ge=1, le=10)

def search_with_validation(query: str, max_results: int = 5) -> str:
    """Search with Pydantic validation"""
    return search_web(query, max_results)

# Structured tool with validation
structured_tool = StructuredTool.from_function(
    func=search_with_validation,
    name="search",
    description="Search the web",
    args_schema=SearchInput
)

# Agent with structured tools (this agent type supports multi-argument tools)
structured_agent = initialize_agent(
    tools=[structured_tool] + tools[1:],
    llm=chat_llm,
    agent=AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)
Key Features:
- Schema validation: Tools define expected inputs
- Retry logic: Automatic retries with exponential backoff
- Error recovery: LLM attempts to fix invalid arguments
- Detailed results: Track attempts and errors (intermediate steps expose each tool call and its observation)
Q5: How do you evaluate LLM outputs in production?
Answer:
Evaluation Strategies:
1. Automated Metrics
Implementation:
from typing import Dict, List

import numpy as np
from transformers import pipeline

class LLMEvaluator:
    def __init__(self, embedding_model):
        # embedding_model must expose encode(list_of_texts), e.g. a SentenceTransformer
        self.embedding_model = embedding_model
        # Load the classifiers once; building a pipeline on every call reloads the model
        self.nli = pipeline("text-classification",
                            model="microsoft/deberta-base-mnli")
        self.toxicity = pipeline("text-classification",
                                 model="unitary/toxic-bert")

    def semantic_similarity(self, generated: str, reference: str) -> float:
        """Measure semantic similarity (cosine) between generated and reference"""
        emb1 = self.embedding_model.encode([generated])[0]
        emb2 = self.embedding_model.encode([reference])[0]

        similarity = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
        return float(similarity)

    def factual_consistency(self, generated: str, context: str) -> float:
        """Check if generated text is consistent with context"""
        # Use NLI model to check entailment: context is the premise,
        # generated text the hypothesis, passed as a proper sentence pair
        result = self.nli({"text": context, "text_pair": generated})
        if isinstance(result, list):
            result = result[0]

        # Return the score only when the top label is entailment
        if result["label"].upper() == "ENTAILMENT":
            return float(result["score"])
        return 0.0

    def toxicity_score(self, text: str) -> float:
        """Measure toxicity of generated text"""
        result = self.toxicity(text)
        return float(result[0]["score"]) if result[0]["label"] == "toxic" else 0.0

    def evaluate_batch(self, generations: List[Dict]) -> Dict:
        """Evaluate a batch of generations"""
        results = {
            "semantic_similarity": [],
            "factual_consistency": [],
            "toxicity": [],
            "length": []
        }

        for gen in generations:
            # Reference-based and context-based metrics are optional per sample
            if "reference" in gen:
                results["semantic_similarity"].append(
                    self.semantic_similarity(gen["generated"], gen["reference"])
                )

            if "context" in gen:
                results["factual_consistency"].append(
                    self.factual_consistency(gen["generated"], gen["context"])
                )

            results["toxicity"].append(self.toxicity_score(gen["generated"]))
            results["length"].append(len(gen["generated"].split()))

        # Aggregate
        summary = {}
        for metric, values in results.items():
            if values:
                summary[metric] = {
                    "mean": float(np.mean(values)),
                    "std": float(np.std(values)),
                    "min": float(np.min(values)),
                    "max": float(np.max(values))
                }

        return summary
2. LLM-as-Judge
Implementation:
import json
from typing import Dict

def llm_judge(generated: str, criteria: str, judge_llm) -> Dict:
    """Use LLM to evaluate another LLM's output"""
    prompt = f"""Evaluate the following response based on these criteria:
{criteria}

Response to evaluate:
{generated}

Provide:
1. Score (1-10)
2. Reasoning
3. Specific issues (if any)

Format as JSON:
{{
    "score": <1-10>,
    "reasoning": "<explanation>",
    "issues": ["<issue1>", "<issue2>"]
}}"""

    # judge_llm is assumed to expose generate(prompt) -> str; adapt to your client
    response = judge_llm.generate(prompt)

    try:
        evaluation = json.loads(response)
        return evaluation
    except (json.JSONDecodeError, TypeError):
        # Score 0 is outside the 1-10 scale and marks an unparseable reply
        return {"score": 0, "reasoning": "Failed to parse", "issues": []}

# Usage (generated_text and judge_llm are supplied by the calling application)
criteria = """
- Accuracy: Is the information correct?
- Relevance: Does it answer the question?
- Clarity: Is it easy to understand?
- Completeness: Does it cover all aspects?
"""

evaluation = llm_judge(generated_text, criteria, judge_llm)
print(f"Score: {evaluation['score']}/10")
print(f"Reasoning: {evaluation['reasoning']}")
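Judge models often wrap the requested JSON in extra prose, which makes a bare `json.loads` fail. A more tolerant parser could look like the sketch below; `parse_judge_reply` is a hypothetical helper, and the choice to reject out-of-range scores is an assumption about what the caller wants.

```python
import json
import re
from typing import Optional

def parse_judge_reply(reply: str) -> Optional[dict]:
    """Extract the first {...} span, parse it, and validate the score range."""
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    score = data.get("score")
    # bool is a subclass of int, so exclude it explicitly
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    if not 1 <= score <= 10:
        return None
    return data

reply = 'Here is my evaluation:\n{"score": 8, "reasoning": "clear", "issues": []}\nHope that helps.'
print(parse_judge_reply(reply))
print(parse_judge_reply("I cannot evaluate this."))  # None
```

Returning `None` on failure keeps unparseable replies out of the aggregate statistics instead of counting them as a score of zero.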
3. Human Feedback Loop
Implementation:
from datetime import datetime
from typing import Dict, List

import numpy as np

class FeedbackCollector:
    def __init__(self):
        # In-memory store; swap for a database in production
        self.feedback_db = []

    def collect_feedback(self, query: str, response: str,
                         rating: int, comments: str = ""):
        """Collect user feedback"""
        feedback = {
            "timestamp": datetime.now().isoformat(),
            "query": query,
            "response": response,
            "rating": rating,  # 1-5 stars
            "comments": comments
        }
        self.feedback_db.append(feedback)

    def get_low_rated_samples(self, threshold: int = 3) -> List[Dict]:
        """Get samples that need improvement (rating below threshold)"""
        return [f for f in self.feedback_db if f["rating"] < threshold]

    def analyze_feedback(self) -> Dict:
        """Analyze feedback trends"""
        if not self.feedback_db:
            return {}

        ratings = [f["rating"] for f in self.feedback_db]

        return {
            "total_samples": len(self.feedback_db),
            "average_rating": float(np.mean(ratings)),
            "rating_distribution": {
                i: ratings.count(i) for i in range(1, 6)
            },
            "low_rated_count": len(self.get_low_rated_samples())
        }
4. A/B Testing
Implementation:
import hashlib
from typing import Dict

import numpy as np
from scipy import stats

class ABTester:
    def __init__(self):
        self.results = {"A": [], "B": []}

    def get_variant(self, user_id: str) -> str:
        """Assign user to variant (stable hashing)"""
        # Built-in hash() is salted per process for strings, so the same user
        # could switch variants after a restart; use a stable digest instead
        digest = hashlib.sha256(user_id.encode("utf-8")).digest()
        return "A" if digest[0] % 2 == 0 else "B"

    def log_result(self, variant: str, metric: float):
        """Log result for variant"""
        self.results[variant].append(metric)

    def analyze(self) -> Dict:
        """Statistical analysis of A/B test"""
        a_results = self.results["A"]
        b_results = self.results["B"]

        # A t-test needs at least two observations per variant
        if len(a_results) < 2 or len(b_results) < 2:
            return {"error": "Insufficient data"}

        # T-test
        t_stat, p_value = stats.ttest_ind(a_results, b_results)

        mean_a, mean_b = float(np.mean(a_results)), float(np.mean(b_results))
        significant = bool(p_value < 0.05)

        # Declare a winner only when the difference is significant
        winner = None
        if significant:
            winner = "B" if mean_b > mean_a else "A"

        return {
            "variant_A": {
                "mean": mean_a,
                "std": float(np.std(a_results)),
                "count": len(a_results)
            },
            "variant_B": {
                "mean": mean_b,
                "std": float(np.std(b_results)),
                "count": len(b_results)
            },
            "p_value": float(p_value),
            "significant": significant,
            "winner": winner
        }
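To show what the t-test step measures, the sketch below computes Welch's t statistic and its degrees of freedom by hand. This is the unequal-variance form, which `scipy.stats.ttest_ind` uses when called with `equal_var=False`, whereas the listing above uses the default equal-variance test. The function name `welch_t` and the sample data are made up for illustration, and turning t into a p-value still requires the t distribution, for which a statistics library is the practical choice.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple

def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    # Squared standard error of each sample mean
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical per-request quality scores for two variants
scores_a = [1, 2, 3, 4, 5]
scores_b = [2, 3, 4, 5, 6]
t_stat, dof = welch_t(scores_a, scores_b)
print(t_stat, dof)
```

A t statistic close to zero, as here, means the difference in means is small relative to the noise, so no winner should be declared from this sample.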
Production Evaluation Checklist:
- ✅ Automated metrics (similarity, consistency, toxicity)
- ✅ LLM-as-judge for qualitative assessment
- ✅ Human feedback collection
- ✅ A/B testing for improvements
- ✅ Monitoring dashboards
- ✅ Alert thresholds
- ✅ Regular manual review
Summary
Medium-level LLM/Agent topics:
- ReAct agents: Reasoning + acting pattern
- Advanced RAG: Hybrid search, re-ranking, compression
- Agent memory: Short-term and long-term storage
- Tool calling: Robust execution with error handling
- Evaluation: Automated metrics, LLM-as-judge, human feedback
Key Skills:
- Design agent architectures
- Optimize RAG pipelines
- Handle errors gracefully
- Evaluate quality in production
- Iterate based on feedback