LLM/Agentic AI Interview Questions - Medium

Medium-level LLM and Agentic AI interview questions covering agent architectures, RAG optimization, and production systems.

Q1: Design and implement a ReAct (Reasoning + Acting) agent.

Answer:

How ReAct Works:

ReAct alternates between reasoning (thinking) and acting (using tools) to solve problems.

Pattern:

Thought: I need to find information about X
Action: search("X")
Observation: [search results]
Thought: Now I need to calculate Y
Action: calculate("Y")
Observation: [result]
Thought: I have enough information to answer
Action: FINISH("answer")
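
The loop behind this pattern can be sketched without any framework. This is a minimal illustration, not the LangChain implementation: run_react, scripted_model, and the word_count tool are hypothetical names, and the scripted function stands in for a real LLM call.

```python
import re
from typing import Callable, Dict

# Matches lines such as: Action: search("X")
ACTION_RE = re.compile(r'Action:\s*(\w+)\("(.*)"\)')


def run_react(model: Callable[[str], str],
              tools: Dict[str, Callable[[str], str]],
              question: str,
              max_steps: int = 5) -> str:
    """Alternate model steps and tool calls until FINISH or the step limit."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = model(transcript)  # expected: "Thought: ...\nAction: tool("arg")"
        transcript += step + "\n"
        match = ACTION_RE.search(step)
        if match is None:
            # Tell the model what went wrong instead of crashing
            transcript += "Observation: could not parse an action\n"
            continue
        name, arg = match.groups()
        if name == "FINISH":
            return arg
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool '{name}'"
        transcript += f"Observation: {observation}\n"
    return "Stopped: step limit reached"


def scripted_model(transcript: str) -> str:
    """Hypothetical stand-in for an LLM: act once, then finish with the observation."""
    if "Observation:" not in transcript:
        return ('Thought: I need to count the words.\n'
                'Action: word_count("to be or not to be")')
    last_observation = transcript.strip().rsplit("Observation: ", 1)[-1]
    return ('Thought: I have enough information to answer.\n'
            f'Action: FINISH("{last_observation}")')


tools = {"word_count": lambda text: str(len(text.split()))}
answer = run_react(scripted_model, tools, "How many words are in the phrase?")
print(answer)
```

The transcript grows with every step, which is why history formatting and a step limit matter in practice.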

LangChain Implementation:

from datetime import datetime

# Legacy LangChain import paths; newer releases move these classes into
# the langchain_community and langchain_openai packages.
from langchain.agents import AgentType, Tool, initialize_agent
from langchain.llms import OpenAI
from langchain.utilities import WikipediaAPIWrapper

# Initialize LLM
llm = OpenAI(temperature=0)

# Define tools
wikipedia = WikipediaAPIWrapper()

def calculate(expression: str) -> str:
    """Evaluate a mathematical expression.

    Emptying __builtins__ restricts eval() but does not make it safe;
    use a real expression parser for untrusted input.
    """
    try:
        result = eval(expression, {"__builtins__": {}}, {})
        return str(result)
    except Exception as e:
        return f"Error: {str(e)}"

def get_current_date(_: str = "") -> str:
    """Get current date (Tool always passes one string argument; it is ignored)"""
    return datetime.now().strftime("%Y-%m-%d")

# Create tools list
tools = [
    Tool(
        name="Wikipedia",
        func=wikipedia.run,
        description="Search Wikipedia for information about a topic"
    ),
    Tool(
        name="Calculator",
        func=calculate,
        description="Evaluate mathematical expressions. Input should be a valid Python expression."
    ),
    Tool(
        name="CurrentDate",
        func=get_current_date,
        description="Get the current date in YYYY-MM-DD format"
    ),
]

# Initialize ReAct agent.
# ZERO_SHOT_REACT_DESCRIPTION is the general-purpose ReAct agent type;
# REACT_DOCSTORE only accepts exactly two tools named "Search" and "Lookup".
agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
    max_iterations=10,
    handle_parsing_errors=True
)

# Run agent
result = agent.run("What is 15% of the population of France?")
print(result)

# Alternative: Custom ReAct with more control
from langchain.agents import AgentExecutor, create_react_agent
from langchain.prompts import PromptTemplate

# Custom ReAct prompt template
react_prompt = PromptTemplate.from_template("""
You are a helpful assistant that can use tools to answer questions.

You have access to the following tools:
{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: {input}
Thought: {agent_scratchpad}
""")

# Create agent
agent = create_react_agent(llm, tools, react_prompt)
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    max_iterations=10
)

# Execute
result = agent_executor.invoke({
    "input": "What is 15% of the population of France?"
})
print(result["output"])
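
The Calculator tool above relies on eval(), which is not safe for model-generated input even with empty builtins. A minimal sketch of a stricter alternative follows, assuming only +, -, *, / on numeric literals are needed; safe_calculate is a hypothetical helper, not part of LangChain.

```python
import ast
import operator

# Whitelist of supported operators; anything else is rejected
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_calculate(expression: str) -> float:
    """Evaluate +, -, *, / on numeric literals; raise ValueError otherwise."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"Unsupported expression element: {type(node).__name__}")

    try:
        tree = ast.parse(expression, mode="eval")
    except SyntaxError as exc:
        raise ValueError(f"Invalid expression: {exc}") from exc
    return float(_eval(tree))


print(safe_calculate("0.15 * 240"))
```

Function calls, attribute access, and names never reach evaluation because their node types are not in the whitelist.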

Why ReAct Works:

  • Reasoning: Helps agent plan and reflect
  • Acting: Grounds reasoning in real actions
  • Observation: Provides feedback to adjust plan

Key Design Decisions:

  • How to parse LLM output (regex, JSON, structured)
  • Error handling (invalid actions, tool failures)
  • When to stop (max steps, success criteria)
  • How to format history (token efficiency)
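
As an illustration of the first two decisions, here is a sketch of a regex parser for the Thought/Action/Action Input format that reports invalid output as data rather than raising. ParsedStep and parse_step are hypothetical names introduced for this example.

```python
import re
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class ParsedStep:
    kind: str                       # "action", "final", or "error"
    tool: Optional[str] = None
    tool_input: Optional[str] = None
    text: Optional[str] = None      # final answer or error message


FINAL_RE = re.compile(r"Final Answer:\s*(.*)", re.DOTALL)
ACTION_RE = re.compile(r"Action:\s*(.+?)\s*\nAction Input:\s*(.*)", re.DOTALL)


def parse_step(output: str, allowed_tools: Set[str]) -> ParsedStep:
    """Classify one LLM step; errors can be fed back as the next observation."""
    final = FINAL_RE.search(output)
    if final:
        return ParsedStep(kind="final", text=final.group(1).strip())
    action = ACTION_RE.search(output)
    if not action:
        return ParsedStep(kind="error",
                          text="Expected 'Action:' and 'Action Input:' lines")
    tool, tool_input = action.group(1).strip(), action.group(2).strip()
    if tool not in allowed_tools:
        return ParsedStep(kind="error", text=f"Unknown tool '{tool}'")
    return ParsedStep(kind="action", tool=tool, tool_input=tool_input)


step = parse_step(
    "Thought: I need math\nAction: Calculator\nAction Input: 0.15 * 240",
    {"Calculator"},
)
print(step)
```

Returning an error step instead of raising lets the agent loop show the message to the model and continue, up to the step limit.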

Q2: Implement an advanced RAG pipeline with hybrid search, re-ranking, and citations.

Answer:

How Advanced RAG Works:

  1. Hybrid Search: Combine semantic (vector) + keyword (BM25) search
  2. Re-ranking: Use cross-encoder to re-score top results
  3. Context Compression: Remove irrelevant parts
  4. Citation: Track which chunks were used

LangChain Implementation:

# Legacy LangChain import paths; BM25Retriever also needs the rank_bm25 package
from langchain.embeddings import OpenAIEmbeddings, HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.retrievers import (
    BM25Retriever,
    ContextualCompressionRetriever,
    EnsembleRetriever,
)
from langchain.retrievers.document_compressors import (
    CrossEncoderReranker,
    LLMChainExtractor,
)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.document_loaders import TextLoader
# CrossEncoderReranker expects a LangChain cross-encoder wrapper,
# not a raw sentence_transformers.CrossEncoder
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Initialize components
llm = OpenAI(temperature=0)
embeddings = OpenAIEmbeddings()  # or HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Load and split documents
loader = TextLoader("documents.txt")
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
texts = text_splitter.split_documents(documents)

# Create vector store for semantic search
vectorstore = FAISS.from_documents(texts, embeddings)

# Create BM25 retriever for keyword search
bm25_retriever = BM25Retriever.from_documents(texts)
bm25_retriever.k = 10

# Create ensemble retriever (hybrid search)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
ensemble_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.7, 0.3]  # 70% semantic, 30% keyword
)

# Re-ranking with cross-encoder
reranker = HuggingFaceCrossEncoder(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2")

compressor = CrossEncoderReranker(
    model=reranker,
    top_n=5
)

# Context compression: keep only the top re-ranked results
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=ensemble_retriever
)

# Alternative: LLM-based compression
llm_compressor = LLMChainExtractor.from_llm(llm)
compression_retriever_llm = ContextualCompressionRetriever(
    base_compressor=llm_compressor,
    base_retriever=ensemble_retriever
)

# Create QA chain with citations
prompt_template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

Context:
{context}

Question: {question}

Answer based on the context above. Include citations [1], [2], etc. for each source used.

Answer:"""

PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)

# Note: the "stuff" chain does not number the chunks it inserts, so [n]
# markers are only reliable when the context is numbered explicitly
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=compression_retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

# Query with advanced RAG
query = "Who created Python and when?"
result = qa_chain({"query": query})

print(f"Answer: {result['result']}")
print(f"\nSources:")
for i, doc in enumerate(result['source_documents'], 1):
    print(f"[{i}] {doc.page_content[:100]}...")
    print(f"    Metadata: {doc.metadata}\n")

# Custom advanced RAG class
from typing import Dict

from langchain.chains import LLMChain
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

class AdvancedRAG:
    def __init__(self, llm, embeddings, documents):
        self.llm = llm
        self.embeddings = embeddings

        # Split documents
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200
        )
        self.texts = text_splitter.split_documents(documents)

        # Create retrievers
        self.vectorstore = FAISS.from_documents(self.texts, embeddings)
        self.vector_retriever = self.vectorstore.as_retriever(search_kwargs={"k": 20})
        self.bm25_retriever = BM25Retriever.from_documents(self.texts)
        self.bm25_retriever.k = 20

        # Ensemble retriever
        self.ensemble_retriever = EnsembleRetriever(
            retrievers=[self.vector_retriever, self.bm25_retriever],
            weights=[0.7, 0.3]
        )

        # Re-ranker (LangChain wrapper around the cross-encoder model)
        reranker_model = HuggingFaceCrossEncoder(
            model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"
        )
        self.compressor = CrossEncoderReranker(model=reranker_model, top_n=5)

        # Compression retriever
        self.compression_retriever = ContextualCompressionRetriever(
            base_compressor=self.compressor,
            base_retriever=self.ensemble_retriever
        )

        # Prompt (reuses the module-level prompt_template defined above)
        self.prompt = PromptTemplate(
            template=prompt_template,
            input_variables=["context", "question"]
        )
        self.chain = LLMChain(llm=llm, prompt=self.prompt)

    def query(self, question: str) -> Dict:
        # Retrieve and compress
        docs = self.compression_retriever.get_relevant_documents(question)

        # Number each chunk so the [n] citations in the answer map to sources
        context = "\n\n".join([
            f"[{i+1}] {doc.page_content}"
            for i, doc in enumerate(docs)
        ])

        # Generate answer
        answer = self.chain.run(context=context, question=question)

        # Prepare sources
        sources = [
            {
                "index": i+1,
                "text": doc.page_content,
                "metadata": doc.metadata
            }
            for i, doc in enumerate(docs)
        ]

        return {
            "answer": answer,
            "sources": sources
        }

# Usage
rag = AdvancedRAG(llm, embeddings, documents)
result = rag.query("Who created Python and when?")
print(f"Answer: {result['answer']}")
print(f"\nSources: {len(result['sources'])} documents")

Key Improvements:

  1. Hybrid Search: Better recall (finds more relevant docs)
  2. Re-ranking: Better precision (ranks best docs higher)
  3. Compression: Reduces token usage, focuses on relevant parts
  4. Citations: Provides source attribution
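
Source attribution only works if the numbers in the answer can be mapped back to chunks. A minimal sketch, assuming the context was numbered [1], [2], ... before generation; format_context and cited_sources are hypothetical helpers:

```python
import re
from typing import Dict, List


def format_context(chunks: List[str]) -> str:
    """Number each chunk so the model can cite it as [n]."""
    return "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))


def cited_sources(answer: str, chunks: List[str]) -> Dict[int, str]:
    """Return {citation number: chunk} for markers that point at a real chunk."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return {n: chunks[n - 1] for n in sorted(cited) if 1 <= n <= len(chunks)}


chunks = ["alpha", "beta", "gamma"]
answer = "X is true [1]. Y follows [3][1]. Z is claimed [7]."
print(format_context(chunks))
print(cited_sources(answer, chunks))  # {1: 'alpha', 3: 'gamma'}
```

Markers that point outside the retrieved set, such as [7] here, are dropped; in practice they are worth logging as a sign of a hallucinated citation.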

Trade-offs:

  • More compute (multiple models)
  • Higher latency
  • Better quality

Q3: Implement agent memory (short-term and long-term).

Answer:

How Agent Memory Works:

Short-term: Recent conversation context (sliding window)

Long-term: Persistent storage of important information (vector DB)
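
The two tiers can be sketched with the standard library alone. This is an illustration only: TwoTierMemory is a hypothetical class, and word overlap stands in for the embedding similarity a vector DB would provide.

```python
from collections import deque
from typing import Deque, List, Tuple


class TwoTierMemory:
    def __init__(self, window: int = 4):
        # Short-term: the deque drops the oldest message once the window is full
        self.short_term: Deque[Tuple[str, str]] = deque(maxlen=window)
        # Long-term: kept until explicitly removed
        self.long_term: List[str] = []

    def add(self, role: str, text: str, keep: bool = False) -> None:
        self.short_term.append((role, text))
        if keep:
            self.long_term.append(text)

    def recall(self, query: str, k: int = 2) -> List[str]:
        """Rank long-term entries by word overlap with the query."""
        query_words = set(query.lower().split())
        scored = [(len(query_words & set(text.lower().split())), text)
                  for text in self.long_term]
        scored = [item for item in scored if item[0] > 0]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [text for _, text in scored[:k]]


mem = TwoTierMemory(window=2)
mem.add("user", "My name is Alice", keep=True)
mem.add("assistant", "Nice to meet you")
mem.add("user", "I prefer Python over JavaScript", keep=True)
print(list(mem.short_term))
print(mem.recall("which language do I prefer"))
```

After the third message the first one has left the window but can still be recalled from the long-term list.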

LangChain Implementation:

from typing import Optional

from langchain.memory import (
    ConversationBufferWindowMemory,
    ConversationSummaryMemory,
    VectorStoreRetrieverMemory,
)
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.chains import ConversationChain
from langchain.schema import Document

class AgentMemory:
    def __init__(self, llm, embeddings, max_short_term=10):
        self.llm = llm
        self.embeddings = embeddings

        # Short-term memory: sliding window over the last k exchanges.
        # return_messages=False yields a formatted string, which is what
        # the text prompts below interpolate.
        self.short_term_memory = ConversationBufferWindowMemory(
            k=max_short_term,
            return_messages=False
        )

        # Summarizer used to compress old messages (see summarize_old_messages)
        self.summary_memory = ConversationSummaryMemory(
            llm=llm,
            return_messages=True
        )

        # Long-term memory: Vector store for semantic search.
        # FAISS.from_texts needs at least one text; a non-empty placeholder is
        # used because some embedding backends reject empty input.
        self.long_term_store = FAISS.from_texts(
            ["[memory store initialized]"],
            embeddings
        )
        self.long_term_memory = VectorStoreRetrieverMemory(
            retriever=self.long_term_store.as_retriever(search_kwargs={"k": 3})
        )

    def add_message(self, role: str, content: str, metadata: Optional[dict] = None):
        """Add message to short-term memory"""
        if role == "user":
            self.short_term_memory.chat_memory.add_user_message(content)
        else:
            self.short_term_memory.chat_memory.add_ai_message(content)

        # Check if important for long-term storage
        if self._is_important(content):
            self.add_to_long_term(content, metadata)

    def _is_important(self, content: str) -> bool:
        """Determine if content should be stored long-term (keyword heuristic)"""
        important_keywords = ["remember", "important", "always", "never", "prefer", "name is"]
        return any(keyword in content.lower() for keyword in important_keywords)

    def add_to_long_term(self, content: str, metadata: Optional[dict] = None):
        """Store in long-term memory (vector store)"""
        # Create document with metadata
        doc = Document(
            page_content=content,
            metadata=metadata or {}
        )

        # Add to vector store
        self.long_term_store.add_documents([doc])

        # Update retriever
        self.long_term_memory.retriever = self.long_term_store.as_retriever(
            search_kwargs={"k": 3}
        )

    def get_full_context(self, query: str) -> str:
        """Combine short-term and long-term memories"""
        # Get short-term context
        short_term = self.short_term_memory.load_memory_variables({})

        # Get long-term relevant memories
        long_term = self.long_term_memory.load_memory_variables({"prompt": query})

        # Combine
        context = ""
        if long_term.get("history"):
            context += f"Relevant memories:\n{long_term['history']}\n\n"

        if short_term.get("history"):
            context += f"Recent conversation:\n{short_term['history']}"

        return context

    def summarize_old_messages(self):
        """Summarize the conversation so far and store the summary long-term"""
        messages = self.short_term_memory.chat_memory.messages
        if not messages:
            return
        # summary_memory never receives messages itself, so its own buffer
        # would be empty; use it only as a summarizer over the chat history
        summary = self.summary_memory.predict_new_summary(messages, "")
        if summary:
            self.add_to_long_term(summary, {"type": "summary"})

# Usage with LangChain chains
llm = OpenAI(temperature=0)
embeddings = OpenAIEmbeddings()

memory = AgentMemory(llm, embeddings, max_short_term=10)

# Add messages
memory.add_message("user", "My name is Alice")
memory.add_message("assistant", "Nice to meet you, Alice!")
memory.add_message("user", "I prefer Python over JavaScript")
memory.add_message("assistant", "Noted! I'll remember your preference for Python.")

# Create conversation chain with memory
conversation = ConversationChain(
    llm=llm,
    memory=memory.short_term_memory,
    verbose=True
)

# ConversationChain injects only the short-term window into its prompt
query = "What programming language do I like?"
response = conversation.predict(input=query)
print(response)

# To use long-term memories as well, pass the combined context explicitly
context = memory.get_full_context(query)
response_with_context = llm.invoke(f"{context}\n\nHuman: {query}\nAI:")
print(response_with_context)

# Alternative: Using ConversationSummaryBufferMemory
from langchain.memory import ConversationSummaryBufferMemory

summary_memory = ConversationSummaryBufferMemory(
    llm=llm,
    max_token_limit=1000,
    return_messages=True
)

# Add messages
summary_memory.chat_memory.add_user_message("My name is Alice")
summary_memory.chat_memory.add_ai_message("Nice to meet you, Alice!")
summary_memory.chat_memory.add_user_message("I prefer Python over JavaScript")

# Memory automatically summarizes when token limit is reached
conversation_with_summary = ConversationChain(
    llm=llm,
    memory=summary_memory,
    verbose=True
)

# Query
result = conversation_with_summary.predict(
    input="What programming language do I like?"
)
print(result)

Key Features:

  • Sliding window: Recent context always available
  • Semantic search: Find relevant past information
  • Importance filtering: Only store meaningful content
  • Summarization: Compress old conversations
  • Persistence: Save/load across sessions (the vector store can be written to disk and reloaded; not implemented in the code above)
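
The class above keeps everything in process memory. A minimal sketch of the persistence idea, assuming long-term memories are stored as plain dictionaries in a JSON file; save_memories, load_memories, and the file name are hypothetical:

```python
import json
import os
import tempfile
from typing import Dict, List


def save_memories(path: str, memories: List[Dict]) -> None:
    """Write long-term memories to disk as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(memories, f, ensure_ascii=False, indent=2)


def load_memories(path: str) -> List[Dict]:
    """Load memories from a previous session; empty list on first run."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return json.load(f)


path = os.path.join(tempfile.mkdtemp(), "memories.json")
first_run = load_memories(path)
save_memories(path, [{"text": "User prefers Python", "metadata": {"type": "preference"}}])
restored = load_memories(path)
print(first_run, restored)
```

With a vector store, the embeddings would need to be saved alongside the text or recomputed on load.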

Q4: Implement tool/function calling with error handling and retry logic.

Answer:

How Tool Calling Works:

The LLM decides which tool to call and with what arguments. The system executes the call and returns the result to the LLM, which then either calls another tool or answers.
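
The execution half of that loop can be sketched as a dispatcher over a tool registry. The JSON shape with "name" and "arguments" keys, the TOOLS registry, and execute_tool_call are assumptions of this example rather than any specific provider's format.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical registry mapping tool names to plain functions
TOOLS: Dict[str, Callable[..., Any]] = {
    "add": lambda a, b: a + b,
    "upper": lambda text: text.upper(),
}


def execute_tool_call(raw_call: str) -> str:
    """Run a JSON tool call and return a JSON result the model can read back."""
    try:
        call = json.loads(raw_call)
        name, args = call["name"], call.get("arguments", {})
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError) as exc:
        return json.dumps({"error": f"malformed tool call: {exc}"})
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool '{name}'"})
    try:
        return json.dumps({"result": TOOLS[name](**args)})
    except Exception as exc:  # report the failure instead of crashing the loop
        return json.dumps({"error": f"{type(exc).__name__}: {exc}"})


print(execute_tool_call('{"name": "add", "arguments": {"a": 2, "b": 3}}'))
print(execute_tool_call('{"name": "rm", "arguments": {}}'))
```

Every path returns a string, so the model always receives an observation it can react to.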

LangChain Implementation with Robust Error Handling:

import functools
import json
import time

from langchain.agents import AgentExecutor, AgentType, Tool, initialize_agent
from langchain.callbacks import get_openai_callback
from langchain.llms import OpenAI

# Define tools with error handling
def search_web(query: str, max_results: int = 5) -> str:
    """Search the web for information.

    Args:
        query: Search query (must be at least 3 characters)
        max_results: Maximum number of results (1-10)

    Returns:
        JSON string with search results
    """
    if not query or len(query) < 3:
        raise ValueError("Query must be at least 3 characters")
    if max_results < 1 or max_results > 10:
        raise ValueError("max_results must be between 1 and 10")

    # Simulated search
    results = [f"Result {i} for '{query}'" for i in range(max_results)]
    return json.dumps({"results": results})

def calculate(expression: str) -> str:
    """Evaluate mathematical expression.

    Args:
        expression: Mathematical expression to evaluate

    Returns:
        Result as string
    """
    try:
        # eval() with empty builtins is still unsafe for untrusted input, and
        # ast.literal_eval cannot evaluate expressions such as 0.15 * 240;
        # use a dedicated expression parser in production
        result = eval(expression, {"__builtins__": {}}, {})
        return str(float(result))
    except Exception as e:
        raise ValueError(f"Invalid expression: {e}")

def send_email(to: str, subject: str, body: str) -> str:
    """Send an email.

    Args:
        to: Email address (must contain @)
        subject: Email subject (cannot be empty)
        body: Email body

    Returns:
        Status message
    """
    if "@" not in to:
        raise ValueError("Invalid email address")
    if not subject:
        raise ValueError("Subject cannot be empty")

    # Simulated email sending
    return f"Email sent to {to} with subject '{subject}'"

def send_email_from_json(payload: str) -> str:
    """Adapter: Tool passes one string, so parse a JSON object and unpack it"""
    args = json.loads(payload)  # json.JSONDecodeError is a ValueError
    missing = {"to", "subject", "body"} - set(args)
    if missing:
        raise ValueError(f"Missing fields: {sorted(missing)}")
    return send_email(args["to"], args["subject"], args["body"])

# Retry wrapper applied to the tool functions themselves, so the agent
# actually uses the retrying versions
def with_retry(func, max_retries: int = 3):
    """Wrap a tool function with retries and exponential backoff"""
    @functools.wraps(func)
    def wrapper(*args, **kwargs) -> str:
        last_error = None
        for attempt in range(max_retries):
            try:
                return func(*args, **kwargs)
            except ValueError as e:
                # Invalid input will not succeed on retry; report it right away
                return f"Error: {e}"
            except Exception as e:
                last_error = str(e)
                if attempt < max_retries - 1:
                    # Exponential backoff: 1s, 2s, ...
                    time.sleep(2 ** attempt)
        return f"Error after {max_retries} attempts: {last_error}"
    return wrapper

# Create tools
tools = [
    Tool(
        name="search_web",
        func=with_retry(search_web),
        description="Search the web for information. Input should be a search query string."
    ),
    Tool(
        name="calculate",
        func=with_retry(calculate),
        description="Evaluate mathematical expressions. Input should be a valid Python expression."
    ),
    Tool(
        name="send_email",
        func=with_retry(send_email_from_json),
        description="Send an email. Input should be a JSON object with 'to', 'subject', and 'body' keys."
    ),
]

# Initialize agent with error handling
llm = OpenAI(temperature=0)

agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
    max_iterations=10,
    handle_parsing_errors=True,  # Handle parsing errors gracefully
    return_intermediate_steps=True
)

# Usage with error handling
try:
    # invoke() is needed here: run() supports only a single output key,
    # and return_intermediate_steps adds a second one
    result = agent.invoke({"input": "What is 15% of 240?"})
    print(f"Result: {result['output']}")
    print(f"Steps taken: {len(result['intermediate_steps'])}")
except Exception as e:
    print(f"Agent error: {e}")

# Alternative: Using OpenAI Functions (structured tool calling)
from langchain.chat_models import ChatOpenAI
from langchain.agents import create_openai_functions_agent
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder

chat_llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")

# Create function-calling agent
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant with access to tools."),
    ("user", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad"),
])

agent = create_openai_functions_agent(chat_llm, tools, prompt)
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    handle_parsing_errors=True,
    max_iterations=10
)

# Execute with callback for monitoring
with get_openai_callback() as cb:
    result = agent_executor.invoke({
        "input": "What is 15% of 240? Use the calculator tool."
    })
    print(f"Result: {result['output']}")
    print(f"Tokens used: {cb.total_tokens}")

# Custom tool executor with advanced error handling
from langchain.tools import StructuredTool
from pydantic import BaseModel, Field

class SearchInput(BaseModel):
    query: str = Field(description="Search query")
    max_results: int = Field(default=5, description="Maximum results", ge=1, le=10)

def search_with_validation(query: str, max_results: int = 5) -> str:
    """Search with Pydantic validation"""
    return search_web(query, max_results)

# Structured tool with validation
structured_tool = StructuredTool.from_function(
    func=search_with_validation,
    name="search",
    description="Search the web",
    args_schema=SearchInput
)

# Agent with structured tools
structured_agent = initialize_agent(
    tools=[structured_tool] + tools[1:],
    llm=llm,
    agent=AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

Key Features:

  • Schema validation: Tools define expected inputs
  • Retry logic: Automatic retries with exponential backoff
  • Error recovery: LLM attempts to fix invalid arguments
  • Detailed results: Track attempts and errors
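
Error recovery can be sketched as a validate-and-repair loop: invalid arguments are turned into feedback, and the model proposes again. Everything here is hypothetical for illustration (validate_search_args, call_with_repair, and a scripted proposer in place of an LLM).

```python
from typing import Any, Callable, Dict, List


def validate_search_args(args: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the args are valid."""
    errors = []
    query = args.get("query")
    if not isinstance(query, str) or len(query) < 3:
        errors.append("query must be a string of at least 3 characters")
    max_results = args.get("max_results", 5)
    if not isinstance(max_results, int) or not 1 <= max_results <= 10:
        errors.append("max_results must be an integer between 1 and 10")
    return errors


def call_with_repair(propose: Callable[[str], Dict[str, Any]],
                     tool: Callable[..., Any],
                     validate: Callable[[Dict[str, Any]], List[str]],
                     max_attempts: int = 3) -> Any:
    """Ask for arguments, feed validation errors back, and call the tool once valid."""
    feedback = ""
    for _ in range(max_attempts):
        args = propose(feedback)
        errors = validate(args)
        if not errors:
            return tool(**args)
        feedback = "; ".join(errors)
    raise ValueError(f"no valid arguments after {max_attempts} attempts: {feedback}")


def scripted_proposer(feedback: str) -> Dict[str, Any]:
    """Stand-in for an LLM: a bad first attempt, corrected after feedback."""
    if not feedback:
        return {"query": "ai", "max_results": 50}
    return {"query": "ai agents", "max_results": 5}


def fake_search(query: str, max_results: int = 5) -> str:
    return f"{max_results} results for {query}"


result = call_with_repair(scripted_proposer, fake_search, validate_search_args)
print(result)
```

Validation failures are handled by re-prompting rather than by timed retries, since repeating the same invalid call cannot succeed.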

Q5: How do you evaluate LLM outputs in production?

Answer:

Evaluation Strategies:

1. Automated Metrics

Implementation:

from typing import Dict, List

import numpy as np
from transformers import pipeline

class LLMEvaluator:
    def __init__(self, embedding_model):
        # embedding_model: any object with encode(list_of_str) -> vectors,
        # e.g. a sentence-transformers model
        self.embedding_model = embedding_model
        # Load the classifiers once; building a pipeline on every call is slow
        self.nli = pipeline("text-classification",
                            model="microsoft/deberta-base-mnli")
        self.toxicity = pipeline("text-classification",
                                 model="unitary/toxic-bert")

    def semantic_similarity(self, generated: str, reference: str) -> float:
        """Cosine similarity between generated and reference embeddings"""
        emb1 = self.embedding_model.encode([generated])[0]
        emb2 = self.embedding_model.encode([reference])[0]

        similarity = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
        return float(similarity)

    def factual_consistency(self, generated: str, context: str) -> float:
        """Check if generated text is entailed by the context (NLI)"""
        # Pass premise and hypothesis as a pair so the tokenizer inserts the
        # model's own separator tokens instead of a literal "[SEP]" string
        result = self.nli({"text": context, "text_pair": generated})
        if isinstance(result, list):  # output shape varies across versions
            result = result[0]

        # Return probability of entailment
        if result["label"].upper() == "ENTAILMENT":
            return result["score"]
        return 0.0

    def toxicity_score(self, text: str) -> float:
        """Measure toxicity of generated text"""
        result = self.toxicity(text)
        return result[0]["score"] if result[0]["label"] == "toxic" else 0.0

    def evaluate_batch(self, generations: List[Dict]) -> Dict:
        """Evaluate a batch of generations"""
        results = {
            "semantic_similarity": [],
            "factual_consistency": [],
            "toxicity": [],
            "length": []
        }

        for gen in generations:
            if "reference" in gen:
                results["semantic_similarity"].append(
                    self.semantic_similarity(gen["generated"], gen["reference"])
                )

            if "context" in gen:
                results["factual_consistency"].append(
                    self.factual_consistency(gen["generated"], gen["context"])
                )

            results["toxicity"].append(self.toxicity_score(gen["generated"]))
            results["length"].append(len(gen["generated"].split()))

        # Aggregate
        summary = {}
        for metric, values in results.items():
            if values:
                summary[metric] = {
                    "mean": np.mean(values),
                    "std": np.std(values),
                    "min": np.min(values),
                    "max": np.max(values)
                }

        return summary

2. LLM-as-Judge

Implementation:

import json
from typing import Dict

def llm_judge(generated: str, criteria: str, judge_llm) -> Dict:
    """Use LLM to evaluate another LLM's output.

    judge_llm is assumed to expose generate(prompt) -> str.
    """
    prompt = f"""Evaluate the following response based on these criteria:
{criteria}

Response to evaluate:
{generated}

Provide:
1. Score (1-10)
2. Reasoning
3. Specific issues (if any)

Format as JSON:
{{
    "score": <1-10>,
    "reasoning": "<explanation>",
    "issues": ["<issue1>", "<issue2>"]
}}"""

    response = judge_llm.generate(prompt)

    try:
        evaluation = json.loads(response)
        return evaluation
    except (json.JSONDecodeError, TypeError):
        # Score 0 marks a parse failure, not a real rating
        return {"score": 0, "reasoning": "Failed to parse", "issues": []}

# Usage (generated_text and judge_llm are assumed to be defined by the caller)
criteria = """
- Accuracy: Is the information correct?
- Relevance: Does it answer the question?
- Clarity: Is it easy to understand?
- Completeness: Does it cover all aspects?
"""

evaluation = llm_judge(generated_text, criteria, judge_llm)
print(f"Score: {evaluation['score']}/10")
print(f"Reasoning: {evaluation['reasoning']}")
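
Judge models often wrap the JSON in extra prose, so strict parsing of the whole response fails more often than it should. A sketch of a more tolerant parser, assuming the response contains a single JSON object; parse_judge_output is a hypothetical helper:

```python
import json
import re
from typing import Dict, Optional


def parse_judge_output(raw: str) -> Optional[Dict]:
    """Extract the outermost {...} span, parse it, and check the score is 1-10."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    score = data.get("score")
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    if not 1 <= score <= 10:
        return None
    data.setdefault("reasoning", "")
    data.setdefault("issues", [])
    return data


raw = 'Sure, here is my evaluation: {"score": 8, "reasoning": "clear and correct"} Hope this helps.'
parsed = parse_judge_output(raw)
print(parsed)
```

Returning None instead of a score of 0 keeps parse failures out of score averages; they can be counted separately as a judge reliability metric.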

3. Human Feedback Loop

Implementation:

from datetime import datetime
from typing import Dict, List

import numpy as np

class FeedbackCollector:
    def __init__(self):
        # In-memory list for illustration; use a database in production
        self.feedback_db = []

    def collect_feedback(self, query: str, response: str,
                        rating: int, comments: str = ""):
        """Collect user feedback"""
        feedback = {
            "timestamp": datetime.now().isoformat(),
            "query": query,
            "response": response,
            "rating": rating,  # 1-5 stars
            "comments": comments
        }
        self.feedback_db.append(feedback)

    def get_low_rated_samples(self, threshold: int = 3) -> List[Dict]:
        """Get samples rated below the threshold"""
        return [f for f in self.feedback_db if f["rating"] < threshold]

    def analyze_feedback(self) -> Dict:
        """Analyze feedback trends"""
        if not self.feedback_db:
            return {}

        ratings = [f["rating"] for f in self.feedback_db]

        return {
            "total_samples": len(self.feedback_db),
            "average_rating": np.mean(ratings),
            "rating_distribution": {
                i: ratings.count(i) for i in range(1, 6)
            },
            "low_rated_count": len(self.get_low_rated_samples())
        }

4. A/B Testing

Implementation:

import hashlib
from typing import Dict

import numpy as np
from scipy import stats

class ABTester:
    def __init__(self):
        self.results = {"A": [], "B": []}

    def get_variant(self, user_id: str) -> str:
        """Assign user to a variant deterministically"""
        # Built-in hash() is salted per process for strings, so it would not
        # give stable assignments across restarts; use a fixed digest instead
        digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
        return "A" if int(digest, 16) % 2 == 0 else "B"

    def log_result(self, variant: str, metric: float):
        """Log result for variant"""
        self.results[variant].append(metric)

    def analyze(self) -> Dict:
        """Statistical analysis of A/B test"""
        a_results = self.results["A"]
        b_results = self.results["B"]

        # A t-test needs at least two samples per variant
        if len(a_results) < 2 or len(b_results) < 2:
            return {"error": "Insufficient data"}

        # Welch's t-test (does not assume equal variances)
        t_stat, p_value = stats.ttest_ind(a_results, b_results, equal_var=False)
        significant = bool(p_value < 0.05)

        # Declare a winner only when the difference is significant
        winner = None
        if significant:
            winner = "B" if np.mean(b_results) > np.mean(a_results) else "A"

        return {
            "variant_A": {
                "mean": np.mean(a_results),
                "std": np.std(a_results),
                "count": len(a_results)
            },
            "variant_B": {
                "mean": np.mean(b_results),
                "std": np.std(b_results),
                "count": len(b_results)
            },
            "p_value": p_value,
            "significant": significant,
            "winner": winner
        }

Production Evaluation Checklist:

  • ✅ Automated metrics (similarity, consistency, toxicity)
  • ✅ LLM-as-judge for qualitative assessment
  • ✅ Human feedback collection
  • ✅ A/B testing for improvements
  • ✅ Monitoring dashboards
  • ✅ Alert thresholds
  • ✅ Regular manual review
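
An alert threshold can be as simple as a rolling mean over recent quality scores. A minimal sketch, where RollingAlert, the window size, and the 0.7 threshold are all illustrative assumptions:

```python
from collections import deque
from statistics import mean


class RollingAlert:
    def __init__(self, window: int, min_mean: float):
        self.values = deque(maxlen=window)  # keeps only the latest scores
        self.min_mean = min_mean

    def record(self, score: float) -> bool:
        """Add a score; return True when a full window averages below the threshold."""
        self.values.append(score)
        window_full = len(self.values) == self.values.maxlen
        return window_full and mean(self.values) < self.min_mean


alert = RollingAlert(window=3, min_mean=0.7)
flags = [alert.record(score) for score in [0.9, 0.8, 0.9, 0.3, 0.2]]
print(flags)  # [False, False, False, True, True]
```

Waiting for a full window avoids alerting on the first few low scores after a deploy.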

Summary

Medium-level LLM/Agent topics:

  • ReAct agents: Reasoning + acting pattern
  • Advanced RAG: Hybrid search, re-ranking, compression
  • Agent memory: Short-term and long-term storage
  • Tool calling: Robust execution with error handling
  • Evaluation: Automated metrics, LLM-as-judge, human feedback

Key Skills:

  • Design agent architectures
  • Optimize RAG pipelines
  • Handle errors gracefully
  • Evaluate quality in production
  • Iterate based on feedback
