LLM Evaluation

Automated metrics (perplexity, BLEU, ROUGE), LLM-as-judge, benchmarks (MMLU, HumanEval, HELM), A/B testing, eval frameworks (LangSmith, Ragas)

~45 min

Evaluating LLM applications is fundamentally harder than evaluating traditional ML models. There's no single accuracy number --- the quality of a response depends on correctness, helpfulness, safety, style, and context. This lesson covers the full spectrum of evaluation techniques from automated metrics to human evaluation frameworks.

Why Evaluation Matters

Without robust evaluation, you're flying blind:

  • Before deployment: Is this model good enough for production?
  • During development: Did my prompt change improve or hurt quality?
  • In production: Is the system degrading over time? Are users satisfied?
  • Model selection: Which model offers the best quality-cost-latency tradeoff?

Automated Reference-Based Metrics

    These metrics compare generated text against reference (ground truth) text.

    Perplexity

    Measures how "surprised" the model is by the text. Lower = better.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def compute_perplexity(text: str, model_name: str = "gpt2") -> float:
    """Compute perplexity of text under a language model."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    encodings = tokenizer(text, return_tensors="pt")
    input_ids = encodings.input_ids

    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)
        loss = outputs.loss  # Cross-entropy loss

    perplexity = torch.exp(loss).item()
    return perplexity

# Lower perplexity = more fluent/natural text
text_good = "The cat sat on the mat and watched the birds outside."
text_bad = "Mat the on sat cat the birds outside watched and."

ppl_good = compute_perplexity(text_good)
ppl_bad = compute_perplexity(text_bad)
print(f"Good text perplexity: {ppl_good:.2f}")  # ~30-50
print(f"Bad text perplexity: {ppl_bad:.2f}")    # ~200-500
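The arithmetic behind the code is simple: perplexity is the exponential of the average negative log-probability per token. A minimal, model-free sketch (`perplexity_from_logprobs` is an illustrative helper, not a library function):

```python
import math

def perplexity_from_logprobs(token_logprobs: list[float]) -> float:
    """Perplexity = exp(mean negative log-probability per token)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# If the model assigned every token probability 1/100,
# perplexity comes out to exactly 100.
uniform = [math.log(1 / 100)] * 5
print(round(perplexity_from_logprobs(uniform), 2))  # 100.0
```

Intuitively, a perplexity of 100 means the model was as uncertain as if it were choosing uniformly among 100 tokens at each step.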

    BLEU Score

    Measures n-gram overlap between generated and reference text (originally for machine translation):

from nltk.translate.bleu_score import sentence_bleu, corpus_bleu, SmoothingFunction

reference = "The cat is sitting on the mat".split()
candidate = "The cat sat on the mat".split()

# BLEU-4 (considers 1-gram through 4-gram overlap)
smooth = SmoothingFunction().method1
score = sentence_bleu(
    [reference],  # Can have multiple references
    candidate,
    smoothing_function=smooth,
)
print(f"BLEU score: {score:.4f}")  # 0-1, higher is better

# Corpus-level BLEU
references_list = [
    [["The", "cat", "sat", "on", "the", "mat"]],
    [["A", "dog", "is", "running", "in", "the", "park"]],
]
candidates_list = [
    ["The", "cat", "is", "on", "the", "mat"],
    ["The", "dog", "runs", "in", "the", "park"],
]

corpus_score = corpus_bleu(references_list, candidates_list, smoothing_function=smooth)
print(f"Corpus BLEU: {corpus_score:.4f}")
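Under the hood, BLEU combines clipped n-gram precision with a brevity penalty. A from-scratch sketch of the unigram case (`bleu1` is a hypothetical helper, not the NLTK API) shows why repeating a matching word does not inflate the score:

```python
import math
from collections import Counter

def bleu1(candidate: list[str], reference: list[str]) -> float:
    """Clipped unigram precision times the brevity penalty (BLEU-1)."""
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    # Clipping: a candidate token only earns credit up to its count in the reference
    clipped = sum(min(c, ref_counts[tok]) for tok, c in cand_counts.items())
    precision = clipped / len(candidate)
    # Brevity penalty punishes candidates shorter than the reference
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * precision

# "the" matches only once, despite appearing three times in the candidate
print(bleu1("the the the".split(), "the cat sat".split()))  # ~0.33
```

Full BLEU-4 takes the geometric mean of the clipped 1- through 4-gram precisions before applying the brevity penalty.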

    ROUGE Score

    Measures recall-oriented overlap (originally for summarization):

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rougeL"],
    use_stemmer=True,
)

reference = "The cat sat on the mat and watched the birds outside the window."
hypothesis = "A cat was sitting on a mat, watching birds through the window."

scores = scorer.score(reference, hypothesis)

for metric, score in scores.items():
    print(f"{metric}: Precision={score.precision:.3f}, Recall={score.recall:.3f}, F1={score.fmeasure:.3f}")

The three variants differ in what they count:

  • rouge1: unigram overlap
  • rouge2: bigram overlap
  • rougeL: longest common subsequence
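ROUGE-L needs no fixed n-gram size because it is built on the longest common subsequence. A self-contained sketch of the F1 variant (`rouge_l_f1` is an illustrative reimplementation, without the stemming the rouge_score package applies):

```python
def rouge_l_f1(reference: str, hypothesis: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Dynamic-programming longest common subsequence
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref):
        for j, h in enumerate(hyp):
            dp[i + 1][j + 1] = dp[i][j] + 1 if r == h else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"))
```

Because the LCS preserves word order but tolerates gaps, ROUGE-L rewards summaries that keep the reference's sequence of ideas even when words are dropped in between.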

    Limitations of Automated Metrics

    BLEU and ROUGE only measure surface-level text overlap. They cannot assess semantic correctness, helpfulness, or reasoning quality. A response saying 'The feline rested upon the rug' would score low against 'The cat sat on the mat' despite having the same meaning. These metrics are useful for translation and summarization but are poor proxies for general LLM quality. Use them as one signal among many, not as the sole evaluator.
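The failure mode is easy to demonstrate with a crude overlap metric (`unigram_precision` is a toy stand-in for BLEU-1): a perfect paraphrase scores barely above zero.

```python
def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also appear in the reference."""
    cand = candidate.lower().split()
    ref = set(reference.lower().split())
    return sum(tok in ref for tok in cand) / len(cand)

reference = "the cat sat on the mat"
paraphrase = "the feline rested upon the rug"  # same meaning, different words

# Only the two "the" tokens overlap: 2/6
print(round(unigram_precision(paraphrase, reference), 2))  # 0.33
```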

    LLM-as-Judge

    Use a powerful LLM to evaluate the outputs of another LLM. This is currently the most practical approach for evaluating open-ended generation.

    Single-Answer Grading

from openai import OpenAI
from pydantic import BaseModel, Field

client = OpenAI()

class EvalScore(BaseModel):
    relevance: int = Field(description="1-5 score for relevance to the question")
    accuracy: int = Field(description="1-5 score for factual accuracy")
    completeness: int = Field(description="1-5 score for completeness of the answer")
    clarity: int = Field(description="1-5 score for clarity of explanation")
    reasoning: str = Field(description="Brief explanation of the scores")

def llm_judge(question: str, answer: str, reference: str | None = None) -> EvalScore:
    """Use GPT-4o as a judge to evaluate an answer."""
    system_prompt = """You are an expert evaluator. Score the given answer on a 1-5 scale for each criterion.

Scoring guide: 1 = Very poor, 2 = Poor, 3 = Adequate, 4 = Good, 5 = Excellent

Be critical but fair. Provide brief reasoning for your scores."""

    user_prompt = f"Question: {question}\n\nAnswer to evaluate: {answer}"
    if reference:
        user_prompt += f"\n\nReference answer: {reference}"

    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        response_format=EvalScore,
    )
    return response.choices[0].message.parsed

# Evaluate a good and a bad answer
question = "What is gradient descent?"
good_answer = (
    "Gradient descent is an optimization algorithm that iteratively adjusts model "
    "parameters by computing the gradient of the loss function and moving in the "
    "direction that reduces the loss. The learning rate controls the step size."
)
bad_answer = "Gradient descent is when the computer learns stuff by going down a hill."

eval_good = llm_judge(question, good_answer)
eval_bad = llm_judge(question, bad_answer)

print(f"Good answer: R={eval_good.relevance} A={eval_good.accuracy} Co={eval_good.completeness} Cl={eval_good.clarity}")
print(f"Bad answer:  R={eval_bad.relevance} A={eval_bad.accuracy} Co={eval_bad.completeness} Cl={eval_bad.clarity}")

    Pairwise Comparison

class PairwiseResult(BaseModel):
    winner: str = Field(description="'A', 'B', or 'tie'")
    reasoning: str = Field(description="Why this answer is better")
    confidence: float = Field(description="Confidence 0-1")

def pairwise_judge(question: str, answer_a: str, answer_b: str) -> PairwiseResult:
    """Compare two answers and pick the better one."""
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Compare two answers to the same question. Pick the better one or call it a tie. Be specific about why."},
            {"role": "user", "content": f"Question: {question}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}"},
        ],
        response_format=PairwiseResult,
    )
    return response.choices[0].message.parsed

# Mitigate position bias: run both orderings and check consistency
result_ab = pairwise_judge(question, good_answer, bad_answer)
result_ba = pairwise_judge(question, bad_answer, good_answer)

print(f"A vs B: Winner={result_ab.winner}, Confidence={result_ab.confidence:.2f}")
print(f"B vs A: Winner={result_ba.winner}, Confidence={result_ba.confidence:.2f}")
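One way to act on the two orderings is to accept a winner only when both runs agree, and fall back to a tie otherwise. A small aggregation helper (`consistent_winner` is a hypothetical name, sketched under the assumption that verdicts are the strings "A", "B", or "tie"):

```python
def consistent_winner(winner_ab: str, winner_ba: str) -> str:
    """Combine two pairwise verdicts run in opposite orderings.

    winner_ab was judged with the candidates in (A, B) order; winner_ba
    with the order flipped, so its "A" refers to the original B.
    Disagreement is treated as a tie (likely position bias).
    """
    flip = {"A": "B", "B": "A", "tie": "tie"}
    if winner_ab == flip[winner_ba]:
        return winner_ab
    return "tie"

print(consistent_winner("A", "B"))  # both orderings favored the same answer -> A
print(consistent_winner("A", "A"))  # the judge followed position, not content -> tie
```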

    LLM-as-Judge Best Practices

  • Use the strongest available model as the judge (GPT-4o or Claude 3.5 Sonnet).
  • Mitigate position bias by running pairwise comparisons in both orderings.
  • Use structured output (JSON/Pydantic) for consistent scoring.
  • Include a rubric in the system prompt so the judge scores consistently.
  • For critical evaluations, use multiple judge models and take the consensus.
  • Always validate your judge against human judgments on a sample set.

    Standard Benchmarks

    MMLU (Massive Multitask Language Understanding)

    Tests knowledge across 57 subjects (STEM, humanities, social sciences, etc.):

    # Example MMLU question
    question = {
        "subject": "machine_learning",
        "question": "Which of the following is NOT a common activation function?",
        "choices": ["ReLU", "Sigmoid", "Softmax", "Gradient"],
        "answer": 3,  # "Gradient" is not an activation function
    }

# Running MMLU evaluation (reuses the OpenAI client defined earlier)
from datasets import load_dataset

# Load a subject
ds = load_dataset("cais/mmlu", "machine_learning", split="test")

correct = 0
total = 0

for example in ds:
    prompt = f"""Question: {example['question']}
A. {example['choices'][0]}
B. {example['choices'][1]}
C. {example['choices'][2]}
D. {example['choices'][3]}

Answer with just the letter (A, B, C, or D):"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,
    )

    predicted = response.choices[0].message.content.strip().upper()
    expected = chr(65 + example["answer"])  # 0 -> A, 1 -> B, etc.

    if predicted == expected:
        correct += 1
    total += 1

print(f"MMLU Machine Learning: {correct}/{total} = {correct/total:.2%}")

    HumanEval (Code Generation)

    Tests code generation with 164 Python programming problems:

# Example HumanEval problem
problem = {
    "task_id": "HumanEval/0",
    "prompt": '''from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Check if in given list of numbers, any two numbers are closer
    to each other than the given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
''',
    "test": """
assert has_close_elements([1.0, 2.0, 3.0], 0.5) == False
assert has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) == True
""",
}

The key metric is pass@k: the probability that at least one of k generated samples passes all unit tests.
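pass@k is usually computed with the unbiased estimator from the HumanEval paper: generate n >= k samples per problem, count the c that pass, and estimate 1 - C(n-c, k) / C(n, k). A short sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: P(at least one of k drawn samples passes),
    given n generated samples of which c passed."""
    if n - c < k:
        return 1.0  # every size-k draw must include a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(10, 3, 1), 2))  # 0.3 -- with 3/10 passing, pass@1 is 30%
print(round(pass_at_k(10, 3, 5), 3))
```

Sampling n > k and estimating this way gives much lower variance than literally drawing k samples and checking for a pass.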

    Other Important Benchmarks

Benchmark  | Tests                                          | Key Metric
---------- | ---------------------------------------------- | ------------------
MMLU       | Knowledge across 57 subjects                   | Accuracy
HumanEval  | Python code generation                         | pass@1, pass@10
HELM       | Holistic eval (accuracy, fairness, robustness) | Multi-metric
GSM8K      | Grade school math reasoning                    | Accuracy
MT-Bench   | Multi-turn conversation quality                | Judge score 1-10
TruthfulQA | Resistance to common misconceptions            | % truthful
BIG-bench  | 200+ diverse reasoning tasks                   | Per-task accuracy

    A/B Testing for LLM Applications

import random
import statistics
import time
from dataclasses import dataclass

@dataclass
class ABTestResult:
    variant: str  # "A" or "B"
    query: str
    response: str
    latency_ms: float
    judge_score: float
    user_rating: float | None = None

class LLMABTest:
    def __init__(self, model_a: str, model_b: str, split_ratio: float = 0.5):
        self.model_a = model_a
        self.model_b = model_b
        self.split_ratio = split_ratio
        self.results: list[ABTestResult] = []

    def assign_variant(self) -> str:
        """Randomly assign a user to variant A or B."""
        return "A" if random.random() < self.split_ratio else "B"

    def run_query(self, query: str) -> ABTestResult:
        """Route query to the assigned model and evaluate."""
        variant = self.assign_variant()
        model = self.model_a if variant == "A" else self.model_b

        start = time.perf_counter()
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": query}],
        )
        latency = (time.perf_counter() - start) * 1000

        answer = response.choices[0].message.content

        # Auto-evaluate with the LLM judge
        judge_result = llm_judge(query, answer)
        avg_score = (judge_result.relevance + judge_result.accuracy
                     + judge_result.completeness + judge_result.clarity) / 4

        result = ABTestResult(
            variant=variant,
            query=query,
            response=answer,
            latency_ms=latency,
            judge_score=avg_score,
        )
        self.results.append(result)
        return result

    def analyze(self):
        """Analyze A/B test results."""
        a_results = [r for r in self.results if r.variant == "A"]
        b_results = [r for r in self.results if r.variant == "B"]

        print(f"\nA/B Test Results ({len(self.results)} total queries)")
        print("=" * 50)
        print(f"Model A ({self.model_a}): {len(a_results)} queries")
        print(f"  Avg Judge Score: {statistics.mean(r.judge_score for r in a_results):.2f}")
        print(f"  Avg Latency: {statistics.mean(r.latency_ms for r in a_results):.0f}ms")

        print(f"\nModel B ({self.model_b}): {len(b_results)} queries")
        print(f"  Avg Judge Score: {statistics.mean(r.judge_score for r in b_results):.2f}")
        print(f"  Avg Latency: {statistics.mean(r.latency_ms for r in b_results):.0f}ms")

# Usage
test = LLMABTest("gpt-4o-mini", "gpt-4o")
queries = ["Explain RAG", "What is RLHF?", "How do transformers work?"]
for q in queries:
    test.run_query(q)
test.analyze()
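Comparing raw means is not enough with small samples; judge noise can easily produce a spurious gap. A bootstrap confidence interval on the score difference (a standard-library sketch; `bootstrap_diff_ci` is a hypothetical helper) indicates whether the gap is likely real:

```python
import random
import statistics

def bootstrap_diff_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Bootstrap CI for mean(B) - mean(A); if it excludes 0, B likely differs from A."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement and record the mean difference
        resample_a = [rng.choice(scores_a) for _ in scores_a]
        resample_b = [rng.choice(scores_b) for _ in scores_b]
        diffs.append(statistics.mean(resample_b) - statistics.mean(resample_a))
    diffs.sort()
    lo = diffs[int(alpha / 2 * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Judge scores from two variants (made-up numbers for illustration)
a = [3.5, 4.0, 3.0, 3.75, 3.5, 4.0, 3.25, 3.5]
b = [4.0, 4.5, 3.75, 4.25, 4.0, 4.5, 4.0, 4.25]
print(bootstrap_diff_ci(a, b))
```

A classical two-sample t-test works too; the bootstrap is shown because it makes no normality assumption about judge scores.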

    Evaluation Frameworks

    LangSmith

    LangSmith provides tracing, monitoring, and evaluation for LLM applications:

from langsmith import Client
from langsmith.evaluation import evaluate

ls_client = Client()

# Create a dataset of test cases
dataset = ls_client.create_dataset("rag-eval-set")

ls_client.create_examples(
    inputs=[
        {"question": "What is RAG?"},
        {"question": "How does attention work?"},
        {"question": "What is fine-tuning?"},
    ],
    outputs=[
        {"answer": "RAG is Retrieval-Augmented Generation, a technique that grounds LLM responses in retrieved documents."},
        {"answer": "Attention computes weighted sums of value vectors based on query-key compatibility."},
        {"answer": "Fine-tuning adapts a pre-trained model to a specific task by training on task-specific data."},
    ],
    dataset_id=dataset.id,
)

# Define your application as a function
def my_rag_app(inputs: dict) -> dict:
    question = inputs["question"]
    # Your RAG pipeline here
    answer = rag_chain.invoke(question)
    return {"answer": answer}

# Define custom evaluators
def correctness_evaluator(run, example):
    """Check if the answer is factually correct using an LLM judge."""
    prediction = run.outputs["answer"]
    reference = example.outputs["answer"]

    judge_response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Is this answer factually consistent with the reference?\n\nAnswer: {prediction}\nReference: {reference}\n\nRespond with just 'yes' or 'no'.",
        }],
    )

    is_correct = "yes" in judge_response.choices[0].message.content.lower()
    return {"key": "correctness", "score": 1.0 if is_correct else 0.0}

# Run evaluation
results = evaluate(
    my_rag_app,
    data="rag-eval-set",
    evaluators=[correctness_evaluator],
    experiment_prefix="rag-v1",
)

print(results)

    Building a Custom Eval Pipeline

import json
from datetime import datetime
from pathlib import Path

class EvalPipeline:
    """A lightweight evaluation pipeline for LLM applications."""

    def __init__(self, name: str):
        self.name = name
        self.results = []
        self.timestamp = datetime.now().isoformat()

    def add_test_case(self, question: str, expected: str, tags: list[str] | None = None):
        """Add a test case to the pipeline."""
        self.results.append({
            "question": question,
            "expected": expected,
            "tags": tags or [],
            "actual": None,
            "scores": {},
        })

    def run(self, app_fn, evaluators: dict):
        """Run all test cases through the app and evaluators."""
        for i, case in enumerate(self.results):
            print(f"Running test {i+1}/{len(self.results)}: {case['question'][:50]}...")

            # Get the app's response
            actual = app_fn(case["question"])
            case["actual"] = actual

            # Run each evaluator
            for eval_name, eval_fn in evaluators.items():
                score = eval_fn(case["question"], actual, case["expected"])
                case["scores"][eval_name] = score

        return self

    def summary(self):
        """Print evaluation summary."""
        print(f"\nEvaluation: {self.name}")
        print(f"Timestamp: {self.timestamp}")
        print("=" * 60)

        all_scores = {}
        for case in self.results:
            for metric, score in case["scores"].items():
                all_scores.setdefault(metric, []).append(score)

        for metric, scores in all_scores.items():
            avg = sum(scores) / len(scores)
            print(f"  {metric}: {avg:.3f} (min={min(scores):.3f}, max={max(scores):.3f})")

        # Per-tag breakdown
        tags = set()
        for case in self.results:
            tags.update(case["tags"])

        if tags:
            print("\nPer-tag breakdown:")
            for tag in sorted(tags):
                tag_cases = [c for c in self.results if tag in c["tags"]]
                for metric in all_scores:
                    tag_scores = [c["scores"][metric] for c in tag_cases]
                    avg = sum(tag_scores) / len(tag_scores)
                    print(f"  [{tag}] {metric}: {avg:.3f} ({len(tag_cases)} cases)")

    def save(self, path: str):
        """Save results to JSON."""
        output = {
            "name": self.name,
            "timestamp": self.timestamp,
            "results": self.results,
        }
        Path(path).write_text(json.dumps(output, indent=2))
        print(f"Results saved to {path}")
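The evaluators passed to run() are plain functions with the signature (question, actual, expected) -> score. Two minimal deterministic examples (`exact_match` and `keyword_coverage` are illustrative names) that plug straight into that dict:

```python
def exact_match(question: str, actual: str, expected: str) -> float:
    """1.0 if the normalized answer equals the reference exactly."""
    return 1.0 if actual.strip().lower() == expected.strip().lower() else 0.0

def keyword_coverage(question: str, actual: str, expected: str) -> float:
    """Fraction of reference tokens that appear somewhere in the answer."""
    keywords = set(expected.lower().split())
    if not keywords:
        return 0.0
    return sum(1 for kw in keywords if kw in actual.lower()) / len(keywords)

# Would be wired up as:
# pipeline.run(app_fn, evaluators={"exact": exact_match, "coverage": keyword_coverage})
print(keyword_coverage("q", "RAG grounds answers in retrieved documents", "rag documents"))  # 1.0
```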

    The Evaluation Hierarchy

Build your evaluation stack in layers:

  1. Unit tests --- deterministic checks for format, length, required fields.
  2. Automated metrics --- BLEU/ROUGE for translation/summarization, exact match for factual QA.
  3. LLM-as-judge --- for open-ended quality assessment.
  4. Human evaluation --- the gold standard for subjective quality, but expensive and slow.

Start with layers 1-2 for fast iteration, add layer 3 for CI/CD, and use layer 4 for periodic deep audits.
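Layer 1 deserves emphasis because it is free and fast. A sketch of deterministic format checks (`check_format` is a hypothetical helper) that can gate every response before any expensive judging:

```python
def check_format(answer: str, max_words: int = 150, required: tuple[str, ...] = ()) -> dict:
    """Layer-1 deterministic checks: non-empty, within length, contains required terms."""
    words = answer.split()
    return {
        "non_empty": bool(words),
        "within_length": len(words) <= max_words,
        "has_required": all(term.lower() in answer.lower() for term in required),
    }

result = check_format(
    "RAG grounds LLM answers in retrieved documents.",
    max_words=50,
    required=("RAG", "retrieved"),
)
print(result)  # all three checks pass
```

Checks like these run in microseconds, so they can execute on every CI commit and on every production response, leaving the slower layers for sampled traffic.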