Testing AI Agents: Deterministic Evaluation in a Non-Deterministic World


How to Test AI Agents with Deterministic Evaluation
- Separate deterministic agent logic (tool routing, parsing) from LLM output tests using Pytest markers.
- Install DeepEval and Ragas with pinned versions in a dedicated requirements-eval.txt.
- Define evaluation datasets as question / expected answer / context triples versioned in source control.
- Configure scored metrics (faithfulness, relevancy, hallucination) with minimum threshold scores per metric.
- Run deterministic tests on every commit and LLM evaluation tests on merges to main or nightly.
- Average evaluation scores across 3+ runs to absorb non-deterministic variance.
- Integrate threshold-based quality gates into CI/CD to block deployment when scores drop.
Testing AI agents has become one of the most pressing unsolved problems in production software engineering. Teams deploy LLM-powered agents to handle customer support, code generation, document retrieval, and autonomous workflows, yet in our experience, most of these systems ship without meaningful test coverage. The reason is straightforward: traditional testing assumes deterministic behavior, and large language models do not comply. What follows is a practical, framework-driven approach to building repeatable evaluation pipelines around inherently non-deterministic agent behavior, using Pytest, DeepEval, and Ragas.
Table of Contents
- Why Traditional Unit Tests Fail for AI Agents
- Rethinking Testing: Evaluation Over Assertion
- Building a Repeatable Evaluation Pipeline with Pytest + DeepEval
- Scaling Evaluation with Ragas for RAG Agents
- The Agent CI/CD Pipeline: Making It Automatic
- Checklist: The Agent CI/CD Pipeline
- What to Do Monday Morning
Why Traditional Unit Tests Fail for AI Agents
The Determinism Assumption
Conventional software tests rest on a simple contract: given input X, the system produces output Y, every time. A unit test calls a function, compares the return value to an expected result, and passes or fails. This model works because the underlying code is deterministic.
LLMs violate this assumption at multiple levels. Temperature and top-p sampling introduce controlled randomness into token selection, meaning the same prompt can yield different phrasing, structure, or even factual content across invocations. Model updates from providers (OpenAI, Anthropic, Google) can silently shift output distributions. Minor wording changes or differences in context window content produce materially different responses. Even at temperature zero, some inference APIs do not guarantee bitwise-identical outputs due to floating-point non-determinism in batched computation, as OpenAI has documented.
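To make the role of temperature concrete, the toy sketch below rescales logits by temperature before softmax sampling. This is an illustration of the mechanism only, not any provider's actual decoding implementation: at temperature zero it degenerates to a deterministic argmax, while any positive temperature makes the pick a weighted draw.

```python
import math
import random

def sample_with_temperature(logits: dict[str, float], temperature: float,
                            rng: random.Random) -> str:
    """Toy temperature sampling: rescale logits, softmax, draw one token."""
    if temperature == 0:
        # Greedy decoding: always the highest-logit token, hence deterministic.
        return max(logits, key=logits.get)
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    z = max(scaled.values())  # subtract max for numerical stability
    weights = {tok: math.exp(v - z) for tok, v in scaled.items()}
    total = sum(weights.values())
    tokens = list(weights)
    probs = [weights[t] / total for t in tokens]
    return rng.choices(tokens, weights=probs)[0]

# At temperature 0 the pick never varies; at temperature 1.5 it can differ per run.
greedy = {sample_with_temperature({"Paris": 2.0, "paris": 1.8}, 0, random.Random(s))
          for s in range(20)}
sampled = {sample_with_temperature({"Paris": 2.0, "paris": 1.8}, 1.5, random.Random(s))
           for s in range(20)}
```

Raising the temperature flattens the probability distribution over tokens, which is exactly why the same prompt can produce different phrasing across invocations.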
What Breaks When You assertEqual on LLM Output
Consider a retrieval-augmented agent asked "What is the capital of France?" Three valid responses might be "The capital of France is Paris," "Paris is the capital of France," and "Paris." All are correct. A string equality assertion would accept at most one of these and reject the rest as failures. The result is flaky tests that erode confidence in the test suite, generating false negatives that teams learn to ignore. Agent unit testing cannot survive on brittle string matching. The fundamental shift required is from "is this the right answer?" to "is this a good enough answer?"
```python
import pytest

def get_agent_response(query: str) -> str:
    """Invoke the agent and return its string response.

    Replace the body with your actual agent call, e.g.:
        from my_agent import agent
        return agent.run(query)
    """
    pytest.skip("Agent implementation not wired — set get_agent_response body before running.")

def test_exact_match_fails():
    response = get_agent_response("What is the capital of France?")
    # This assertion will fail intermittently — the agent may phrase it differently each run
    assert response == "The capital of France is Paris."
```
With a non-zero temperature setting, this test will likely fail intermittently across multiple runs as the agent rephrases its response. The output is correct but not identical, and the test cannot distinguish between a regression and a paraphrase.
Rethinking Testing: Evaluation Over Assertion
From Binary Pass/Fail to Scored Evaluation
LLM evaluation replaces exact-match assertions with scored quality dimensions. Rather than checking whether an output matches a string, evaluation frameworks measure properties like faithfulness (is the response grounded in provided context?), relevance (does it address the query?), coherence (is it logically structured?), and hallucination (does the agent fabricate information not present in any source?). Each dimension yields a score between 0.0 and 1.0, and you set a threshold against each score. Tests still pass or fail, but the pass condition is a quality threshold rather than an exact value.
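The shift can be illustrated without any LLM judge at all. In the sketch below, a crude lexical similarity from the standard library stands in for a real scored metric (this is not how DeepEval computes scores, just the shape of the pattern): paraphrases clear a threshold that exact-match comparison would reject.

```python
from difflib import SequenceMatcher

def similarity_score(candidate: str, reference: str) -> float:
    """Crude 0.0-1.0 scorer: character-run overlap via difflib."""
    return SequenceMatcher(None, candidate.lower(), reference.lower()).ratio()

reference = "The capital of France is Paris."
paraphrase = "Paris is the capital of France."

THRESHOLD = 0.6
score = similarity_score(paraphrase, reference)

# Exact match rejects the paraphrase outright...
assert paraphrase != reference
# ...while the scored check accepts it as "good enough".
assert score >= THRESHOLD
```

Real evaluation frameworks replace the toy scorer with an LLM judge, but the pass condition keeps this exact form: a score compared against a threshold.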
The Three Layers of Agent Testing
Not everything in an agent system requires LLM evaluation. A practical testing strategy recognizes three distinct layers:
- Deterministic logic -- tool call routing, argument parsing, response formatting, and state machine transitions. Traditional unit tests work perfectly here. Run these on every commit.
- LLM output quality -- where evaluation frameworks like DeepEval measure faithfulness, relevance, and hallucination. These tests call an external judge model, so they are slower and carry real cost.
- End-to-end agent behavior -- scenario-based evaluation suites that simulate multi-turn conversations or complex tool-use sequences. These validate that the full pipeline holds together, not just individual outputs.
Layers 2 and 3 are where the new tooling earns its value.
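As a concrete Layer 1 example, here is a sketch of a deterministic routing test; route_tool is a hypothetical stand-in for your agent's real dispatch logic, not part of any framework.

```python
# Hypothetical tool router — substitute your agent's actual dispatch logic.
def route_tool(query: str) -> str:
    """Deterministically map a user query to a tool name."""
    q = query.lower()
    if any(word in q for word in ("weather", "forecast", "temperature")):
        return "weather_api"
    if any(word in q for word in ("calculate", "sum of", "multiply")):
        return "calculator"
    return "web_search"

# Ordinary pytest tests — no LLM call, so these run on every commit.
def test_routes_weather_queries():
    assert route_tool("What's the forecast for Berlin?") == "weather_api"

def test_falls_back_to_search():
    assert route_tool("Who wrote Dune?") == "web_search"
```

Because nothing here touches a model, these tests are fast, free, and never flaky: exactly the properties Layer 1 should have.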
Building a Repeatable Evaluation Pipeline with Pytest + DeepEval
Prerequisites
Before running any evaluation tests, ensure the following:
- Python ≥ 3.10 (examples below use 3.11)
- An OpenAI API key (or another supported LLM provider key). DeepEval uses an external LLM (default: OpenAI GPT-4) as a judge to score metrics. Set OPENAI_API_KEY in your environment, or configure via deepeval login.
- Pinned dependencies. Create a requirements-eval.txt in your repository:
```text
# requirements-eval.txt — pin to the versions you have tested against
deepeval==1.1.6
ragas==0.1.21
datasets==2.18.0
pytest==8.1.1
```
Note: The version numbers above are illustrative. Pin to whichever versions you have verified your test suite against, and update deliberately. For supply-chain safety, consider generating a hash-pinned lock file with pip-compile --generate-hashes from pip-tools and installing with pip install --require-hashes -r requirements-eval.txt.
Install with:
pip install -r requirements-eval.txt
Setting Up DeepEval with Pytest
DeepEval integrates directly as a Pytest plugin, which means Python teams can adopt it without abandoning their existing test runner, CI configuration, or reporting infrastructure. Test cases use familiar Pytest conventions with the addition of metric objects and an assert_test function that evaluates LLM outputs against scored criteria.
```python
# conftest.py
import os
import pytest

# Guard against missing deepeval — allows Layer 1 deterministic tests to run
# even when deepeval is not installed.
try:
    import deepeval  # Registers DeepEval Pytest plugin automatically.
except ImportError:
    pytest.skip(
        "deepeval not installed — skipping LLM evaluation tests. "
        "Run: pip install -r requirements-eval.txt",
        allow_module_level=True,
    )

def pytest_configure(config):
    config.addinivalue_line(
        "markers",
        "llm_eval: marks tests that call external LLM APIs (deselect with -m 'not llm_eval')",
    )

@pytest.fixture(autouse=False, scope="session")
def require_openai_key():
    """Fail fast with an actionable message if OPENAI_API_KEY is absent."""
    key = os.environ.get("OPENAI_API_KEY", "").strip()
    if not key:
        pytest.fail(
            "OPENAI_API_KEY is not set. Export it before running evaluation tests:\n"
            "    export OPENAI_API_KEY=sk-..."
        )
    yield
```
```python
# eval_utils.py
import signal
from contextlib import contextmanager
from typing import Generator

class EvalTimeout(Exception):
    pass

@contextmanager
def evaluation_timeout(seconds: int = 120) -> Generator[None, None, None]:
    """Raise EvalTimeout if the block exceeds `seconds`.

    Note: signal.SIGALRM is Unix-only. On Windows CI, replace with
    concurrent.futures.ThreadPoolExecutor with a timeout parameter.
    """
    def _handler(signum, frame):
        raise EvalTimeout(f"LLM evaluation call exceeded {seconds}s timeout.")

    old_handler = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)
        signal.signal(signal.SIGALRM, old_handler)
```
```python
# test_agent_basic.py (Layer 2 — LLM output quality)
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from eval_utils import evaluation_timeout, EvalTimeout

@pytest.mark.llm_eval
def test_agent_response_quality(require_openai_key):
    test_case = LLMTestCase(
        input="What are the benefits of solar energy?",
        actual_output=(
            "Solar energy reduces electricity bills and lowers carbon emissions. "
            "It is a renewable resource that requires minimal maintenance."
        ),
        retrieval_context=[
            "Solar energy is a renewable resource. It reduces electricity bills, "
            "lowers carbon footprint, and solar panels require low maintenance.",
            "Wind energy is another renewable source unrelated to solar panels.",  # distractor
        ],
    )
    relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
    faithfulness_metric = FaithfulnessMetric(threshold=0.8)
    try:
        with evaluation_timeout(120):
            assert_test(test_case, [relevancy_metric, faithfulness_metric])
    except EvalTimeout as exc:
        pytest.fail(str(exc))
```
Run this with deepeval test run test_agent_basic.py (which wraps Pytest with additional DeepEval output) or pytest test_agent_basic.py (standard Pytest output). The test passes if both metric scores meet or exceed their thresholds.
To run only deterministic Layer 1 tests without LLM calls:
pytest -m "not llm_eval"
To run only LLM evaluation tests:
pytest -m "llm_eval"
Key Metrics for Agent Quality Assurance
Four metrics form the baseline for agent quality assurance. Answer Relevancy scores how directly the response addresses the user's query, penalizing off-topic or tangential content. If a claim in the response cannot be traced back to the provided context, Faithfulness catches it. Hallucination goes further: it explicitly flags fabricated entities, facts, or figures that appear in the output but exist in no source material. Tool Correctness rounds out the set by verifying that the agent invoked the right tool with the correct parameters, which matters most for agents that call APIs, query databases, or execute code. See the DeepEval documentation for ToolCorrectnessMetric implementation details.
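To make the idea behind Tool Correctness concrete, here is a toy, set-based sketch of the scoring shape. This is a simplification for intuition, not DeepEval's ToolCorrectnessMetric implementation, and the tool names are invented:

```python
def tool_correctness(expected_tools: list[str], tools_called: list[str]) -> float:
    """Toy score: fraction of expected tool invocations the agent actually made."""
    if not expected_tools:
        return 1.0  # nothing was required, so nothing can be wrong
    hits = sum(1 for tool in expected_tools if tool in tools_called)
    return hits / len(expected_tools)

score = tool_correctness(
    expected_tools=["search_flights", "book_flight"],
    tools_called=["search_flights"],  # agent searched but forgot to book
)
assert score == 0.5  # fails a strict threshold of 1.0
```

A real metric would also check call order and arguments; the point is that tool use, like text quality, reduces to a score gated by a threshold.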
```python
# Layer 2 — Faithfulness and Hallucination evaluation
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric, HallucinationMetric
from eval_utils import evaluation_timeout, EvalTimeout

@pytest.mark.llm_eval
def test_rag_agent_faithfulness_and_hallucination(require_openai_key):
    context = [
        "Acme Corp reported Q3 2024 revenue of $4.2 billion, up 12% year-over-year.",
        "The company's operating margin was 18.5%.",
    ]
    test_case = LLMTestCase(
        input="Summarize Acme Corp's Q3 2024 financial performance.",
        actual_output=(
            "Acme Corp achieved Q3 2024 revenue of $4.2 billion, "
            "a 12% increase year-over-year, with an operating margin of 18.5%."
        ),
        retrieval_context=context,
        context=context,  # HallucinationMetric evaluates against `context`
    )
    faithfulness = FaithfulnessMetric(threshold=0.85)
    # HallucinationMetric scores the PROPORTION of response statements that
    # are hallucinated (0.0 = none hallucinated). It is a lower-is-better
    # metric: the test passes when score <= threshold, so threshold=0.0
    # demands zero hallucination (strictest). Relax to e.g. 0.1 only if
    # minor paraphrase is acceptable. Note this is the OPPOSITE direction of
    # higher-is-better metrics like faithfulness — verify the exact scoring
    # direction against your pinned deepeval version.
    hallucination = HallucinationMetric(threshold=0.0)
    try:
        with evaluation_timeout(120):
            assert_test(test_case, [faithfulness, hallucination])
    except EvalTimeout as exc:
        pytest.fail(str(exc))
```
If the agent's response had included an invented claim such as "net income rose 20%," the hallucination score would rise above the strict threshold of 0.0 (a higher score means more fabricated content detected), and the test would fail.
Scaling Evaluation with Ragas for RAG Agents
When Your Agent Retrieves and Generates
Retrieval-augmented generation agents introduce a two-stage quality problem. The retrieval step must surface relevant context from a vector store or search index, and the generation step must synthesize that context into an accurate response. A failure at either stage degrades the final output, but the failure modes differ: poor retrieval feeds irrelevant context to the LLM, while poor generation ignores or misrepresents good context. Ragas provides evaluation primitives designed for this dual challenge: context precision (what fraction of retrieved contexts are relevant), context recall (were all necessary contexts retrieved), and answer correctness (does the final answer match the expected ground truth).
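As a mental model, the two retrieval metrics reduce to precision and recall over the retrieved set. Ragas itself computes them with an LLM judge over statements rather than exact membership, so the sketch below (with invented document IDs) is only the underlying intuition:

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved contexts that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for ctx in retrieved if ctx in relevant) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the necessary contexts that retrieval surfaced."""
    if not relevant:
        return 1.0
    return sum(1 for ctx in relevant if ctx in retrieved) / len(relevant)

retrieved = ["doc_tides", "doc_recipes"]       # one relevant, one noise
relevant = {"doc_tides", "doc_moon_gravity"}   # one retrieved, one missed
assert context_precision(retrieved, relevant) == 0.5  # half of what we fetched was useful
assert context_recall(retrieved, relevant) == 0.5     # half of what we needed was fetched
```

Low precision means the generator is fed noise; low recall means it is starved of facts. The distinction tells you whether to tune the retriever's ranking or its coverage.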
Integrating Ragas into Your Test Suite
Ragas evaluations operate on datasets of question/answer/context triples, making them well-suited to batch evaluation alongside Pytest. Combining Ragas metrics with DeepEval's per-test-case approach covers both sides of the RAG problem: Ragas evaluates retrieval precision and recall, while DeepEval scores generation faithfulness and hallucination on individual cases.
Important: Ragas underwent significant API changes across versions. The code below targets ragas==0.1.21. If you use a different version, verify the import paths and evaluate() return type against that version's documentation.
```python
# test_rag_pipeline.py
# pip install ragas==0.1.21 datasets==2.18.0
import pytest
from ragas import evaluate
from ragas.metrics import context_precision, context_recall, answer_correctness
from datasets import Dataset

# Per-metric minimum thresholds — tune to your domain requirements.
METRIC_THRESHOLDS = {
    "context_precision": 0.80,
    "context_recall": 0.75,
    "answer_correctness": 0.80,
}

@pytest.mark.llm_eval
def test_rag_pipeline_with_ragas(require_openai_key):
    eval_data = {
        "question": [
            "What is photosynthesis?",
            "What causes tides?",
            "What is the speed of light?",
        ],
        "answer": [
            "Photosynthesis is the process by which plants convert sunlight into energy.",
            "Tides are caused by the gravitational pull of the moon and sun on Earth's oceans.",
            "The speed of light is approximately 299,792 kilometers per second in a vacuum.",
        ],
        "contexts": [
            ["Photosynthesis is a process used by plants to convert light energy into chemical energy."],
            ["Tides result from the gravitational forces exerted by the moon and the sun."],
            ["Light travels at approximately 299,792 km/s in a vacuum."],
        ],
        "ground_truth": [
            "Photosynthesis converts sunlight into chemical energy in plants.",
            "The gravitational pull of the moon and sun causes ocean tides.",
            "The speed of light in a vacuum is about 299,792 km/s.",
        ],
    }
    dataset = Dataset.from_dict(eval_data)
    result = evaluate(
        dataset,
        metrics=[context_precision, context_recall, answer_correctness],
    )
    # result.to_pandas() is the stable public API in ragas==0.1.x.
    # Column names match metric object names.
    scores_df = result.to_pandas()
    for metric_name, threshold in METRIC_THRESHOLDS.items():
        if metric_name not in scores_df.columns:
            pytest.fail(f"Expected metric column '{metric_name}' not found in Ragas result.")
        mean_score = float(scores_df[metric_name].mean())
        assert mean_score >= threshold, (
            f"{metric_name} mean score {mean_score:.3f} is below threshold {threshold:.2f}"
        )
```
This test fails the entire suite if any Ragas metric drops below its configured threshold, providing a hard quality gate on the retrieval-generation pipeline. Each metric has its own threshold in METRIC_THRESHOLDS to reflect that context recall and answer correctness may have different acceptable floors in your domain.
The Agent CI/CD Pipeline: Making It Automatic
Where Evaluation Fits in CI/CD
Deterministic logic tests (Layer 1) belong in the fast feedback loop, running on every commit. LLM evaluation tests (Layers 2 and 3) are slower -- expect 5 to 30 seconds per test case depending on judge model latency -- and carry real API costs. A rough formula: multiply the number of test cases by the average tokens per evaluation (prompt + completion, often 2,000-4,000 tokens) by your provider's per-token price. For GPT-4 as judge at current pricing, a 50-case suite might cost $1-3 per run. Estimate this before enabling evaluation on every PR; for large suites, consider using GPT-3.5-turbo or a local judge model to reduce per-run cost. Quality gates block deployment when scores drop below configured minimums.
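The cost formula above translates directly into a few lines. The prices and token counts here are assumed placeholders, so substitute your provider's current rates:

```python
def eval_run_cost(num_cases: int, avg_tokens_per_case: int,
                  usd_per_1k_tokens: float) -> float:
    """Rough per-run cost: cases x tokens per case x price per token."""
    return num_cases * avg_tokens_per_case / 1000 * usd_per_1k_tokens

# 50 cases x ~3,000 judge tokens each, at an assumed blended $0.02 / 1K tokens
cost = eval_run_cost(50, 3000, 0.02)
assert abs(cost - 3.0) < 1e-9  # lands in the $1-3 range cited above
```

Running this arithmetic before wiring evaluation into every PR makes the "merges to main or nightly" trade-off an explicit budget decision rather than a surprise on the invoice.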
Handling Flakiness and Score Variance
Non-determinism does not disappear just because thresholds replace exact matches. Evaluation scores vary between runs. The practical mitigation: run evaluations at least three times (the minimum needed to compute a stable mean) and average the scores, then set acceptable variance bands rather than hard cutoffs. DeepEval does not natively average across runs; implement by running deepeval test run N times and aggregating the JSON output files, then asserting on the mean. Version evaluation datasets in source control alongside prompt templates so you can correlate changes to either with metric movements.
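One way to implement that averaging step is sketched below. The flat JSON shape is an assumption, so adapt the parsing to whatever your pinned deepeval version actually writes:

```python
import json
from pathlib import Path
from statistics import mean

def average_runs(runs: list[dict[str, float]]) -> dict[str, float]:
    """Average per-metric scores across N evaluation runs."""
    return {metric: mean(run[metric] for run in runs) for metric in runs[0]}

def mean_scores(run_files: list[Path]) -> dict[str, float]:
    """Load one JSON file per run — assumed shape: {"faithfulness": 0.91, ...}."""
    return average_runs([json.loads(path.read_text()) for path in run_files])

def assert_thresholds(averages: dict[str, float], thresholds: dict[str, float]) -> None:
    """Gate on the means, not on any single noisy run."""
    for metric, floor in thresholds.items():
        assert averages[metric] >= floor, (
            f"{metric} mean {averages[metric]:.3f} below floor {floor:.2f}"
        )
```

Asserting on means rather than individual runs is what turns a noisy score into a usable quality gate.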
```yaml
# .github/workflows/agent-eval.yml
name: Agent Evaluation
on:
  push:
    branches:
      - main
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements-eval.txt
      - run: deepeval test run tests/eval/ --verbose
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```
Note on secrets: Ensure OPENAI_API_KEY is scoped to evaluation-only usage with the minimum required permissions. If the secret is absent, deepeval will raise an AuthenticationError. Consider adding a pre-flight step to verify the key is set before running the evaluation suite.
GitHub Actions will fail the job automatically if deepeval test run exits with a non-zero code (the default behavior for any step), so no additional exit-code checking is required.
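A minimal pre-flight check could look like the sketch below. The script name and function names are hypothetical; invoke it as an extra workflow step (for example `python check_env.py`) before the evaluation run:

```python
# check_env.py — hypothetical pre-flight script: fail fast when secrets are missing.
import os

def missing_env_vars(required: list[str]) -> list[str]:
    """Return the names of required environment variables that are unset or blank."""
    return [name for name in required if not os.environ.get(name, "").strip()]

def preflight(required: tuple[str, ...] = ("OPENAI_API_KEY",)) -> int:
    """Return a shell-style exit code: 0 when all secrets are present, 1 otherwise."""
    missing = missing_env_vars(list(required))
    if missing:
        print(f"Missing required secrets: {', '.join(missing)}")
        return 1
    return 0
```

Failing in a dedicated step produces a clearer CI error than an AuthenticationError buried in the evaluation logs.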
```ini
# pytest.ini
[pytest]
markers =
    llm_eval: LLM evaluation tests (slow, requires OPENAI_API_KEY)
```
Checklist: The Agent CI/CD Pipeline
- ☐ Pin all evaluation dependencies in requirements-eval.txt
- ☐ Separate deterministic tests (tool calls, parsing, routing) from evaluation tests using pytest.mark.llm_eval
- ☐ Define evaluation dataset with question / expected answer / context triples
- ☐ Configure DeepEval metrics: relevancy, faithfulness, hallucination
- ☐ Configure Ragas metrics for RAG pipelines: context precision, context recall
- ☐ Set minimum threshold scores per metric (e.g., faithfulness ≥ 0.85)
- ☐ Run deterministic tests on every commit (pytest -m "not llm_eval")
- ☐ Run evaluation tests on push to main or nightly schedule (pytest -m "llm_eval")
- ☐ Average scores across 3+ runs to reduce variance (aggregate JSON output and assert on means)
- ☐ Version evaluation datasets alongside prompt templates
- ☐ Set CI/CD quality gate to block deploy on threshold failure
- ☐ Log and track metric trends over time for regression detection
What to Do Monday Morning
Traditional unit tests remain essential for agent logic: tool call routing, argument parsing, and state transitions. They do not cover the quality of LLM-generated outputs. DeepEval plugs into the same pytest CLI, uses the same assertion patterns, and runs in the same CI configuration your team already maintains, which makes adoption straightforward. Ragas extends evaluation to RAG pipelines specifically, measuring context precision, context recall, and answer correctness. Together, CI/CD integration with threshold-based quality gates turns evaluation from an optional manual step into an enforceable deployment requirement.
Teams with domain-specific quality requirements should build custom evaluation metrics -- regulatory compliance scoring or tone consistency, for example. Human reviewers calibrate scores that automated metrics miss, especially for subjective quality dimensions. Running evaluation metrics continuously against production traffic -- treating evaluation as observability -- catches degradation between scheduled test runs in real time.
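As one illustration of such a custom metric, here is a toy compliance screen following the same score-and-threshold pattern used throughout this article. The banned phrases and scoring rule are invented for the example, not a production scorer:

```python
# Toy domain metric: fraction of sentences free of banned compliance phrases.
BANNED_PHRASES = ("guaranteed returns", "risk-free")

def compliance_score(response: str) -> float:
    """Score 1.0 when no sentence contains a banned phrase, lower otherwise."""
    sentences = [s.strip() for s in response.split(".") if s.strip()]
    if not sentences:
        return 1.0
    clean = sum(
        1 for s in sentences
        if not any(phrase in s.lower() for phrase in BANNED_PHRASES)
    )
    return clean / len(sentences)

risky = "This fund offers guaranteed returns. Past performance varies."
assert compliance_score(risky) == 0.5   # one of two sentences is non-compliant
assert compliance_score("Past performance varies.") == 1.0
```

Wrapping a scorer like this in your evaluation framework's custom-metric interface gives domain rules the same CI gating as faithfulness or relevancy.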