Tags: LLMOps, DeepEval, CI/CD, Testing, AI Engineering, GitHub Actions

LLMOps: Production-Grade Testing for AI Applications

How to move beyond 'vibes-based' testing with DeepEval, CI/CD tiering, and pragmatic quality gates for AI agents.

The first time your AI agent hallucinates in production, you'll understand why "it works on my machine" isn't a testing strategy.

Traditional software testing is deterministic: given input X, expect output Y. But LLMs are probabilistic. The same prompt can yield different responses. Worse, the quality of a response - whether it's helpful, grounded in facts, or completely fabricated - isn't something you can check with a simple assertEquals.

This post documents what we learned building production-grade testing pipelines for two recent projects: Signale (a WhatsApp bot powered by a multi-agent system) and an AI Shopping Assistant. Both required catching regressions before deployment without burning through our entire CI budget.

The Problem: "Vibes-Based" AI Testing

Most AI projects start the same way: engineers run a few manual tests, eyeball the responses, and declare it "good enough." This works until:

  1. Silent regressions - A prompt change breaks edge cases nobody manually tested
  2. Hallucination drift - The model starts confidently stating false information
  3. Context window bloat - Response quality degrades as you add more tools

You can't catch these with unit tests. And running comprehensive LLM evaluations on every commit is prohibitively expensive.

The Solution: Hybrid Assertion Strategy

We use DeepEval (an open-source LLM evaluation framework) combined with a pragmatic tiering strategy. The key insight: not all tests need to run all the time.

Two Types of Assertions

1. Technical Assertions (Fast, Cheap)

# These run on every commit - no LLM calls needed
assert response.status_code == 200
payload = response.json()
assert "products" in payload
products = payload["products"]
assert len(products) >= 2
assert any("smartwatch" in p["name"].lower() for p in products)

2. Semantic Assertions (Slow, Expensive)

# These use LLM-as-a-Judge - run selectively
# See: https://docs.confident-ai.com/docs/metrics-introduction
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
 
relevancy = AnswerRelevancyMetric(threshold=0.7)
faithfulness = FaithfulnessMetric(threshold=0.9)
 
# Does the answer actually help the user?
# Is the answer supported by the retrieved context?

Technical assertions catch broken code. Semantic assertions catch broken reasoning.
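
Wiring those two metrics into an actual test is only a few lines. A minimal sketch, assuming a hypothetical run_agent() helper and the documents your RAG pipeline retrieved (retrieved_docs):

from deepeval import assert_test
from deepeval.test_case import LLMTestCase
 
# run_agent() and retrieved_docs are placeholders for your agent call and its RAG context
query = "Which smartwatch works with an iPhone 14?"
test_case = LLMTestCase(
    input=query,
    actual_output=run_agent(query),
    retrieval_context=retrieved_docs,  # FaithfulnessMetric grades against this
)
assert_test(test_case, metrics=[relevancy, faithfulness])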

Beyond Generic Metrics: Custom Judges with GEval

Out-of-the-box metrics like AnswerRelevancyMetric and FaithfulnessMetric are useful, but they're often logic-blind. Consider our shopping assistant: if a user says "I have an iPhone 14" and asks for smartwatch recommendations, a standard relevancy metric might score a Samsung Galaxy Watch highly - it is a smartwatch, so it's "relevant" to the query. But it fails the business logic: Samsung watches are incompatible with iOS. The metric sees keywords; it doesn't understand ecosystem constraints.

This is why real-world agents need context-specific evaluation. A shopping assistant has different "correct" behavior than a code reviewer, and generic metrics can't encode that domain knowledge.

This is where GEval shines. It lets you define custom LLM judges with domain-specific criteria:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
 
# Custom judge for Turn 2: Context + Ecosystem Handling
turn_2_eval = GEval(
    name="Context Handling",
    criteria="""
    1. The agent MUST acknowledge the user's constraint ("iPhone 14").
    2. Compatibility Check:
       - Primary recommendation should be Apple Watch.
       - Garmin, Fitbit are ACCEPTABLE (iOS compatible).
       - Samsung Galaxy Watch 4/5/6/7 MUST NOT be recommended (iOS incompatible).
       - Disclaimer Tolerance: Mentioning incompatible products WITH a warning is acceptable.
    3. It is ACCEPTABLE to either:
       a) Show relevant Apple-compatible products immediately.
       b) Ask a follow-up question (budget, usage) to narrow the search.
    """,
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model="gpt-4o-mini",  # or your preferred judge model
    threshold=0.7,
)
 
# agent_response holds the agent's reply to this turn of the conversation
test_case = LLMTestCase(input="I have an iPhone 14.", actual_output=agent_response)
turn_2_eval.measure(test_case)

Why this matters:

  • Noise Tolerance - RAG retrieval often surfaces irrelevant items. A generic metric would fail. A custom judge understands that retrieved context having noise is OK as long as the actual output filters it.
  • Flexible Logic - "Ask a follow-up question OR show products" are both valid behaviors. Hard assertions would fail one; a custom judge accepts both.
  • Persona Enforcement - Inject your agent's personality prompt into the evaluation criteria to verify tone consistency (see the sketch below).
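
For example, persona enforcement is just another GEval judge whose criteria embed the agent's system prompt. A sketch (PERSONA_PROMPT is a stand-in for your actual prompt):

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams
 
# Stand-in persona prompt - substitute your agent's real system prompt
PERSONA_PROMPT = "You are a friendly, concise shopping assistant. No jargon, no hard selling."
 
persona_eval = GEval(
    name="Persona Consistency",
    criteria=f"""
    The response must match this persona: {PERSONA_PROMPT}
    Penalize replies that are overly technical, curt, or off-brand in tone.
    """,
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    model="gpt-4o-mini",
    threshold=0.7,
)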

Choosing the Right Judge Model

You don't need GPT-4 to judge your tests. For CI pipelines, smaller, faster, cheaper models are ideal - they're good enough for pass/fail decisions and won't blow your budget.

Think of it this way: you wouldn't put a Ferrari engine in a robot vacuum that runs daily. It's overkill, expensive, and the vacuum doesn't need 800 horsepower to pick up dust. Same logic applies here - during heavy development, you're deploying multiple times per day. Running 48+ LLM evaluations on each deploy doesn't need frontier-model reasoning.

For this project, we used Mistral's Ministral 8B:

JUDGE_MODEL = "ministral-8b-2512"  # $0.15/1M tokens input & output
| Model        | Input (per 1M) | Output (per 1M) | Context | Good For                 |
|--------------|----------------|-----------------|---------|--------------------------|
| Ministral 8B | $0.15          | $0.15           | 256k    | CI/CD judge, high volume |
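
DeepEval's string model names target OpenAI, so a non-OpenAI judge like Ministral goes in via a small custom model wrapper. A minimal sketch, assuming the mistralai v1 SDK and MISTRAL_API_KEY in the environment (error handling omitted):

import os
from mistralai import Mistral
from deepeval.models import DeepEvalBaseLLM
 
class MistralJudge(DeepEvalBaseLLM):
    def __init__(self, model_name: str = "ministral-8b-2512"):
        self.model_name = model_name
        self.client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])
 
    def load_model(self):
        return self.client
 
    def generate(self, prompt: str) -> str:
        resp = self.client.chat.complete(
            model=self.model_name,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
 
    async def a_generate(self, prompt: str) -> str:
        # Keep it simple for CI: delegate async calls to the sync path
        return self.generate(prompt)
 
    def get_model_name(self) -> str:
        return self.model_name
 
# Pass it wherever a judge is needed, e.g. GEval(..., model=MistralJudge())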

A note on judge quality: There's a known concern with LLM-as-a-Judge: smaller models can be more lenient or miss subtle logic errors that frontier models catch. I'm aware of this trade-off. For ci-core (daily gating), Ministral 8B is the gatekeeper - it's fast, cheap, and catches obvious regressions. For periodic Deep Audits or Golden Dataset validation, the judge promotes to a frontier model (GPT-4o or Claude Sonnet). This isn't just cost optimization - it's matching judge capability to the type of failure you're hunting.
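
One simple way to wire that promotion (the variable name is arbitrary, not something DeepEval prescribes) is to read the judge from an environment variable, so a Deep Audit run only has to override one value:

import os
 
# ci-core leaves JUDGE_MODEL unset and gets the cheap default;
# Deep Audit / Golden Dataset runs export JUDGE_MODEL pointing at a frontier model
JUDGE_MODEL = os.getenv("JUDGE_MODEL", "ministral-8b-2512")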

CI/CD Tiering: Pragmatism Over Purity

Here's where we separate production engineers from hobbyists: you don't run everything on every deploy.

The Two-Tier System

| Tier       | Folder                     | When         | Tests               | Time    |
|------------|----------------------------|--------------|---------------------|---------|
| ci-core    | tests/integration/ci_core/ | Every deploy | Critical paths only | 5-7 min |
| Full Suite | tests/integration/         | Opt-in       | All scenarios       | 30+ min |

ci-core contains only tests that guard the critical path:

  • Intent routing (does the agent understand the user?)
  • Code quality (does the generated code work?)
  • API contract (does the client break?)
  • False positive prevention (no irrelevant results?)

Everything else lives in the full E2E suite, triggered manually or on major changes.
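
On disk the split is nothing fancier than two folders (the individual file names below are illustrative):

tests/
  integration/
    ci_core/                  # runs on every deploy
      test_intent_routing.py
      test_code_quality.py
      test_api_contract.py
    test_full_scenarios.py    # full E2E suite, opt-in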

GitHub Actions Implementation

# .github/workflows/deploy-backend.yml
on:
  workflow_dispatch:
    inputs:
      skip_integration_tests:
        description: 'Skip Integration Tests'
        default: false
        type: boolean
      run_all_integration_tests:
        description: 'Run ALL integration tests (default: ci-core only)'
        default: false
        type: boolean
 
jobs:
  run_integration_tests:
    # Honors the break-glass input: skip only for documented P0 incidents
    if: ${{ !inputs.skip_integration_tests }}
    runs-on: ubuntu-latest
    steps:
      - name: Run Evaluation Integration Tests (DeepEval)
        run: |
          if [ "${{ inputs.run_all_integration_tests }}" == "true" ]; then
            echo "Running ALL integration tests..."
            uv run pytest tests/integration -svx
          else
            echo "Running CI-CORE integration tests only..."
            uv run pytest tests/integration/ci_core -svx
          fi

This gives us three modes:

  1. Default deploy: Run ci-core only (fast feedback)
  2. Full validation: Run everything (before major releases)
  3. Break-glass protocol: Skip tests entirely - reserved for documented P0 incidents where the risk of downtime outweighs the risk of a regression

Infrastructure: Pragmatic Security for PoC Delivery

Running LLM tests in CI requires real infrastructure access - databases, embedding models, LLM endpoints. This creates a classic networking paradox: tests need to access a private database (Azure PostgreSQL) to validate RAG logic, but CI runners (GitHub Actions) are hosted on unpredictable public IP addresses.

Dynamic Firewall Whitelisting

For this PoC, I implemented a dynamic whitelisting approach:

# ... 
 
- name: Add Firewall Rule for Runner IP
  run: |
    IP_ADDR=$(curl -s ipv4.icanhazip.com)
    az postgres flexible-server firewall-rule create \
      --name ${{ env.POSTGRES_SERVER_NAME }} \
      --rule-name "ci-allow-${{ github.run_id }}" \
      --start-ip-address $IP_ADDR \
      --end-ip-address $IP_ADDR
 
- name: Run Integration Tests (DeepEval)
  run: uv run pytest tests/integration/ci_core -svx
 
- name: Remove Firewall Rule
  if: always()  # Runs on SUCCESS or FAILURE - no holes left behind
  run: |
    az postgres flexible-server firewall-rule delete \
      --name ${{ env.POSTGRES_SERVER_NAME }} \
      --rule-name "ci-allow-${{ github.run_id }}" --yes
 
# ... 

The workflow:

  1. Detect - The CI runner fetches its own public IPv4 at runtime
  2. Authorize - Azure CLI creates a temporary firewall rule for that specific IP
  3. Clean Up - The if: always() hook ensures the rule is deleted even if tests fail, minimizing the attack surface window

Why this was the pragmatic choice:

  • Speed of Delivery - We delivered a fully tested system in 5 weeks. Setting up VNets and private endpoints for a PoC adds days of overhead and significant extra cloud costs.
  • Cost Efficiency - Public-facing managed runners are cheaper than maintaining 24/7 self-hosted infrastructure.
  • Designed for Failure - The if: always() pattern ensures no permanent firewall holes are left behind if tests crash.

Production Path: Zero-Trust Networking

To be clear: dynamic firewall whitelisting is a PoC pattern, not a production one. I chose it deliberately because I understand what I'm trading off. For enterprise clients or compliance-sensitive workloads, the architecture shifts to identity-based Zero-Trust networking:

| Approach               | Description                                                                                            | When to Use                                                           |
|------------------------|--------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------|
| OIDC Authentication    | GitHub Actions authenticates to Azure via federated identity - no secrets stored, no firewall changes   | Default for any production deployment                                 |
| Azure Private Link     | Database accessible only via private IP within a VNet - completely off the public internet              | Compliance-heavy workloads (SOC2, HIPAA)                              |
| VNet-Injected Runners  | GitHub runners deployed directly into your Azure VNet, accessing resources over the private backbone    | When CI needs access to private resources without any firewall rules |
| Self-Hosted Scale Sets | Ephemeral CI runners in an Azure VM Scale Set within the target VNet                                    | Maximum security - code never leaves your network                    |
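
For the first row, the OIDC flow is mostly a workflow-level change. A sketch, assuming a federated credential is already configured on the Azure AD app registration (the three IDs replace the AZURE_CREDENTIALS secret):

permissions:
  id-token: write   # lets GitHub issue the OIDC token
  contents: read
 
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
          # No client secret and no firewall change: the OIDC token is exchanged for an Azure token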

The infrastructure is defined in Terraform, allowing a seamless transition from the current firewall model to VNet integration by swapping networking modules. The point isn't that whitelisting is "good enough" - it's that I know the rules before I break them for velocity.

Secrets Management

All sensitive values live in GitHub Repository Secrets - never in code:

| Secret               | Purpose                         |
|----------------------|---------------------------------|
| AZURE_CREDENTIALS    | Service Principal for Azure CLI |
| DATABASE_URL         | PostgreSQL connection string    |
| MISTRAL_API_KEY      | LLM provider access             |
| AZURE_OPENAI_API_KEY | Embeddings and fallback         |

These are injected as environment variables in the workflow, keeping credentials out of logs and version control.
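
A sketch of what that injection looks like on a workflow step:

env:
  DATABASE_URL: ${{ secrets.DATABASE_URL }}
  MISTRAL_API_KEY: ${{ secrets.MISTRAL_API_KEY }}
  AZURE_OPENAI_API_KEY: ${{ secrets.AZURE_OPENAI_API_KEY }}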

The Golden Dataset Goal

The end state we're building toward: parameterized regression testing from a curated dataset.

# load_dataset() and run_agent() are project helpers; the metrics are the ones defined earlier
@pytest.mark.parametrize("entry", load_dataset("golden_dataset.json"))
def test_scenarios(entry):
    test_case = LLMTestCase(
        input=entry["input"],
        actual_output=run_agent(entry["input"]),
        # FaithfulnessMetric needs the retrieved docs stored alongside each entry
        retrieval_context=entry.get("retrieval_context", []),
    )
    assert_test(test_case, metrics=[faithfulness, relevancy])

Each entry in golden_dataset.json represents a known-good interaction. When the agent's behavior regresses, the test fails with a clear diff of what changed.
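
An entry might look something like this (an illustrative shape, not a schema DeepEval imposes; the field names are ours):

{
  "input": "I have an iPhone 14. Which smartwatch should I get?",
  "retrieval_context": [
    "Apple Watch Series 9: requires iOS ...",
    "Samsung Galaxy Watch 6: requires Android ..."
  ]
}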

Closing the Loop: From Production Traces to Golden Data

Testing is only half of LLMOps. The Golden Dataset shouldn't just be hand-crafted - it should be harvested from production. Tools like Arize Phoenix or LangSmith capture real user interactions, and the failures (or near-misses) become the next generation of test cases. This creates a feedback loop where production teaches your test suite what to watch for.

We haven't implemented this loop yet in this PoC - but it's the natural next step once the agent is live and generating real traffic. The architecture supports it; we just haven't needed it at this stage.

A Note on Problem-First Engineering

I'll be honest: I haven't explored the full capabilities of DeepEval. They have dashboards, tracing, synthetic data generation, and more features I haven't touched. And that's intentional.

I had a specific problem: how do I run reliable integration tests on AI agents in CI/CD? The GEval approach solved it. I now have high confidence in my deployments, the tests catch regressions, and the cost is low relative to the value they deliver. That's perfect.

This is how we approach tooling: we avoid feature-bloat fatigue. We start with the core requirement - can this agent correctly route a search query? - and only adopt the complex parts of a tool when the simple approach no longer scales. If new needs emerge - maybe we want observability dashboards or synthetic test generation - then we'll dive deeper. But until that need exists, we ship what works.

I've worked with engineers who are religious about their tools - "we only use X" or "Y is the only proper way." That's the opposite of how we operate. We're not loyal to frameworks, libraries, or vendors. We're loyal to solving problems efficiently. The tool could be DeepEval, could be LangSmith, could be a custom pytest harness. The principle remains: solve the problem in front of you, not the imaginary ones.


Need help implementing DeepEval into your production CI/CD pipeline? Let's talk.

Pedro

Founder & Principal Engineer