Architecting Hybrid RAG Agents
An AI assistant that answers questions by searching both data warehouses (e.g., Google Analytics structured data) and documents (PDFs, contracts, Word reports) at the same time.
This case study is a technical deep dive into the architecture behind a Battlefield Capital collaboration: building an enterprise-grade "Chat with your Data" platform. It explores how we solved the problem of routing queries across data warehouses (like Google Analytics) and documents (PDFs, Excel files, contracts).
The Semantic Gap: Bridging Data Silos
Most AI chat systems can only search one type of data: either documents (PDFs, reports) or databases (spreadsheets, analytics).
The enterprise reality is a Semantic Gap. Valuable insights live in the intersection of transactional data (databases) and tribal knowledge (documents like PDFs and contracts).
A typical executive query requires crossing this bridge:
"How did the new Q3 Marketing Campaign (from PDF) impact our lead generation trend in Google Analytics (from SQL Warehouse) last month?"
This query requires:
- Vector Search: To understand what the "Q3 Marketing Campaign" is from unstructured documents.
- SQL Execution: To query the Google Analytics 4 (GA4) database for lead counts.
- Synthesis: To combine these two distinct modalities into a coherent answer.
If you throw this at a standard RAG chain, it fails (no SQL access). If you throw it at a standard LangChain SQL Agent, it fails (no context on "Q3 Campaign").
We needed a Hybrid Agentic System.
The Solution: A Router-Based Architecture
We architected a solution centered around a Multi-Agent Supervisor. Instead of a single monolithic agent, we built a fleet of specialized sub-agents orchestrated by a central brain.
1. The Supervisor: Deterministic Routing
We use strict Python types to handle routing logic, avoiding the unreliability of LLM-based routing on critical paths. In `MultiAgentSupervisor`, we map specific combinations of data sources to dedicated handler functions.
The use of frozenset allows us to handle multi-modal queries (e.g., GA4 + Vector Store) regardless of order.
```python
# backend-ai-api/src/domain/agents/qna/qna_multi_agent_system/qna_multi_agent_system.py
# Build a lookup table for source combos
self._combination_handlers: dict[
    frozenset[DataSourceType], CombinationHandler
] = {
    frozenset(
        {DataSourceType.GOOGLE_ANALYTICS, DataSourceType.FILE_UPLOADED}
    ): self._handle_ga_and_file,
    frozenset(
        {DataSourceType.GOOGLE_ANALYTICS, DataSourceType.VECTORSTORE}
    ): self._handle_ga_and_vectorstore,
    # ... other combinations
    frozenset({DataSourceType.GOOGLE_ANALYTICS}): self._handle_ga_only,
    frozenset({DataSourceType.VECTORSTORE}): self._handle_vectorstore_only,
}
```

This deterministic approach increased reliability significantly compared to letting an LLM "choose" the tool on every request.
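Dispatch then reduces to a single dictionary lookup. A minimal, self-contained sketch of the idea (the enum members and handler names below are illustrative, not the production code):

```python
from enum import Enum, auto

class DataSourceType(Enum):
    GOOGLE_ANALYTICS = auto()
    VECTORSTORE = auto()
    FILE_UPLOADED = auto()

def handle_ga_and_vectorstore(query: str) -> str:
    return f"hybrid answer for: {query}"

def handle_ga_only(query: str) -> str:
    return f"SQL-only answer for: {query}"

# Order of the requested sources does not matter: frozensets compare by content.
handlers = {
    frozenset({DataSourceType.GOOGLE_ANALYTICS, DataSourceType.VECTORSTORE}):
        handle_ga_and_vectorstore,
    frozenset({DataSourceType.GOOGLE_ANALYTICS}): handle_ga_only,
}

def route(sources: set[DataSourceType], query: str) -> str:
    # Deterministic routing: no LLM involved in picking the handler.
    handler = handlers.get(frozenset(sources))
    if handler is None:
        raise ValueError(f"No handler registered for {sources}")
    return handler(query)
```

Because the keys are frozensets, `{GA, VECTORSTORE}` and `{VECTORSTORE, GA}` resolve to the same handler.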
2. The "Shadow" Challenge: Managing Inter-Agent State
One of the primary failure modes in Multi-Agent systems is Context Contamination. If the SQL Agent returns a 500-row table, passing that entire payload to the 'Result Combiner' causes the LLM to lose focus (The "Lost in the Middle" phenomenon).
The Solution: Iterative Summarization. We implemented a Distiller Node: if a sub-agent's output exceeds a specific token threshold, it is automatically routed through a summarization layer that extracts only the parameters requested by the Supervisor before moving to the next state.
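In sketch form, the gate looks like this. The token counting and `summarize` function below are simplified stand-ins (a real implementation would use the model's tokenizer and an LLM summarization prompt), and the threshold is illustrative:

```python
TOKEN_THRESHOLD = 1000  # illustrative limit, not the production value

def count_tokens(text: str) -> int:
    # Crude whitespace approximation; swap in the model's tokenizer in practice.
    return len(text.split())

def summarize(text: str, requested_fields: list[str]) -> str:
    # Stand-in for an LLM call that extracts only what the Supervisor asked for.
    lines = [ln for ln in text.splitlines()
             if any(field in ln for field in requested_fields)]
    return "\n".join(lines[:20])  # hard cap to bound the context

def distill(sub_agent_output: str, requested_fields: list[str]) -> str:
    """Pass small outputs through unchanged; compress large ones."""
    if count_tokens(sub_agent_output) <= TOKEN_THRESHOLD:
        return sub_agent_output
    return summarize(sub_agent_output, requested_fields)
```

The key property is that a 500-row SQL result never reaches the Result Combiner verbatim; only the distilled fields do.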
3. The Guardrails: Production-Grade SQL Agents
Giving an LLM access to a SQL database is risky. To mitigate this, we implemented strict guardrails in the GA4SQLAgentFactory.
We inject specific constraints into the system prompt:
- Schema Restrictions: only query tables with the `ga4_` prefix.
- Identifier Stability: force the use of stable `property_id`s over mutable `display_name`s.
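In sketch form, such constraints can be appended to the system prompt before the agent is built. The rule text and function name here are our illustration, not the GA4SQLAgentFactory internals:

```python
GUARDRAILS = [
    "Only query tables whose names start with the ga4_ prefix.",
    "Refer to properties by their stable property_id, never by display_name.",
    "Never modify data: SELECT statements only.",
]

def build_system_prompt(base_prompt: str) -> str:
    """Append non-negotiable constraints to the agent's system prompt."""
    rules = "\n".join(f"- {rule}" for rule in GUARDRAILS)
    return f"{base_prompt}\n\nHard constraints (never violate):\n{rules}"
```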
Architectural Trade-off: Why we rejected LangChain's SQL Agent
We deliberately built a custom SQL Agent instead of using LangChain's default create_sql_agent.
- The Risk: standard agents are "black boxes" that often attempt to query `information_schema` or hallucinate table names, leading to runtime errors.
- The Fix: we treat SQL generation as a constrained classification problem, not a creative writing task.
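One concrete way to enforce that framing is to validate every generated statement against an allowlist before execution. This sketch is our own illustration (the table names are hypothetical): anything that is not a single SELECT over permitted `ga4_` tables is rejected outright.

```python
import re

ALLOWED_TABLES = {"ga4_events", "ga4_sessions", "ga4_leads"}  # illustrative

def validate_sql(sql: str) -> bool:
    """Accept only single SELECT statements over allowlisted ga4_ tables."""
    stripped = sql.strip().rstrip(";")
    if not stripped.lower().startswith("select"):
        return False
    if ";" in stripped:  # reject multi-statement payloads
        return False
    # Every table referenced after FROM/JOIN must be on the allowlist.
    tables = re.findall(r"\b(?:from|join)\s+([a-zA-Z_][a-zA-Z0-9_]*)",
                        stripped, flags=re.IGNORECASE)
    return bool(tables) and all(t in ALLOWED_TABLES for t in tables)
```

A generated query either maps onto a known-safe shape or is refused; there is no "creative" path to `information_schema`.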
Security & Sanitization
Beyond prompt engineering, we implemented strict PII sanitization at the database driver level. The agent connects via a read-only role with access only to specific materialized views, ensuring that no raw user data (PII) is ever exposed to the context window.
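The same defense-in-depth idea can be demonstrated with SQLite's read-only open mode (our example for illustration only; the production system uses a read-only database role over materialized views, not SQLite):

```python
import os
import sqlite3
import tempfile

# Set up a throwaway database to demonstrate against.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
admin = sqlite3.connect(path)
admin.execute("CREATE TABLE ga4_leads (property_id TEXT, leads INTEGER)")
admin.execute("INSERT INTO ga4_leads VALUES ('prop_1', 42)")
admin.commit()
admin.close()

# The agent gets a connection opened read-only at the driver level:
# any write attempt fails before it ever reaches the data.
agent_conn = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
rows = agent_conn.execute("SELECT leads FROM ga4_leads").fetchall()

try:
    agent_conn.execute("DELETE FROM ga4_leads")
    writable = True
except sqlite3.OperationalError:
    writable = False
```

Even if every prompt guardrail fails, the connection itself cannot mutate or exfiltrate anything outside the views it was granted.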
4. The Frontend: Streaming "Thought Bubbles"
Latency in Agentic systems is high (often 10-30 seconds). To keep the user engaged, we implemented a granular Server-Sent Events (SSE) stream that exposes the agent's "thinking process."
We define an `event: step` protocol to push state updates without cluttering the final response.
```typescript
// frontend-app/src/hooks/chat/useChatSubmission.ts
while (reading) {
  const { done, value } = await reader.read();
  if (done) break;
  // ... decoding logic ...
  if (eventType === 'step') {
    // Updates the "Thinking..." pill in the UI
    onStreamingStep(eventData);
  } else if (eventType === 'final') {
    // Renders the final markdown response
    const finalResult = JSON.parse(eventData);
    handleApiResponseWithSuggestions(finalResult);
    break;
  }
}
```

This turns a 20-second wait into an interactive experience where the user sees:
"Analyzing query clarity...""Deciding data sources...""Refining query...""Generating SQL..."
5. Handling Ambiguity (The Economic Logic)
Before spinning up expensive agents, we use a lightweight AmbiguityDetectionAgent. This isn't just UX; it's Economic Logic.
The ROI: This layer reduces total token consumption by ~40% by intercepting malformed or vague queries before they hit the expensive "Reasoning" agents (Supervisor + SQL).
```python
# backend-ai-api/src/domain/agents/aux/ambiguity_detection/ambiguity_detection_agent.py
# If we determined it's ambiguous, use the reasoning from the highest-confidence system
if is_ambiguous and system_with_max_confidence:
    reason = specialized_responses[system_with_max_confidence]["decision_reason"]
    clarification_message = f"Your query needs clarification: {reason}"
    # Early exit - don't burn tokens on the Supervisor
    yield f"event: final\ndata: {json.dumps(ambiguity_result)}\n\n"
```

Conclusion
Building a production-grade Agentic system is less about "prompt engineering" and more about Systems Engineering.
By breaking the monolith into specialized components (Ambiguity Detector, Supervisor, SQL Agent, Vector Agent) and binding them with strict Pythonic glue code, we created a system that is robust, debuggable, and scalable.