Architecting Hybrid RAG Agents
An AI assistant that answers questions by searching both data warehouses (e.g., Google Analytics structured data) and documents (PDFs, contracts, Word reports) at the same time.
This case study is a technical deep dive into the architecture behind a Battlefield Capital collaboration: building an enterprise-grade "Chat with your Data" platform. It explores how we solved the problem of routing queries across data warehouses (like Google Analytics) and documents (PDFs, Excel files, contracts).
The Semantic Gap: Bridging Data Silos
Most AI chat systems can only search one type of data: either documents (PDFs, reports) or databases (spreadsheets, analytics).
The enterprise reality is a Semantic Gap. Valuable insights live in the intersection of transactional data (databases) and tribal knowledge (documents like PDFs and contracts).
A typical executive query requires crossing this bridge:
"How did the new Q3 Marketing Campaign (from PDF) impact our lead generation trend in Google Analytics (from SQL Warehouse) last month?"
This query requires:
- Vector Search: To understand what the "Q3 Marketing Campaign" is from unstructured documents.
- SQL Execution: To query the Google Analytics 4 (GA4) database for lead counts.
- Synthesis: To combine these two distinct modalities into a coherent answer.
If you throw this at a standard RAG chain, it fails (no SQL access). If you throw it at a standard LangChain SQL Agent, it fails (no context on "Q3 Campaign").
We needed a Hybrid Agentic System.
The Solution: A Router-Based Architecture
We architected a solution centered around a Multi-Agent Supervisor. Instead of a single monolithic agent, we built a fleet of specialized sub-agents orchestrated by a central brain.
1. The Supervisor: Deterministic Routing
We use strict Python types to handle routing logic, avoiding the unreliability of LLM-based routing on critical paths. In `MultiAgentSupervisor`, we map specific combinations of data sources to dedicated handler functions.
The use of frozenset allows us to handle multi-modal queries (e.g., GA4 + Vector Store) regardless of order.
```python
# backend-ai-api/src/domain/agents/qna/qna_multi_agent_system/qna_multi_agent_system.py
# Build a lookup table for source combos
self._combination_handlers: dict[
    frozenset[DataSourceType], CombinationHandler
] = {
    frozenset(
        {DataSourceType.GOOGLE_ANALYTICS, DataSourceType.FILE_UPLOADED}
    ): self._handle_ga_and_file,
    frozenset(
        {DataSourceType.GOOGLE_ANALYTICS, DataSourceType.VECTORSTORE}
    ): self._handle_ga_and_vectorstore,
    # ... other combinations
    frozenset({DataSourceType.GOOGLE_ANALYTICS}): self._handle_ga_only,
    frozenset({DataSourceType.VECTORSTORE}): self._handle_vectorstore_only,
}
```

This deterministic approach increased reliability significantly compared to letting an LLM "choose" the tool on every request.
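Dispatch then reduces to a single dictionary lookup. A minimal, self-contained sketch of the idea (the enum members and handler names below are illustrative, not the production code):

```python
from enum import Enum, auto

class DataSourceType(Enum):
    GOOGLE_ANALYTICS = auto()
    VECTORSTORE = auto()
    FILE_UPLOADED = auto()

def handle_ga_and_vectorstore(query: str) -> str:
    return f"hybrid answer for: {query}"

def handle_ga_only(query: str) -> str:
    return f"SQL-only answer for: {query}"

# Order of the requested sources does not matter: frozensets compare by content.
handlers = {
    frozenset({DataSourceType.GOOGLE_ANALYTICS, DataSourceType.VECTORSTORE}):
        handle_ga_and_vectorstore,
    frozenset({DataSourceType.GOOGLE_ANALYTICS}): handle_ga_only,
}

def route(sources: set[DataSourceType], query: str) -> str:
    # Deterministic routing: no LLM involved in picking the handler.
    handler = handlers.get(frozenset(sources))
    if handler is None:
        raise ValueError(f"No handler registered for {sources}")
    return handler(query)
```

Because the keys are frozensets, `{GA, VECTORSTORE}` and `{VECTORSTORE, GA}` resolve to the same handler.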
2. The "Shadow" Challenge: Managing Inter-Agent State
One of the primary failure modes in Multi-Agent systems is Context Contamination. If the SQL Agent returns a 500-row table, passing that entire payload to the 'Result Combiner' causes the LLM to lose focus (The "Lost in the Middle" phenomenon).
The Solution: Iterative Summarization. We implemented a Distiller Node: if a sub-agent's output exceeds a specific token threshold, it is automatically routed through a summarization layer that extracts only the parameters requested by the Supervisor before moving to the next state.
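In sketch form, the gate looks like this. The token counting and `summarize` function below are simplified stand-ins (a real implementation would use the model's tokenizer and an LLM summarization prompt), and the threshold is illustrative:

```python
TOKEN_THRESHOLD = 1000  # illustrative limit, not the production value

def count_tokens(text: str) -> int:
    # Crude whitespace approximation; swap in the model's tokenizer in practice.
    return len(text.split())

def summarize(text: str, requested_fields: list[str]) -> str:
    # Stand-in for an LLM call that extracts only what the Supervisor asked for.
    lines = [ln for ln in text.splitlines()
             if any(field in ln for field in requested_fields)]
    return "\n".join(lines[:20])  # hard cap to bound the context

def distill(sub_agent_output: str, requested_fields: list[str]) -> str:
    """Pass small outputs through unchanged; compress large ones."""
    if count_tokens(sub_agent_output) <= TOKEN_THRESHOLD:
        return sub_agent_output
    return summarize(sub_agent_output, requested_fields)
```

The key property is that a 500-row SQL result never reaches the Result Combiner verbatim; only the distilled fields do.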
3. The Guardrails: Production-Grade SQL Agents
Giving an LLM access to a SQL database is risky. To mitigate this, we implemented strict guardrails in the GA4SQLAgentFactory.
We inject specific constraints into the system prompt:
- Schema Restrictions: only query tables with the `ga4_` prefix.
- Identifier Stability: force the use of stable `property_id`s over mutable `display_name`s.
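In sketch form, such constraints can be appended to the system prompt before the agent is built. The rule text and function name here are our illustration, not the GA4SQLAgentFactory internals:

```python
GUARDRAILS = [
    "Only query tables whose names start with the ga4_ prefix.",
    "Refer to properties by their stable property_id, never by display_name.",
    "Never modify data: SELECT statements only.",
]

def build_system_prompt(base_prompt: str) -> str:
    """Append non-negotiable constraints to the agent's system prompt."""
    rules = "\n".join(f"- {rule}" for rule in GUARDRAILS)
    return f"{base_prompt}\n\nHard constraints (never violate):\n{rules}"
```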
Architectural Trade-off: Why we rejected LangChain's SQL Agent
We deliberately built a custom SQL Agent instead of using LangChain's default create_sql_agent.
- The Risk: standard agents are "black boxes" that often attempt to query `information_schema` or hallucinate table names, leading to runtime errors.
- The Fix: we treat SQL generation as a constrained classification problem, not a creative writing task.
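One concrete way to enforce that framing is to validate every generated statement against an allowlist before execution. This sketch is our own illustration (the table names are hypothetical): anything that is not a single SELECT over permitted `ga4_` tables is rejected outright.

```python
import re

ALLOWED_TABLES = {"ga4_events", "ga4_sessions", "ga4_leads"}  # illustrative

def validate_sql(sql: str) -> bool:
    """Accept only single SELECT statements over allowlisted ga4_ tables."""
    stripped = sql.strip().rstrip(";")
    if not stripped.lower().startswith("select"):
        return False
    if ";" in stripped:  # reject multi-statement payloads
        return False
    # Every table referenced after FROM/JOIN must be on the allowlist.
    tables = re.findall(r"\b(?:from|join)\s+([a-zA-Z_][a-zA-Z0-9_]*)",
                        stripped, flags=re.IGNORECASE)
    return bool(tables) and all(t in ALLOWED_TABLES for t in tables)
```

A generated query either maps onto a known-safe shape or is refused; there is no "creative" path to `information_schema`.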
Security & Sanitization
Beyond prompt engineering, we implemented strict PII sanitization at the database driver level. The agent connects via a read-only role with access only to specific materialized views, ensuring that no raw user data (PII) is ever exposed to the context window.
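The same defense-in-depth idea can be demonstrated with SQLite's read-only open mode (our example for illustration only; the production system uses a read-only database role over materialized views, not SQLite):

```python
import os
import sqlite3
import tempfile

# Set up a throwaway database to demonstrate against.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
admin = sqlite3.connect(path)
admin.execute("CREATE TABLE ga4_leads (property_id TEXT, leads INTEGER)")
admin.execute("INSERT INTO ga4_leads VALUES ('prop_1', 42)")
admin.commit()
admin.close()

# The agent gets a connection opened read-only at the driver level:
# any write attempt fails before it ever reaches the data.
agent_conn = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
rows = agent_conn.execute("SELECT leads FROM ga4_leads").fetchall()

try:
    agent_conn.execute("DELETE FROM ga4_leads")
    writable = True
except sqlite3.OperationalError:
    writable = False
```

Even if every prompt guardrail fails, the connection itself cannot mutate or exfiltrate anything outside the views it was granted.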
4. The Frontend: Streaming "Thought Bubbles"
Latency in Agentic systems is high (often 10-30 seconds). To keep the user engaged, we implemented a granular Server-Sent Events (SSE) stream that exposes the agent's "thinking process."
We define an `event: step` protocol to push state updates without cluttering the final response.
```typescript
// frontend-app/src/hooks/chat/useChatSubmission.ts
while (reading) {
  const { done, value } = await reader.read();
  if (done) break;
  // ... decoding logic ...
  if (eventType === 'step') {
    // Updates the "Thinking..." pill in the UI
    onStreamingStep(eventData);
  } else if (eventType === 'final') {
    // Renders the final markdown response
    const finalResult = JSON.parse(eventData);
    handleApiResponseWithSuggestions(finalResult);
    break;
  }
}
```

This turns a 20-second wait into an interactive experience where the user sees:
"Analyzing query clarity...""Deciding data sources...""Refining query...""Generating SQL..."
5. Handling Ambiguity (The Economic Logic)
Before spinning up expensive agents, we use a lightweight AmbiguityDetectionAgent. This isn't just UX; it's Economic Logic.
The ROI: This layer reduces total token consumption by ~40% by intercepting malformed or vague queries before they hit the expensive "Reasoning" agents (Supervisor + SQL).
```python
# backend-ai-api/src/domain/agents/aux/ambiguity_detection/ambiguity_detection_agent.py
# If we determined it's ambiguous, use the reasoning from the highest-confidence system
if is_ambiguous and system_with_max_confidence:
    reason = specialized_responses[system_with_max_confidence]["decision_reason"]
    clarification_message = f"Your query needs clarification: {reason}"
    # Early exit - don't burn tokens on the Supervisor
    yield f"event: final\ndata: {json.dumps(ambiguity_result)}\n\n"
```

Conclusion
Building a production-grade Agentic system is less about "prompt engineering" and more about Systems Engineering.
By breaking the monolith into specialized components (Ambiguity Detector, Supervisor, SQL Agent, Vector Agent) and binding them with strict Pythonic glue code, we created a system that is robust, debuggable, and scalable.