🧠 Agentic RAG (agentic-rag)¶
Multi-source retrieval orchestrator that reasons across a Neo4j knowledge graph, an Azure AI Search hybrid index, and live web search, with optional SOTA reranking, citation-aware truncation, and a cost-efficient LLM-as-a-Judge evaluation gate.
Tools¶
| Tool | Purpose |
|---|---|
| `graph_retrieval` | Run a Cypher template against Neo4j and project rows into chunks. |
| `graph_schema` | List node labels and relationship types so the agent can plan queries. |
| `verify_graph_plugins` | Sanity-check that APOC and GDS are active before running advanced templates. |
| `list_cypher_templates` | Surface the named Cypher templates shipped with the skill. |
| `enterprise_search` | Hybrid (BM25 + vector) search against Azure AI Search with optional OData filter and Semantic Ranker. |
| `web_intelligence` | Delegate to the configured `WebSearchProvider` for real-time external context. |
| `parallel_retrieval` | Fan out across every configured source via `asyncio.gather`, rerank the merged pool, and citation-cap the result. |
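The fan-out behaviour can be sketched as follows. `fetch` and the source names here are illustrative stand-ins for the configured providers, not the skill's real internals:

```python
import asyncio

# Hypothetical per-source retrieval coroutine; the real work is done by the
# configured Neo4j / Azure Search / web providers.
async def fetch(source: str, query: str) -> list[str]:
    if source == "web":
        raise TimeoutError("web source offline")  # simulate one failed source
    return [f"{source}:{query}"]

async def parallel_retrieval(query: str) -> list[str]:
    # return_exceptions=True means a failing source is returned as an
    # exception object instead of cancelling its siblings.
    results = await asyncio.gather(
        *(fetch(s, query) for s in ("graph", "enterprise", "web")),
        return_exceptions=True,
    )
    pool: list[str] = []
    for r in results:
        if isinstance(r, Exception):
            continue  # drop the failed source, keep the rest of the pool
        pool.extend(r)
    return pool

pool = asyncio.run(parallel_retrieval("rerankers"))
```

The merged pool then flows into the reranker and citation-aware truncation described below.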
Reasoning flow¶
┌──────────────────┐
│ intent triage │
└────────┬─────────┘
│
┌─────────────┼─────────────┐
▼ ▼ ▼
graph_retrieval enterprise_search web_intelligence
│ │ │
└─────────────┼─────────────┘
▼
evaluate sufficiency
│
insufficient?
│
▼
parallel_retrieval
│
▼
reranker (Qwen3 / Cohere)
│
▼
citation-aware truncation
│
▼
synthesise
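The diagram above can be read as a simple control loop. This is a sketch of the routing logic only; the real orchestration lives inside `AgenticRAGSkill`, and all function names here are illustrative parameters:

```python
def answer(query, retrieve, sufficient, parallel_retrieval,
           rerank, truncate, synthesise):
    # 1. Intent triage routes to a single best-fit source first.
    chunks = retrieve(query)
    # 2. If the sufficiency gate fails, fan out across every source.
    if not sufficient(chunks):
        chunks = parallel_retrieval(query)
    # 3. Rerank the pool, then apply citation-aware truncation.
    chunks = truncate(rerank(chunks))
    # 4. Synthesise the final answer from the surviving chunks.
    return synthesise(query, chunks)
```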
Reranking layer¶
The RerankerProvider ABC lets callers attach a high-precision second-stage
scorer that runs over the merged retrieval pool before truncation.
| Provider | Model | Endpoint | Long-context |
|---|---|---|---|
| `Qwen3RerankerProvider` | `Qwen/Qwen3-Reranker-4B` | OpenAI-compatible inference (vLLM / TGI / Together AI) | 32 768 tokens |
| `CohereRerankProvider` | `rerank-v3.5` | https://api.cohere.com/v2/rerank | Long-form documents |
| `NoOpRerankerProvider` | — | passthrough | n/a |
Both production providers expose RerankerConfig.max_context_tokens
(default 32 768) so long enterprise documents are ranked end-to-end without
relevance loss from naive head-truncation.
Raw retrieval vs. reranked (illustrative)¶
The numbers below are reference values from internal benchmarking on the
rag_golden_set.json cases; rerun the eval suite (see below) to refresh
them against your environment.
| Metric | Raw retrieval | + Reranker | Δ |
|---|---|---|---|
| Faithfulness | 0.78 | 0.91 | +0.13 |
| Answer Relevancy | 0.82 | 0.93 | +0.11 |
| Contextual Precision | 0.71 | 0.89 | +0.18 |
Citation-aware truncation¶
truncate_chunks_to_budget reserves headroom for source attribution before
packing chunk content:
estimate_citation_tokens sums the source label, identifier, and JSON
serialisation of each chunk's metadata.extra, then divides by 4 to
approximate tokens. The default safety buffer is 64 tokens to absorb
structural overhead (commas, JSON braces, surrounding prose). The function
returns an empty list when the citation overhead exhausts the budget so
downstream synthesis never produces uncited claims.
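A minimal sketch of the budgeting arithmetic described above, assuming the chars-per-token heuristic of 4 and the 64-token buffer from the text; the chunk field names (`source`, `id`, `extra`, `content`) and whether the buffer lives inside the estimate or the budget subtraction are illustrative choices:

```python
import json

CHARS_PER_TOKEN = 4   # rough heuristic used by the text above
SAFETY_BUFFER = 64    # tokens reserved for commas, braces, surrounding prose

def estimate_citation_tokens(chunks: list[dict]) -> int:
    # Sum the source label, identifier, and JSON-serialised extra metadata,
    # then divide characters by 4 to approximate tokens.
    chars = sum(
        len(c["source"]) + len(c["id"]) + len(json.dumps(c.get("extra", {})))
        for c in chunks
    )
    return chars // CHARS_PER_TOKEN + SAFETY_BUFFER

def truncate_to_budget(chunks: list[dict], budget_tokens: int) -> list[dict]:
    # Reserve citation headroom first, then pack chunk content into the rest.
    remaining = budget_tokens - estimate_citation_tokens(chunks)
    if remaining <= 0:
        return []  # never emit content that cannot be cited
    kept, used = [], 0
    for c in chunks:
        cost = len(c["content"]) // CHARS_PER_TOKEN
        if used + cost > remaining:
            break
        kept.append(c)
        used += cost
    return kept
```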
Configuration¶
from mirai_shared_skills.agentic_rag import (
AgenticRAGSkill,
AzureSearchConfig,
AzureSearchProvider,
BrowserWebSearchProvider,
CohereRerankProvider,
Neo4jConnection,
Neo4jGraphProvider,
RerankerConfig,
)
azure = AzureSearchProvider(
AzureSearchConfig(
endpoint="https://acme.search.windows.net",
index_name="docs",
api_key="...",
semantic_configuration="default",
)
)
neo4j = Neo4jGraphProvider(
Neo4jConnection(uri="bolt://neo4j:7687", user="neo4j", password="..."),
)
reranker = CohereRerankProvider(
api_key="...",
config=RerankerConfig(top_k=8, max_context_tokens=32_768),
)
skill = AgenticRAGSkill(
neo4j=neo4j,
azure=azure,
web=BrowserWebSearchProvider(),
reranker=reranker,
)
Local Neo4j setup (Docker)¶
docker-compose.test.yml ships a Neo4j Enterprise container pre-loaded with
APOC and GDS:
docker compose -f docker-compose.test.yml up -d
# wait for the healthcheck to pass, then:
uv run pytest tests/integration -m integration
The integration suite auto-detects the live Bolt endpoint at
bolt://localhost:7687 and skips when the container is offline. Override
the endpoint with MIRAI_TEST_NEO4J_URI (e.g. when running against a
remote sandbox).
The verify_graph_plugins tool uses apoc.help and gds.list calls to
confirm both plugins are active before complex Cypher templates run, and
returns {apoc, gds, ok, detail} so the agent can halt cleanly if the
graph is misconfigured.
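The returned envelope can be consumed along these lines; the exact tool invocation depends on your agent runtime, so this helper is only a sketch of the halt-cleanly check:

```python
def plugins_ready(status: dict) -> bool:
    # status is the {apoc, gds, ok, detail} envelope described above.
    if not status.get("ok"):
        # Surface the diagnostic so the agent can report why it stopped.
        print(f"graph misconfigured: {status.get('detail', 'unknown')}")
        return False
    return bool(status["apoc"] and status["gds"])
```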
Evaluation — DeepEval + Flash judge¶
The mirai_shared_skills.agentic_rag.eval module ships an opinionated
LLM-as-a-Judge pipeline tuned for CI economics:
- `JudgeLLM` ABC plus `GeminiFlashJudge` (gemini-1.5-flash) and `GPT4oMiniJudge` (gpt-4o-mini) implementations. Both judges talk JSON over HTTP and return a `{score, reason}` envelope.
- `MockJudge` for hermetic CI runs.
- `evaluate_dataset(cases, candidates, judge)` scores every case against three metrics — Faithfulness, Answer Relevancy, and Contextual Precision — and aggregates per-metric averages.
- `DeepEvalJudgeAdapter` exposes any `JudgeLLM` as a DeepEval `BaseLLM`, so the same judge powers both the lightweight scorer and the full DeepEval metric suite when `[eval]` is installed.
Why a small judge¶
CI runs grade every PR. Switching the judge from a frontier model to Gemini Flash or GPT-4o-mini reduces cost roughly 50–100× while staying strongly correlated with human judgement on the three RAG metrics shipped here. The pipeline is judge-agnostic, so swap in a frontier model for nightly precision runs.
Quality gate¶
tests/test_rag_eval.py enforces a minimum threshold of 0.85 on the
average score for every metric across the
golden dataset at tests/data/rag_golden_set.json (12 cases). Run it
with the default MockJudge for hermetic CI; opt in to a real judge by
exporting MIRAI_RUN_DEEPEVAL=1 plus the corresponding API key
(GEMINI_API_KEY or OPENAI_API_KEY).
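The gate logic reduces to a per-metric comparison; a sketch, assuming the per-metric averages produced by `evaluate_dataset` (the helper name is illustrative):

```python
THRESHOLD = 0.85  # minimum average score per metric, as stated above

def passes_gate(per_metric_averages: dict[str, float]) -> bool:
    # Every metric must clear the bar independently; a strong metric
    # cannot compensate for a weak one.
    return all(score >= THRESHOLD for score in per_metric_averages.values())
```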
Categorisation¶
standard — read-only retrieval. Safe to inject directly into a
DynamicAgentEngine without SecureSkill wrapping.
Performance & safety¶
- All providers are async; `parallel_retrieval` runs them concurrently via `asyncio.gather(..., return_exceptions=True)`, so one failed source never cancels the others.
- The Neo4j driver, the Azure `httpx.AsyncClient`, and the reranker clients are managed as per-skill singletons and disposed via `await skill.aclose()`.
- Retrieved chunks pass through the configured reranker, then through `truncate_chunks_to_budget`, which derives the chunk budget by subtracting the estimated citation overhead.