# 🧠 Agentic RAG (agentic-rag)

Multi-source retrieval orchestrator that reasons across a Neo4j knowledge graph, an Azure AI Search hybrid index, and live web search, with optional SOTA reranking, citation-aware truncation, and a cost-efficient LLM-as-a-Judge evaluation gate.

## Tools

| Tool | Purpose |
| --- | --- |
| `graph_retrieval` | Run a Cypher template against Neo4j and project rows into chunks. |
| `graph_schema` | List node labels and relationship types so the agent can plan queries. |
| `verify_graph_plugins` | Sanity-check that APOC and GDS are active before running advanced templates. |
| `list_cypher_templates` | Surface the named Cypher templates shipped with the skill. |
| `enterprise_search` | Hybrid (BM25 + vector) search against Azure AI Search with optional OData filter and Semantic Ranker. |
| `web_intelligence` | Delegate to the configured `WebSearchProvider` for real-time external context. |
| `parallel_retrieval` | Fan out across every configured source via `asyncio.gather`, rerank the merged pool, and citation-cap the result (see the sketch below). |
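
The fan-out behind `parallel_retrieval` can be pictured with a short sketch. It is illustrative only: the provider method name `retrieve` and the chunk shape are assumptions, not the skill's actual internals.

```python
import asyncio
from typing import Any

async def fan_out(query: str, providers: list[Any]) -> list[Any]:
    """Query every configured source concurrently; a failed source is dropped, not fatal."""
    results = await asyncio.gather(
        *(provider.retrieve(query) for provider in providers),  # assumed method name
        return_exceptions=True,  # one exception never cancels the sibling tasks
    )
    merged: list[Any] = []
    for result in results:
        if isinstance(result, Exception):
            continue  # real code would log which source failed
        merged.extend(result)
    return merged  # next: rerank the pool, then citation-cap it
```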

## Reasoning flow

```text
        ┌──────────────────┐
        │   intent triage  │
        └────────┬─────────┘
   ┌─────────────┼─────────────┐
   ▼             ▼             ▼
graph_retrieval  enterprise_search  web_intelligence
   │             │             │
   └─────────────┼─────────────┘
                 ▼
      evaluate sufficiency
                 │  insufficient?
                 ▼
       parallel_retrieval
                 ▼
    reranker (Qwen3 / Cohere)
                 ▼
    citation-aware truncation
                 ▼
           synthesise
```

## Reranking layer

The `RerankerProvider` ABC lets callers attach a high-precision second-stage scorer that runs over the merged retrieval pool before truncation.

| Provider | Model | Endpoint | Long-context |
| --- | --- | --- | --- |
| `Qwen3RerankerProvider` | `Qwen/Qwen3-Reranker-4B` | OpenAI-compatible inference (vLLM / TGI / Together AI) | 32 768 tokens |
| `CohereRerankProvider` | `rerank-v3.5` | `https://api.cohere.com/v2/rerank` | Long-form documents |
| `NoOpRerankerProvider` | passthrough | n/a | — |

Both production providers expose `RerankerConfig.max_context_tokens` (default 32 768) so long enterprise documents are ranked end-to-end without relevance loss from naive head-truncation.
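
Because `RerankerProvider` is an ABC, plugging in a custom scorer means implementing one subclass. The sketch below is a toy example under stated assumptions: the ABC's import path, the hook name `rerank`, and the chunk's `content` attribute are guesses, so check the ABC before copying.

```python
from mirai_shared_skills.agentic_rag import RerankerConfig

# In real code this would subclass the RerankerProvider ABC; its import path,
# the hook name `rerank`, and the chunk attribute `content` are assumptions.
class KeywordOverlapReranker:
    """Toy second-stage scorer: query-term overlap with a mild length penalty."""

    def __init__(self, config: RerankerConfig | None = None):
        self.config = config or RerankerConfig(top_k=8)

    async def rerank(self, query: str, chunks: list) -> list:
        terms = set(query.lower().split())

        def score(chunk) -> float:
            words = chunk.content.lower().split()
            overlap = sum(word in terms for word in words)
            return overlap / (1 + len(words) / 100)

        return sorted(chunks, key=score, reverse=True)[: self.config.top_k]
```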

### Raw retrieval vs. reranked (illustrative)

The numbers below are reference values from internal benchmarking on the `rag_golden_set.json` cases; rerun the eval suite (see below) to refresh them against your environment.

| Metric | Raw retrieval | + Reranker | Δ |
| --- | --- | --- | --- |
| Faithfulness | 0.78 | 0.91 | +0.13 |
| Answer Relevancy | 0.82 | 0.93 | +0.11 |
| Contextual Precision | 0.71 | 0.89 | +0.18 |

## Citation-aware truncation

`truncate_chunks_to_budget` reserves headroom for source attribution before packing chunk content:

```text
chunk_budget = token_budget - (estimated_citation_tokens + citation_buffer)
```

`estimate_citation_tokens` sums the source label, identifier, and JSON serialisation of each chunk's `metadata.extra`, then divides the character count by 4 to approximate tokens. The default safety buffer is 64 tokens, absorbing structural overhead (commas, JSON braces, surrounding prose). The function returns an empty list when the citation overhead alone exhausts the budget, so downstream synthesis never produces uncited claims.
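
In code, the budgeting arithmetic looks roughly like this. The helper names mirror the prose above, but the metadata field names (`source`, `identifier`, `extra`) and exact signatures are assumptions:

```python
import json

CITATION_BUFFER = 64  # default headroom for commas, JSON braces, surrounding prose

def estimate_citation_tokens(chunks) -> int:
    """Sum attribution characters per chunk, then chars // 4 ≈ tokens."""
    chars = 0
    for chunk in chunks:
        chars += len(chunk.metadata.source)      # source label (field name assumed)
        chars += len(chunk.metadata.identifier)  # identifier (field name assumed)
        chars += len(json.dumps(chunk.metadata.extra))
    return chars // 4

def chunk_budget(token_budget: int, chunks) -> int:
    budget = token_budget - (estimate_citation_tokens(chunks) + CITATION_BUFFER)
    # A non-positive budget means citations alone exhaust it; the caller then
    # returns an empty list rather than emitting uncited content.
    return max(budget, 0)
```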

## Configuration

```python
from mirai_shared_skills.agentic_rag import (
    AgenticRAGSkill,
    AzureSearchConfig,
    AzureSearchProvider,
    BrowserWebSearchProvider,
    CohereRerankProvider,
    Neo4jConnection,
    Neo4jGraphProvider,
    RerankerConfig,
)

azure = AzureSearchProvider(
    AzureSearchConfig(
        endpoint="https://acme.search.windows.net",
        index_name="docs",
        api_key="...",
        semantic_configuration="default",
    )
)
neo4j = Neo4jGraphProvider(
    Neo4jConnection(uri="bolt://neo4j:7687", user="neo4j", password="..."),
)
reranker = CohereRerankProvider(
    api_key="...",
    config=RerankerConfig(top_k=8, max_context_tokens=32_768),
)
skill = AgenticRAGSkill(
    neo4j=neo4j,
    azure=azure,
    web=BrowserWebSearchProvider(),
    reranker=reranker,
)
```
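
A one-shot call against the configured skill might then look like this; the Python-level entry point `parallel_retrieval` and its signature are assumptions based on the tool table above:

```python
import asyncio

async def main() -> None:
    try:
        chunks = await skill.parallel_retrieval(  # assumed method name
            "Which services write to the payments topic?"
        )
        for chunk in chunks:  # already reranked and citation-capped
            print(chunk)
    finally:
        await skill.aclose()  # always release the pooled clients

asyncio.run(main())
```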

## Local Neo4j setup (Docker)

`docker-compose.test.yml` ships a Neo4j Enterprise container pre-loaded with APOC and GDS:

```bash
docker compose -f docker-compose.test.yml up -d
# wait for the healthcheck to pass, then:
uv run pytest tests/integration -m integration
```

The integration suite auto-detects the live Bolt endpoint at `bolt://localhost:7687` and skips when the container is offline. Override the endpoint with `MIRAI_TEST_NEO4J_URI` (e.g. when running against a remote sandbox).
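
The skip logic amounts to a cheap liveness probe against the Bolt port; here is a sketch of the pattern (the probe details and marker name are assumptions, not the suite's actual code):

```python
import os
import socket

import pytest

NEO4J_URI = os.environ.get("MIRAI_TEST_NEO4J_URI", "bolt://localhost:7687")

def bolt_reachable(uri: str, timeout: float = 1.0) -> bool:
    """TCP probe on the Bolt port; enough to decide run vs. skip."""
    host, _, port = uri.removeprefix("bolt://").partition(":")
    try:
        with socket.create_connection((host, int(port or 7687)), timeout=timeout):
            return True
    except OSError:
        return False

requires_neo4j = pytest.mark.skipif(
    not bolt_reachable(NEO4J_URI),
    reason="Neo4j container offline; start docker-compose.test.yml first",
)
```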

The `verify_graph_plugins` tool uses `apoc.help` and `gds.list` calls to confirm both plugins are active before complex Cypher templates run, and returns `{apoc, gds, ok, detail}` so the agent can halt cleanly if the graph is misconfigured.
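
Conceptually the check reduces to two Cypher calls that fail loudly when a plugin is absent. A sketch using the official `neo4j` async driver follows; the envelope keys match the prose, but the exact queries the tool runs are assumptions:

```python
from neo4j import AsyncGraphDatabase

async def verify_graph_plugins(uri: str, auth: tuple[str, str]) -> dict:
    driver = AsyncGraphDatabase.driver(uri, auth=auth)
    out = {"apoc": False, "gds": False, "ok": False, "detail": ""}
    try:
        async with driver.session() as session:
            try:  # APOC present iff its procedures are callable
                await session.run("CALL apoc.help('apoc') YIELD name RETURN name LIMIT 1")
                out["apoc"] = True
            except Exception as exc:
                out["detail"] += f"APOC missing: {exc}; "
            try:  # same probe for Graph Data Science
                await session.run("CALL gds.list() YIELD name RETURN name LIMIT 1")
                out["gds"] = True
            except Exception as exc:
                out["detail"] += f"GDS missing: {exc}"
        out["ok"] = out["apoc"] and out["gds"]
        return out
    finally:
        await driver.close()
```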

## Evaluation — DeepEval + Flash judge

The `mirai_shared_skills.agentic_rag.eval` module ships an opinionated LLM-as-a-Judge pipeline tuned for CI economics:

- `JudgeLLM` ABC plus `GeminiFlashJudge` (`gemini-1.5-flash`) and `GPT4oMiniJudge` (`gpt-4o-mini`) implementations. Both judges talk JSON over HTTP and return a `{score, reason}` envelope.
- `MockJudge` for hermetic CI runs.
- `evaluate_dataset(cases, candidates, judge)` scores every case against three metrics — Faithfulness, Answer Relevancy, and Contextual Precision — and aggregates per-metric averages (see the sketch after this list).
- `DeepEvalJudgeAdapter` exposes any `JudgeLLM` as a `DeepEvalBaseLLM`, so the same judge powers both the lightweight scorer and the full DeepEval metric suite when `[eval]` is installed.
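
A hermetic scoring run then stays a few lines. Only the `evaluate_dataset(cases, candidates, judge)` call shape comes from the list above; the golden-set shape, the hypothetical `answer_with_pipeline` helper, and whether the call is sync or async are assumptions:

```python
import json

from mirai_shared_skills.agentic_rag.eval import MockJudge, evaluate_dataset

with open("tests/data/rag_golden_set.json") as fh:
    cases = json.load(fh)  # 12 golden cases (shape assumed)

candidates = [answer_with_pipeline(case) for case in cases]  # hypothetical helper

report = evaluate_dataset(cases, candidates, judge=MockJudge())  # await if async in your version
print(report)  # per-metric averages, e.g. {"faithfulness": 0.91, ...}
```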

### Why a small judge

CI runs grade every PR. Switching the judge from a frontier model to Gemini Flash or GPT-4o-mini reduces cost roughly 50–100× while staying strongly correlated with human judgement on the three RAG metrics shipped here. The pipeline is judge-agnostic, so swap in a frontier model for nightly precision runs.

### Quality gate

`tests/test_rag_eval.py` enforces a minimum threshold of 0.85 on the average score for every metric across the golden dataset at `tests/data/rag_golden_set.json` (12 cases). Run it with the default `MockJudge` for hermetic CI; opt in to a real judge by exporting `MIRAI_RUN_DEEPEVAL=1` plus the corresponding API key (`GEMINI_API_KEY` or `OPENAI_API_KEY`).
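
The gate itself reduces to a threshold assertion plus the opt-in switch for a real judge; a sketch of the shape (the shipped test file may structure this differently):

```python
import os

from mirai_shared_skills.agentic_rag.eval import GeminiFlashJudge, MockJudge

THRESHOLD = 0.85

def pick_judge():
    """Hermetic by default; opt in to a real judge explicitly."""
    if os.environ.get("MIRAI_RUN_DEEPEVAL") == "1":
        return GeminiFlashJudge()  # requires GEMINI_API_KEY
    return MockJudge()

def assert_quality_gate(report: dict[str, float]) -> None:
    for metric, average in report.items():
        assert average >= THRESHOLD, f"{metric} averaged {average:.2f} < {THRESHOLD}"
```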

## Categorisation

`standard` — read-only retrieval. Safe to inject directly into a `DynamicAgentEngine` without `SecureSkill` wrapping.

## Performance & safety

- All providers are async; `parallel_retrieval` runs them concurrently via `asyncio.gather(..., return_exceptions=True)`, so one failed source never cancels the others.
- The Neo4j driver, Azure `httpx.AsyncClient`, and reranker clients are managed as per-skill singletons and disposed via `await skill.aclose()` (see the lifecycle sketch below).
- Retrieved chunks pass through the configured reranker, then through `truncate_chunks_to_budget`, which derives the chunk budget by subtracting the estimated citation overhead before packing content.
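
One way to guarantee the cleanup described above is to tie the skill to an async context manager; a sketch under the same naming assumptions as the configuration example:

```python
import contextlib

@contextlib.asynccontextmanager
async def rag_skill(**providers):
    """Scope the per-skill singleton clients to a lifecycle that always disposes them."""
    skill = AgenticRAGSkill(**providers)
    try:
        yield skill
    finally:
        await skill.aclose()  # Neo4j driver, httpx.AsyncClient, reranker clients

# Usage:
# async with rag_skill(neo4j=neo4j, azure=azure, web=web, reranker=reranker) as skill:
#     ...
```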