ADR-0006: Token-Budgeted RAG Context Assembly

  • Date: 2026-05-06
  • Authors: Matteo Rizzo
  • Status: Accepted
  • Approval State: Approved (Approved by: Matteo Rizzo on 2026-05-06)
  • Implementation State: Completed

1. Context and Problem Statement

AgenticRAGSkill retrieves chunks from multiple providers (graph, vector, web), reranks them, and synthesizes them into a single context payload that the calling agent injects into its prompt. The total size of that payload can vary by an order of magnitude depending on the query: a tightly scoped graph lookup might return 5 short chunks; a broad web search might return 50 long ones, plus a graph expansion, plus reranker rationale.

If the skill emits the entire concatenated retrieval output, the calling agent's context window blows out. Worse, the LLM provider charges per input token, so unbounded context growth silently turns a $0.01 turn into a $1.00 turn. We need a bound on payload size — but the bound has to be citation-aware, because every chunk the skill cites also costs tokens (source label + identifier + JSON-serialized metadata extras).

We also want the skill to be tokenizer-agnostic: the actual tokenizer differs depending on which model the calling agent uses (Claude, GPT-4, Llama 3). We don't want mirai-shared-skills to ship tiktoken, transformers, and anthropic-tokenizer as dependencies just to count tokens.

2. Decision Drivers (Forces)

  • Cost predictability: Total injected tokens must respect a configurable upper bound.
  • Citation integrity: Every cited chunk must fit within the budget — no half-truncated citations.
  • Tokenizer independence: Estimation must be cheap and model-agnostic.
  • Configurable per call: Different agents have different context budgets; the skill must accept a budget override at construction or per call.
  • Determinism: Given the same input, the skill must always produce the same output (no random drop-after-budget-exceeded).
  • Observability: When the skill drops chunks to fit budget, that fact must be visible — both to the LLM (so it knows context is partial) and to operators (so they can tune the budget).

3. Considered Options

  1. Option 1: No budget; emit everything.
  2. Option 2: Truncate the concatenated text at the budget boundary, possibly mid-citation.
  3. Option 3: Budget by chunk count (top_k). Drop reranker tail until you have ≤ N chunks.
  4. Option 4: Token-budgeted chunk selection with a citation safety buffer (chosen). Greedy fill from highest-rank chunks; reserve space for the citation envelope of every chunk admitted.

4. Decision Outcome

Chosen option: Option 4 (token-budgeted chunk selection with a citation safety buffer), because it respects both the cost ceiling and the citation contract, and fails gracefully (dropping low-rank chunks rather than truncating high-rank text).

The implementation in mirai_shared_skills/agentic_rag/skill.py defines three constants:

DEFAULT_TOKEN_BUDGET = 8_000
APPROX_CHARS_PER_TOKEN = 4
DEFAULT_CITATION_SAFETY_BUFFER_TOKENS = 64
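
DEFAULT_TOKEN_BUDGET is only a default; the "configurable per call" driver means callers supply their own bound. A hypothetical override (the token_budget parameter name is illustrative, not the skill's confirmed signature):

# Hypothetical parameter names; check AgenticRAGSkill's actual signature.
skill = AgenticRAGSkill(token_budget=4_000)        # construction-time budget
result = skill.run(query, token_budget=2_000)      # per-call override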

And the estimate_citation_tokens(chunks) helper:

def estimate_citation_tokens(chunks: Sequence[RAGContextChunk]) -> int:
    """Return the approximate token cost of citing every chunk in `chunks`.

    Each citation reserves space for the source label, the identifier (often a
    URL), and the JSON-serialised metadata extras the LLM may surface back to
    the user. Token counts are estimated as `chars / 4` to stay tokeniser-free.
    """

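The ADR quotes only the docstring; the helper's body is not reproduced here. A minimal sketch of the estimation the docstring describes, assuming RAGContextChunk exposes source, identifier, and extras attributes (illustrative field names, not the confirmed model):

import json
from collections.abc import Sequence

def estimate_citation_tokens_sketch(chunks: Sequence["RAGContextChunk"]) -> int:
    total = 0
    for chunk in chunks:
        # Source label + identifier (often a URL) + JSON-serialised extras.
        chars = len(chunk.source) + len(chunk.identifier)
        chars += len(json.dumps(chunk.extras))
        # chars / 4 keeps the estimate tokenizer-free.
        total += chars // APPROX_CHARS_PER_TOKEN
    return total
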
The skill's flow:

  1. Run all providers (graph, vector, web). Collect candidate chunks.
  2. Rerank (or pass through NoOpRerankerProvider).
  3. Greedy-fill, highest rank first: estimate the chunk's token cost (len(chunk.text) / 4) plus the citation envelope. If admitting the chunk plus its citation cost stays within budget - DEFAULT_CITATION_SAFETY_BUFFER_TOKENS, keep it; otherwise drop it and move on (see the sketch after this list).
  4. Emit the selected chunks plus a metadata field reporting dropped_count and dropped_total_tokens.
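
A sketch of step 3's greedy fill (illustrative, not the exact code in skill.py; it reuses the estimate_citation_tokens_sketch helper above):

def greedy_fill(ranked_chunks, budget=DEFAULT_TOKEN_BUDGET):
    # Reserve the citation safety buffer up front.
    usable = budget - DEFAULT_CITATION_SAFETY_BUFFER_TOKENS
    selected, spent = [], 0
    dropped_count, dropped_total_tokens = 0, 0
    for chunk in ranked_chunks:  # already in rerank order, so deterministic
        cost = len(chunk.text) // APPROX_CHARS_PER_TOKEN
        cost += estimate_citation_tokens_sketch([chunk])
        if spent + cost <= usable:
            selected.append(chunk)
            spent += cost
        else:
            dropped_count += 1
            dropped_total_tokens += cost
    metadata = {"dropped_count": dropped_count,
                "dropped_total_tokens": dropped_total_tokens}
    return selected, metadata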

The chars / 4 heuristic intentionally over-estimates (most modern subword tokenizers compress better than 4 chars/token for English prose), which means the actual injected context is comfortably under the budget. We accept this slack as the price of tokenizer-independence.
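
For example, a 1,000-character chunk is charged 1,000 / 4 = 250 tokens; a tokenizer averaging, say, 4.5 characters per token on English prose would actually produce about 222, so the estimate runs roughly 12% high, consistent with the slack noted in section 5.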

The DEFAULT_CITATION_SAFETY_BUFFER_TOKENS = 64 reserves space for the citation footer the calling agent renders ("Sources: …"). Without it, a chunk could be admitted whose citation pushes the total over budget after the agent appends its citation list.

4.1. Validation / Compliance

  • Unit tests assert: budget is respected for both small (5 chunks) and large (500 chunks) inputs; chunk admission is deterministic given the same rerank order; the metadata field reports dropped chunks accurately.
  • Property test: for any random input, tokens_admitted + buffer + citation_envelope ≤ budget.
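
A sketch of that property with hypothesis, reusing the illustrative greedy_fill above and a stand-in chunk type (not the real RAGContextChunk model):

from dataclasses import dataclass
from hypothesis import given, strategies as st

@dataclass
class FakeChunk:  # stand-in for RAGContextChunk, for this sketch only
    text: str
    source: str
    identifier: str
    extras: dict

chunks_st = st.lists(
    st.builds(
        FakeChunk,
        text=st.text(max_size=400),
        source=st.sampled_from(["graph", "vector", "web"]),
        identifier=st.text(max_size=80),
        extras=st.dictionaries(st.text(max_size=8), st.text(max_size=16), max_size=3),
    ),
    max_size=50,
)

@given(chunks_st, st.integers(min_value=256, max_value=16_000))
def test_admitted_tokens_respect_budget(chunks, budget):
    selected, _ = greedy_fill(chunks, budget=budget)
    admitted = sum(len(c.text) // APPROX_CHARS_PER_TOKEN for c in selected)
    envelope = estimate_citation_tokens_sketch(selected)
    assert admitted + envelope + DEFAULT_CITATION_SAFETY_BUFFER_TOKENS <= budget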

5. Pros and Cons of the Options

Option 1: No budget

  • Pros: Trivial.
  • Cons: Unbounded cost; context-window blowouts.

Option 2: Truncate concatenated text

  • Pros: Simple to implement.
  • Cons: Mid-citation truncation; agent receives malformed source attribution; may quote partial chunks.

Option 3: top_k only

  • Pros: Predictable count.
  • Cons: Doesn't respect token cost: five 2,000-token chunks blow the budget while five 50-token chunks under-use it.

Option 4 (chosen): Token-budgeted with citation buffer

  • Pros:
      • Respects cost ceiling.
      • Citations stay whole.
      • Tokenizer-independent (chars / 4).
      • Observable (dropped-chunk metadata).
  • Cons:
      • Heuristic over-estimates token cost — actual context is ~10–20% under budget.
      • Char-count estimation isn't accurate for non-English text. We accept this; clients with non-English corpora can subclass and override the estimate.

6. Consequences

  • Positive Consequences:
      • The skill is safe to wire into any agent without surprise cost spikes.
      • The dropped-chunk metadata gives operators a knob: if drops are frequent, raise DEFAULT_TOKEN_BUDGET; if they're never used, lower it.
      • Citation integrity means the agent can reliably surface "see source X" answers.
  • Negative Consequences / Trade-offs:
      • chars / 4 is wrong for some languages and some content (code, JSON). We accept the imprecision.
  • Risks & Mitigations:
      • Risk: A reranker bug ranks low-quality chunks first, the budget fills with junk, and the actually-relevant chunk is dropped. Mitigation: the rerank score is preserved on every chunk; tests can assert rank ordering.

7. Implementation Plan & Status Updates

  • Target Milestone/Release: v0.1.0 (current).
  • Implementation Notes:
  • 2026-05-06: ADR formalizes the existing implementation in agentic_rag/skill.py. No code changes.