ADR-0009: Skill Metadata for Compliance and Production Routing

Date: 2026-05-06
Authors: Matteo Rizzo
Status: Accepted
Approval State: Approved (Approved by: Matteo Rizzo on 2026-05-06)
Implementation State: Completed

1. Context and Problem Statement

SkillDescriptor (see ADR-0004) shipped with seven fields: name, description, instructions, category, import_path, tools, and references. That schema was sufficient when the catalog was a handful of demo skills and downstream clients composed the bundle by hand. Two pressures changed the requirement:

EU AI Act Article 16 (August 2026 deadline) makes downstream operators responsible for declaring the risk profile of every "AI system component" they ship. Operators need a structured way to enumerate which catalog skills carry destructive or data-egress risk versus pure read-only metadata access. Scanning import paths or grepping docstrings is not a defensible audit position.
Stub skills shipped to production by accident. weather-api and database-operations are deliberately mock implementations during the development phase — weather-api returns a hardcoded "22 °C and sunny" string regardless of input, and database-operations returns no-op JSON envelopes for every mutation. When a downstream agent's SemanticRouter happens to pick them, the user gets confidently wrong answers. There was no first-class way to tell the catalog "this exists, but do not route to it in production".

category (standard vs raw per ADR-0001) addresses neither concern: standard/raw distinguishes "safe-by-default" from "must-be-wrapped-in-SecureSkill", which is a runtime safety property, not a regulatory risk tier or a production-readiness flag. A skill can be standard (read-only) and still need to be excluded from production routing because it is a stub; a skill can be raw (destructive), carry a high regulatory tier, and be either a stub or a real implementation. The three axes are orthogonal.

2. Decision Drivers (Forces)

Regulatory forward-compat. The EU AI Act risk tiers (minimal / limited / high) are the lingua franca downstream compliance teams already use. Aligning the catalog field with those terms means a future attestation report can read directly from find-skills output.
Single source of truth. The risk classification belongs to the catalog, not to every downstream client's local copy. Adding the field to SkillDescriptor keeps it versioned alongside the skill itself.
No silent stubs in production. Stub-flagged skills must be invisible to the default find-skills discovery path; an operator who explicitly wants to inspect them should still be able to.
Backward compatibility. Existing downstream code that calls all_descriptors() or find() must continue to work without modification. New behaviour is opt-in via a keyword argument.
No new ADR axis. The existing category and the new risk_tier/stub fields cover three orthogonal concerns. We do not want to invent a fourth field next quarter — see Considered Options for rejected alternatives.

3. Considered Options

Option 1: A single risk_level field that subsumes category, risk_tier, and stub-ness.
Option 2: Two new orthogonal fields — risk_tier: Literal["minimal", "limited", "high"] and stub: bool — added to SkillDescriptor (chosen).
Option 3: Out-of-band metadata file (e.g. risk_classifications.yaml shipped beside the catalog).
Option 4: Skill-side decorator (@high_risk, @stub) that the registry inspects via attribute lookup.

4. Decision Outcome

Chosen option: Option 2 (two orthogonal fields on SkillDescriptor), because each concern has a distinct audience (compliance reviewers vs. production operators) and the fields' Pydantic-style typing keeps the catalog schema searchable, validated at import, and immune to drift between the description and the reality.

The SkillDescriptor dataclass gains:

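from dataclasses import dataclass
from typing import Literal

RiskTier = Literal["minimal", "limited", "high"]

@dataclass(frozen=True, slots=True)
class SkillDescriptor:
    # ... existing fields ...
    risk_tier: RiskTier = "minimal"  # regulatory axis
    stub: bool = False  # production-readiness axis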

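risk_tier defaults to "minimal", so a skill carries a higher tier only when its author opts in explicitly. Because that default is the permissive choice on the regulatory axis, the registry-completeness test in section 4.1 requires every shipped skill to set the tier explicitly. stub defaults to False so a missing flag never accidentally hides a real skill.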

Concrete annotations on the catalog at acceptance time:

Skill	`category`	`risk_tier`	`stub`	Rationale
`find-skills`	standard	minimal	false	Pure metadata read
`authentication-gates`	standard	minimal	false	Frozen schema, signal-only
`pdf-extraction`	standard	minimal	false	Local file read; no egress
`agent-browser`	standard	limited	false	Outbound HTTP; data egress surface
`agentic-rag`	standard	limited	false	Touches enterprise indexes + public web
`execution-debugging`	raw	high	false	Arbitrary command execution (sandboxed)
`database-operations`	raw	high	true	Stub mutations; production must replace
`weather-api`	standard	minimal	true	Mock implementation
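
The annotations above lend themselves to the tier-by-tier enumeration a compliance pipeline might run. The sketch below is illustrative only: the rows are copied from the table, while the dict layout (name, category, risk_tier, stub keys) and the by_tier helper are assumptions, not the actual discovery envelope.

```python
from collections import defaultdict
from typing import Iterable

# Rows copied from the table above; the key names are an assumed layout.
CATALOG = [
    {"name": "find-skills", "category": "standard", "risk_tier": "minimal", "stub": False},
    {"name": "authentication-gates", "category": "standard", "risk_tier": "minimal", "stub": False},
    {"name": "pdf-extraction", "category": "standard", "risk_tier": "minimal", "stub": False},
    {"name": "agent-browser", "category": "standard", "risk_tier": "limited", "stub": False},
    {"name": "agentic-rag", "category": "standard", "risk_tier": "limited", "stub": False},
    {"name": "execution-debugging", "category": "raw", "risk_tier": "high", "stub": False},
    {"name": "database-operations", "category": "raw", "risk_tier": "high", "stub": True},
    {"name": "weather-api", "category": "standard", "risk_tier": "minimal", "stub": True},
]


def by_tier(entries: Iterable[dict]) -> dict[str, list[str]]:
    """Group skill names by risk tier (hypothetical helper)."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for entry in entries:
        grouped[entry["risk_tier"]].append(entry["name"])
    return dict(grouped)


# Full catalog view, as an auditor would want it.
full_report = by_tier(CATALOG)

# Production view: the same grouping with stub-flagged skills left out.
production_report = by_tier(e for e in CATALOG if not e["stub"])
```

In this sketch the "high" tier holds two skills in the full view and one in the production view, because database-operations is both high-tier and a stub. That is the case where the two new fields carry independent information.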

all_descriptors(), find(), and the discovery skill's list_skills / search_skills tools accept an include_stubs: bool keyword argument. It defaults to True for all_descriptors and find, which preserves back-compat for admin tooling, and to False for the agent-facing discovery tools, so a production router never picks a stub.

4.1. Validation / Compliance

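risk_tier is a Literal, so an unknown tier name is a type error caught by mypy. At runtime it is caught by the dataclass's __post_init__; the check has to be explicit there, because dataclasses do not validate Literal annotations on their own and frozen=True only rejects attribute assignment after construction.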
The discovery skill returns risk_tier and stub in the JSON envelope of list_skills, search_skills, and load_skill_instructions so a compliance pipeline can enumerate the catalog by tier without importing the skill modules.
A registry-completeness test asserts every shipped skill has risk_tier set explicitly (no relying on the default in production code, only in third-party extensions).

5. Pros and Cons of the Options

Option 1: Single `risk_level` field

Pros: One concept, one field.
Cons: Conflates regulatory risk with stub-ness; breaks the "stubs default to hidden" rule (you would need a magic value like "stub-high"); makes future axes (e.g. multi-region availability) require yet another magic value.

Option 2 (chosen): Two orthogonal fields

Pros: Each field has a distinct audience and lifecycle. Compliance reads risk_tier; production ops reads stub. Defaults are safe ("minimal" + False). Searchable via the existing registry helpers.
Cons: Two fields to remember. Mitigated by the bootstrap() test that asserts both are present on every shipped descriptor.

Option 3: Out-of-band YAML

Pros: Compliance team can edit without touching code.
Cons: Two sources of truth; drift inevitable; type checking impossible.

Option 4: Skill-side decorators

Pros: Co-located with the skill class.
Cons: Doubles the surface (decorator + descriptor); the registry already exists as the single source of truth. Decorators on a class do not survive serialization to the discovery JSON envelope without parallel logic in SkillDiscoverySkill.

6. Consequences

Positive Consequences:
Compliance teams can grep risk_tier once, not import-walk the catalog quarterly.
find-skills hides stubs from production routers automatically. The two stubs that ship today (weather-api, database-operations) are no longer auto-discoverable; admin tooling sets include_stubs=True explicitly.
Future telemetry can label spans by risk tier so dashboards split high-risk skill executions out of the default mean-latency view.
Negative Consequences / Trade-offs:
Every new skill must explicitly choose a tier. PRs adding a skill without setting risk_tier rely on the default; we accept that as a code-review concern rather than a compile-time error so prototypes can ship faster.
Risks & Mitigations:
Risk: a downstream client overrides risk_tier after registration, masking a high-tier skill. Mitigation: the descriptor is a frozen dataclass — re-registration replaces wholesale, no partial mutation.
Risk: the stub field becomes a dumping ground for "things we have not gotten around to". Mitigation: documentation guidance in docs/guides/adding-skills.md explicitly limits stub=True to placeholder implementations destined to be replaced; ADR-0001 (raw vs standard) remains the security axis.

7. Implementation Plan & Status Updates

Target Milestone/Release: v0.2.0 (current).
Implementation Notes:
2026-05-06: SkillDescriptor extended; every shipped descriptor annotated; discovery skill updated to surface both fields and honour include_stubs. Tests updated.

mirai_shared_skills/_registry.py — SkillDescriptor, all_descriptors, find.
mirai_shared_skills/discovery/skill.py — SkillDiscoverySkill with include_stubs plumbing.
ADR-0001: Standard vs Raw Skill Categorization — orthogonal runtime safety axis.
ADR-0004: In-Process Skill Descriptor Registry — registry that holds the new fields.
EU AI Act Article 16 — risk-tier vocabulary.