ADR-0009: Skill Metadata for Compliance and Production Routing

  • Date: 2026-05-06
  • Authors: Matteo Rizzo
  • Status: Accepted
  • Approval State: Approved (Approved by: Matteo Rizzo on 2026-05-06)
  • Implementation State: Completed

1. Context and Problem Statement

SkillDescriptor (see ADR-0004) shipped with seven fields: name, description, instructions, category, import_path, tools, references. That schema was sufficient when the catalog was a handful of demo skills and downstream clients composed the bundle by hand. Two pressures changed the requirement:

  1. EU AI Act Article 16 (August 2026 deadline) makes downstream operators responsible for declaring the risk profile of every "AI system component" they ship. Operators need a structured way to enumerate which catalog skills carry destructive or data-egress risk versus pure read-only metadata access. Scanning import paths or grepping docstrings is not a defensible audit position.
  2. Stub skills shipped to production by accident. weather-api and database-operations are deliberately mock implementations during the development phase — weather-api returns a hardcoded "22 °C and sunny" string regardless of input, and database-operations returns no-op JSON envelopes for every mutation. When a downstream agent's SemanticRouter happens to pick them, the user gets confidently wrong answers. There was no first-class way to tell the catalog "this exists, but do not route to it in production".

category (standard vs raw per ADR-0001) addresses neither concern: standard/raw distinguishes "safe-by-default" from "must-be-wrapped-in-SecureSkill", which is a runtime safety property, not a regulatory risk tier or a production-readiness flag. A skill can be standard (read-only) and still need to be excluded from production routing because it is a stub; a standard skill can also carry a non-minimal regulatory tier (agent-browser is standard but limited because it makes outbound HTTP calls). A raw (destructive) skill will often be high-tier as well, but neither field determines the other: they are orthogonal axes.

2. Decision Drivers (Forces)

  • Regulatory forward-compat. The EU AI Act risk tiers (minimal / limited / high) are the lingua franca downstream compliance teams already use. Aligning the catalog field with those terms means a future attestation report can read directly from find-skills output.
  • Single source of truth. The risk classification belongs to the catalog, not to every downstream client's local copy. Adding the field to SkillDescriptor keeps it versioned alongside the skill itself.
  • No silent stubs in production. Stub-flagged skills must be invisible to the default find-skills discovery path; an operator who explicitly wants to inspect them should still be able to.
  • Backward compatibility. Existing downstream code that calls all_descriptors() or find() must continue to work without modification. New behaviour is opt-in via a keyword argument.
  • No new ADR axis. The existing category and the new risk_tier/stub fields cover three orthogonal concerns. We do not want to invent a fourth field next quarter — see Considered Options for rejected alternatives.

3. Considered Options

  1. Option 1: A single risk_level field that subsumes category, risk_tier, and stub-ness.
  2. Option 2: Two new orthogonal fields — risk_tier: Literal["minimal", "limited", "high"] and stub: bool — added to SkillDescriptor (chosen).
  3. Option 3: Out-of-band metadata file (e.g. risk_classifications.yaml shipped beside the catalog).
  4. Option 4: Skill-side decorator (@high_risk, @stub) that the registry inspects via attribute lookup.

4. Decision Outcome

Chosen option: Option 2 (two orthogonal fields on SkillDescriptor), because each concern has a distinct audience (compliance reviewers vs. production operators), and typed fields on the descriptor keep the catalog schema searchable, statically checkable, and less prone to drift between a skill's documented risk and its actual behaviour.

The SkillDescriptor dataclass gains:

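from dataclasses import dataclass
from typing import Literal

# Regulatory axis: mirrors the EU AI Act tier names.
RiskTier = Literal["minimal", "limited", "high"]

@dataclass(frozen=True, slots=True)
class SkillDescriptor:
    # ... existing fields ...
    risk_tier: RiskTier = "minimal"  # regulatory risk tier
    stub: bool = False               # True for placeholder implementations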

risk_tier defaults to "minimal", so newly added skills must opt in to a higher tier explicitly; the default never overstates a skill's risk, and the registry-completeness test in 4.1 keeps shipped skills from relying on it. stub defaults to False so a missing flag never accidentally hides a real skill.

Concrete annotations on the catalog at acceptance time:

| Skill                | category | risk_tier | stub  | Rationale                               |
|----------------------|----------|-----------|-------|-----------------------------------------|
| find-skills          | standard | minimal   | false | Pure metadata read                      |
| authentication-gates | standard | minimal   | false | Frozen schema, signal-only              |
| pdf-extraction       | standard | minimal   | false | Local file read; no egress              |
| agent-browser        | standard | limited   | false | Outbound HTTP; data egress surface      |
| agentic-rag          | standard | limited   | false | Touches enterprise indexes + public web |
| execution-debugging  | raw      | high      | false | Arbitrary command execution (sandboxed) |
| database-operations  | raw      | high      | true  | Stub mutations; production must replace |
| weather-api          | standard | minimal   | true  | Mock implementation                     |

all_descriptors(), find(), and the discovery skill's list_skills / search_skills tools all accept an include_stubs: bool keyword argument. It defaults to True for all_descriptors() and find(), which preserves backward compatibility for admin tooling, and to False for the agent-facing discovery tools, so a production router never picks a stub.
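
The split defaults could look roughly like the sketch below. It assumes a plain dict registry (_REGISTRY is a hypothetical name) and a keyword-only include_stubs parameter; the real helpers take more arguments and return richer results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillDescriptor:
    # Reduced for illustration to the fields the filter needs.
    name: str
    stub: bool = False


# Hypothetical in-memory registry keyed by skill name.
_REGISTRY = {
    d.name: d
    for d in (
        SkillDescriptor("find-skills"),
        SkillDescriptor("pdf-extraction"),
        SkillDescriptor("weather-api", stub=True),
        SkillDescriptor("database-operations", stub=True),
    )
}


def all_descriptors(*, include_stubs: bool = True) -> list[SkillDescriptor]:
    """Admin-facing helper: stubs stay visible unless the caller opts out."""
    return [d for d in _REGISTRY.values() if include_stubs or not d.stub]


def list_skills(*, include_stubs: bool = False) -> list[str]:
    """Agent-facing discovery tool: stubs are hidden unless requested."""
    return [d.name for d in all_descriptors(include_stubs=include_stubs)]


print(list_skills())                    # ['find-skills', 'pdf-extraction']
print(len(all_descriptors()))           # 4
print(list_skills(include_stubs=True))  # all four names, stubs included
```

Existing callers of all_descriptors() see no change, while anything routed through the discovery tool has to ask for stubs by name.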

4.1. Validation / Compliance

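  • risk_tier is a Literal, so an unknown tier name is a type error caught by mypy. Dataclasses do not enforce annotations at runtime, so runtime rejection relies on an explicit check in SkillDescriptor.__post_init__; frozen=True separately makes Python reject attribute assignment after construction.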
  • The discovery skill returns risk_tier and stub in the JSON envelope of list_skills, search_skills, and load_skill_instructions so a compliance pipeline can enumerate the catalog by tier without importing the skill modules.
  • A registry-completeness test asserts every shipped skill has risk_tier set explicitly (no relying on the default in production code, only in third-party extensions).

5. Pros and Cons of the Options

Option 1: Single risk_level field

  • Pros: One concept, one field.
  • Cons: Conflates regulatory risk with stub-ness; breaks the "stubs default to hidden" rule (you would need a magic value like "stub-high"); makes future axes (e.g. multi-region availability) require yet another magic value.

Option 2 (chosen): Two orthogonal fields

  • Pros: Each field has a distinct audience and lifecycle. Compliance reads risk_tier; production ops reads stub. Defaults are safe ("minimal" + False). Searchable via the existing registry helpers.
  • Cons: Two fields to remember. Mitigated by the bootstrap() test that asserts both are present on every shipped descriptor.

Option 3: Out-of-band YAML

  • Pros: Compliance team can edit without touching code.
  • Cons: Two sources of truth; drift inevitable; type checking impossible.

Option 4: Skill-side decorators

  • Pros: Co-located with the skill class.
  • Cons: Doubles the surface (decorator + descriptor); the registry already exists as the single source of truth. Decorators on a class do not survive serialization to the discovery JSON envelope without parallel logic in SkillDiscoverySkill.

6. Consequences

  • Positive Consequences:
      • Compliance teams can grep risk_tier once, not import-walk the catalog quarterly.
      • find-skills hides stubs from production routers automatically. The two stubs that ship today (weather-api, database-operations) are no longer auto-discoverable; admin tooling sets include_stubs=True explicitly.
      • Future telemetry can label spans by risk tier so dashboards split high-risk skill executions out of the default mean-latency view.
  • Negative Consequences / Trade-offs:
      • Every new skill is expected to choose a tier explicitly, but nothing forces it at definition time: a PR that adds a skill without setting risk_tier falls back to the default. We accept that as a code-review concern rather than a compile-time error so prototypes can ship faster.
  • Risks & Mitigations:
      • Risk: a downstream client overrides risk_tier after registration, masking a high-tier skill. Mitigation: the descriptor is a frozen dataclass, so it cannot be partially mutated in place; a change requires re-registration, which replaces the descriptor wholesale.
      • Risk: the stub field becomes a dumping ground for "things we have not gotten around to". Mitigation: documentation guidance in docs/guides/adding-skills.md explicitly limits stub=True to placeholder implementations destined to be replaced; ADR-0001 (raw vs standard) remains the security axis.

7. Implementation Plan & Status Updates

  • Target Milestone/Release: v0.2.0 (current).
  • Implementation Notes:
      • 2026-05-06: SkillDescriptor extended; every shipped descriptor annotated; discovery skill updated to surface both fields and honour include_stubs. Tests updated.