Skill Loading
Skill Loading checks prove which skill entered an agent run.
They exist because task success alone does not prove the agent used the expected skill. Prove the skill first, then check the task result.
The simple model is:
- A skill is made available to an agent.
- The agent runs.
- Agent Skill Evals checks the evidence.
There are two practical ways to make a skill available:
- native: the agent has its own skill mechanism.
- mcp: the skill is served through MCP.
The examples cover three paths:
- Claude Code over HTTP MCP.
- Codex over HTTP MCP.
- Pi native skills with --no-skills --skill.
Agent Skill Evals should not guess what the model was thinking. It should check what the run can prove.
What To Prove
Before checking task success, prove the context:
- The expected skill was loaded into the run.
- Nearby or unrelated skills were not loaded into the run.
This keeps the test grounded in observable evidence instead of guesswork.
Test Shape
Use skill.loaded inside should.
```yaml
should:
  - skill.loaded:
      should_include:
        - brand-deck
      should_exclude:
        - bugfix-workflow
```
That is the normal test shape. Tool and server details stay in evidence.json for debugging, but the test can stay focused on the skill name.
In a full skill test, keep the same skill.loaded check and add the normal task checks after it: verifier commands, file checks, tool checks, or whatever proves the work was done.
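To make the include/exclude semantics concrete, here is a minimal sketch of how a skill.loaded check could be evaluated against the skill names a run has proven. The `SkillLoadedCheck` and `CheckResult` types and the `checkSkillLoaded` function are hypothetical names used for illustration, not the Agent Skill Evals API.

```typescript
// Hypothetical shapes for illustration; not the library's real types.
interface SkillLoadedCheck {
  should_include?: string[];
  should_exclude?: string[];
}

interface CheckResult {
  pass: boolean;
  missing: string[]; // expected skills with no loading evidence
  unexpected: string[]; // excluded skills that were loaded anyway
}

// Compare the skill names proven by the run against the check.
function checkSkillLoaded(loaded: string[], check: SkillLoadedCheck): CheckResult {
  const seen = new Set<string>(loaded);
  const missing = (check.should_include ?? []).filter((name) => !seen.has(name));
  const unexpected = (check.should_exclude ?? []).filter((name) => seen.has(name));
  return {
    pass: missing.length === 0 && unexpected.length === 0,
    missing,
    unexpected,
  };
}

const result = checkSkillLoaded(["brand-deck"], {
  should_include: ["brand-deck"],
  should_exclude: ["bugfix-workflow"],
});
console.log(result.pass); // true
```

Returning the missing and unexpected names separately, rather than a bare boolean, keeps a failure explainable without reading the raw evidence.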
MCP Skill Loading
The MCP examples start a local server, make the example skills available, run the agent, and check skill.loaded.
```bash
pnpm --filter @agent-skill-evals/examples mcp:setup
pnpm run eval:mcp
```
Agent Skill Evals can turn MCP skill-loader tool calls and MCP resource reads into loaded-skill evidence. For example, if the run records load_brand_deck_skill, Agent Skill Evals records brand-deck as loaded.
The raw tool call still appears in toolCalls, so failures are inspectable in evidence.json, but the assertion remains skill.loaded.
Most users do not need custom mapping. If your setup records skill loading with different tool names or resource URLs, use the skillEvidence config shown in Runtime Checks.
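The tool-call mapping described above can be sketched as follows. This is an illustration only: it assumes loader tools are named load_<skill>_skill, with underscores standing in for the hyphens of the skill name, which matches the single example given here but may not be the rule Agent Skill Evals actually applies. `ToolCall`, `LOADER_PATTERN`, and `skillsFromToolCalls` are hypothetical names.

```typescript
// Hypothetical record shape; real toolCalls entries may carry more fields.
interface ToolCall {
  name: string;
}

// Assumed naming convention: load_<skill_name>_skill.
const LOADER_PATTERN = /^load_([a-z0-9_]+)_skill$/;

// Collect the distinct skill names implied by loader tool calls.
function skillsFromToolCalls(toolCalls: ToolCall[]): string[] {
  const skills = new Set<string>();
  for (const call of toolCalls) {
    const captured = LOADER_PATTERN.exec(call.name)?.[1];
    if (captured) {
      // Assumption: underscores in the tool name map to hyphens in the skill name.
      skills.add(captured.replace(/_/g, "-"));
    }
  }
  return Array.from(skills);
}

const loaded = skillsFromToolCalls([
  { name: "load_brand_deck_skill" },
  { name: "read_file" },
]);
console.log(loaded); // [ 'brand-deck' ]
```

Tool calls that do not match the pattern are ignored rather than guessed at, which keeps the evidence limited to what the run can prove.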
Native Skill Loading
Native skill loading should produce the same kind of evidence:
```json
{"skill":"brand-deck","delivery":"native","provider":"pi-json","source":"--skill","startedAt":1760000000000}
```
Only record native skill evidence when the run can prove it. Do not treat the test's expected skill name as proof that the skill loaded.
Pi has a deterministic native shape:
```yaml
args:
  - --mode
  - json
  - --no-skills
  - --skill
  - ./skills/brand-deck
```
When --no-skills is present, Agent Skill Evals can record each --skill path as skill.loaded evidence because unrelated native skills were disabled by the same invocation. Run the example with:
```bash
pnpm run eval:native:pi
```
Run every routing example with:
```bash
pnpm run eval:routing
```
Use MCP for Codex and Claude Code routing tests unless their native CLIs expose a way to load exactly the skills under test.
Native argument mapping is also configurable for custom agents through skillEvidence.
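The Pi argument rule described above can be sketched as a small function that turns CLI arguments into evidence records shaped like the JSON example. It is a sketch under assumptions, not the actual implementation: `SkillEvidence` and `nativeEvidenceFromArgs` are hypothetical names, and taking the last path segment as the skill name is an assumption inferred from ./skills/brand-deck mapping to brand-deck.

```typescript
// Hypothetical evidence shape, mirroring the JSON record shown earlier.
interface SkillEvidence {
  skill: string;
  delivery: "native" | "mcp";
  provider: string;
  source: string;
  startedAt: number;
}

// Derive evidence from CLI args, but only when --no-skills proves that
// no other native skills could have been loaded by the same invocation.
function nativeEvidenceFromArgs(
  args: string[],
  provider: string,
  startedAt: number,
): SkillEvidence[] {
  if (args.indexOf("--no-skills") === -1) {
    return []; // exclusivity cannot be proven, so record nothing
  }
  const evidence: SkillEvidence[] = [];
  for (let i = 0; i < args.length - 1; i++) {
    if (args[i] !== "--skill") continue;
    const path = args[i + 1];
    if (path === undefined) continue;
    // Assumption: the skill name is the last non-empty path segment.
    const skill = path.split("/").filter((part) => part.length > 0).pop();
    if (skill) {
      evidence.push({ skill, delivery: "native", provider, source: "--skill", startedAt });
    }
  }
  return evidence;
}

const piArgs = ["--mode", "json", "--no-skills", "--skill", "./skills/brand-deck"];
console.log(nativeEvidenceFromArgs(piArgs, "pi-json", 1760000000000));
```

Without --no-skills the function returns an empty list, reflecting the rule that the expected skill name alone is never treated as proof.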
The Loop
```text
Was the right skill loaded?
Were the wrong skills excluded?
Did the final task pass?
```
That is the skill-loading test: prove the skill, exclude the wrong skills, then check the task.
