Appearance
Metrics
Metrics are the names you put in Promptfoo assertions.
They answer: which check should this assertion run?
The metric value tells Agent Skill Evals which skill-specific check to run.
Most users start with two names:
skill.checks: check aSKILL.mdfile and its tests.skill.test: check the result after an agent run.
Use skill.budget when you also want to limit real-agent token usage:
skill.budget: check real-agent token usage after an agent run.
skill.test
Use this in agent tests.
yaml
assert:
- type: javascript
metric: skill.test
value: file://./agent-skill-evals/assertions.js
config:
metric: skill.testskill.test runs the checks listed in preconditions, should, and should_not.
Example:
yaml
should:
- verifier.succeeds:
run: ./verify_login_redirect.shskill.budget
Use this in real-agent tests when the run records token usage.
yaml
assert:
- type: javascript
metric: skill.budget
value: file://./agent-skill-evals/assertions.js
config:
metric: skill.budget
agentSkillEvals:
maxTotalTokens: 300000
maxCompletionTokens: 15000skill.budget fails closed when token usage is missing.
skill.checks
Use this to check a skill file and its tests before an agent runs.
yaml
assert:
- type: javascript
metric: skill.checks
value: file://./agent-skill-evals/assertions.js
config:
metric: skill.checksskill.checks runs the full set of skill checks.
Skill Loading
skill.activation checks whether a skill describes when it should be used. It does not prove that the skill was loaded into a real agent run.
Use the runtime check skill.loaded inside skill.test when the run records which skill was loaded. See Skill Context Checks.
Full Metric List
| Metric | What it checks |
|---|---|
skill.test | Checks preconditions, should, and should_not after an agent run. |
skill.budget | Checks real-agent token usage after an agent run. |
skill.checks | Runs all Skill Checks below. |
skill.activation | Checks whether the skill can be chosen at the right time. |
skill.budgets | Checks whether test files declare skill.budget when required. |
skill.context | Checks whether referenced files exist and the skill is not too large. |
skill.instructions | Checks whether risky work has safe instructions. |
skill.tests | Checks whether test files are valid and include needed negative tests. |
skill.verifiers | Checks whether sample projects and verifier scripts exist and can run. |
Default Settings
These settings tell Agent Skill Evals which check names need extra safety review:
yaml
agentSkillEvals:
maxSkillLines: 200
riskyEffects:
- file.changes_outside_scope
- tool.called
destructiveEffects:
- file.changes_outside_scope
- tool.calledThese settings mean:
maxSkillLines: warns when a skill file is too large.riskyEffects: requires a negative test when these checks are used.destructiveEffects: requires safe instructions and at least oneshould_notwhen these checks are used.
Most projects can keep these defaults.
For focused reports, replace skill.checks with one of the individual skill.* metric names.
