Skip to content

Agent Skill EvalsTest agent skills with Promptfoo.

Check the skill, run the agent on a copied sample project, and prove the result with evidence.

Who Agent Skill Evals Is For

Agent Skill Evals is for teams that write reusable skills for agents.

Use it when a skill can edit files, run commands, call tools, or make changes you want to check before trusting it.

Promptfoo is the test runner

Promptfoo is an open-source eval framework. Agent Skill Evals plugs into normal Promptfoo configs, so you keep running promptfoo eval and add skill-specific checks. Use the Promptfoo docs for Promptfoo's own config reference.

How It Works

Agent Skill Evals has two jobs:

  1. Check the skill before an agent runs.
  2. Check evidence after an agent runs.

That split exists because a bad skill test can make a bad skill look good, and an agent's final message is not proof that the right work happened.

The model is:

  1. Check the skill and its tests.
  2. Start with a known sample project.
  3. Ask the agent to do a realistic task.
  4. Record evidence: changed files, tool calls, command results, output, and run details.
  5. Assert what must happen and what must not happen.

Agent Skill Evals runs that loop through Promptfoo. There is no separate runner to learn.

What A Test Looks Like

This example checks that an agent creates a PowerPoint deck and only changes the allowed files:

yaml
preconditions:
  - verifier.fails:
      run: ./verify_brand_deck.cjs
should:
  - verifier.succeeds:
      run: ./verify_brand_deck.cjs
  - file.created:
      path: launch-deck.pptx
  - file.created:
      path: deck.js
should_not:
  - file.changes_outside_scope:
      scope:
        - deck.js
        - launch-deck.pptx

Start with Getting Started, then read Core Concepts.

Use Set Up Tests For An Existing Skill when you already have a skill and want an agent to set up the Promptfoo configs, agent tests, verifier scripts, and evidence checks for you.

Use Runtime Checks, Skill Loading, Metrics, Package Reference, and the Promptfoo docs as reference pages.