Add skill validator CLI improvements and create-skill-evaluation meta-skill #376

Open
javiercn wants to merge 3 commits into dotnet:main from javiercn:javiercn/skill-validator-improvements

Conversation


@javiercn javiercn commented Mar 16, 2026

Summary

Three improvements to the skill development workflow:

  1. Skill validator CLI enhancements — five new flags for faster iteration and easier debugging
  2. --environment scenario filtering — control which scenarios run in different environments (CI, local, daily)
  3. create-skill-evaluation meta-skill — teaches agents to write high-quality eval.yaml files

Skill validator CLI improvements

--environment <name>

Filter scenarios by environment. Scenarios in eval.yaml can declare an environment field. When --environment is passed, only scenarios whose environment matches will run; scenarios with no environment set always run, regardless of the flag.

```yaml
scenarios:
  - name: Quick smoke test        # runs everywhere (no environment set)
    prompt: ...
  - name: Full integration test
    environment: local             # skipped with --environment ci
    prompt: ...
  - name: Expensive stress test
    environment: daily             # only runs with --environment daily
    prompt: ...
```

```shell
# CI: skip local-only and daily scenarios
skill-validator --environment ci path/to/skill

# Local development: run everything (no --environment flag)
skill-validator path/to/skill
```

This enables limiting expensive scenarios on CI while keeping them available for local development. Future use: daily or weekly environments for periodic expensive evaluations.
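The filtering rule is small enough to sketch directly. This is an illustrative Python rendering of the semantics described above; the real logic lives in the C# ValidateCommand.cs and may differ in details:

```python
def should_run(scenario_env, requested_env):
    """Environment filter: a scenario with no environment always runs;
    otherwise it runs only when its environment matches the requested one.
    With no --environment flag (requested_env=None), everything runs."""
    if scenario_env is None or requested_env is None:
        return True
    return scenario_env == requested_env
```

So `--environment ci` skips `local` and `daily` scenarios but keeps environment-less smoke tests, matching the CI example above.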

--scenario <name> (repeatable)

Filter scenarios by name substring (case-insensitive). Run a subset of scenarios without modifying the eval.yaml:

```shell
dotnet run -- path/to/skill --tests-dir tests/plugin --scenario "Kanban" --scenario "catalog"
# Runs only matching scenarios, skips the rest
```

Useful for iterating on a specific failing scenario without waiting for all 5 to complete.
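The matching behavior (case-insensitive substring, repeatable flag, with a scenario running if it matches any filter) amounts to the following; this is an illustrative Python sketch, not the C# implementation:

```python
def scenario_matches(name, filters):
    """--scenario filter: with no filters every scenario runs; otherwise a
    scenario runs if its name contains any filter substring, ignoring case."""
    if not filters:
        return True
    lowered = name.lower()
    return any(f.lower() in lowered for f in filters)
```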

--keep-work-dirs

Preserve temporary working directories after the run instead of cleaning them up. Enables post-run inspection of what the baseline vs skilled agent actually produced — essential for understanding why a scenario scored low.

--work-dir <path>

Set a custom base directory for working directories instead of the system temp folder. Keeps artifacts organized and easy to find:

```shell
dotnet run -- path/to/skill --keep-work-dirs --work-dir artifacts/work-dirs
```

--readable-work-dirs

Use human-readable directory names instead of GUIDs:

```text
artifacts/work-dirs/2026-03-16_11-42-12/
  kanban-board/
    run-1/
      baseline/
      isolated/
      plugin/
  e-commerce-catalog/
    run-1/
      baseline/
      isolated/
      plugin/
```

Instead of sv-a1b2c3d4... GUIDs scattered in temp.
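A plausible sketch of the naming scheme in Python; the actual Slugify() in AgentRunner.cs may behave differently, so the normalization rules here are assumptions:

```python
import re

def slugify(name):
    # Lowercase, collapse runs of non-alphanumeric characters to single
    # hyphens, and trim leading/trailing hyphens (assumed behavior).
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")

def readable_work_dir(base, timestamp, scenario, run, variant):
    # Layout: <base>/<timestamp>/<scenario-slug>/run-N/<variant>
    return f"{base}/{timestamp}/{slugify(scenario)}/run-{run}/{variant}"
```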

Files changed

  • eng/skill-validator/src/Commands/ValidateCommand.cs — option definitions, wiring, scenario filtering logic, environment filtering
  • eng/skill-validator/src/Models/Models.cs — KeepWorkDirs, WorkDirBase, ReadableWorkDirs, ScenarioFilters, Environment config properties; Environment field on EvalScenario
  • eng/skill-validator/src/Services/AgentRunner.cs — readable directory naming, Slugify(), SetWorkDirBase(), SetReadableWorkDirs(), WorkDirCount, variant labeling (baseline/isolated/plugin/noise-all)
  • eng/skill-validator/src/Services/EvalSchema.cs — Environment field on RawScenario, parsing into EvalScenario

create-skill-evaluation meta-skill

A skill at .agents/skills/create-skill-evaluation/SKILL.md that teaches agents how to write eval.yaml files for the skill validator. Covers:

  • Eval design principles: goal-oriented prompts (describe what, not how), non-overlapping domains with SKILL.md examples, baseline calibration
  • eval.yaml structure: setup approaches (commands, copy_test_files, inline files), YAML anchors, assertion types, rubric item format ("Does [correct thing] — not [common mistake]")
  • Rubric guidelines: 10–13 items per scenario, one behavior per item, name specific APIs, include anti-pattern phrasing
  • Iteration workflow: fast 1-run feedback, baseline score analysis, structural diversity comparison, judge reasoning inspection
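A minimal eval.yaml fragment putting those conventions together. Only name, environment, prompt, and the rubric phrasing pattern are taken from this PR; the exact setup and assertion field names are assumptions for illustration:

```yaml
scenarios:
  - name: Kanban board
    environment: ci
    # Goal-oriented prompt: describe what to build, not how to build it
    prompt: >
      Build a Kanban board page with draggable cards across three columns.
    rubric:
      # One behavior per item, naming the specific API and the common mistake
      - "Does use the drag-and-drop API — not hand-rolled mouse-event wiring"
      - "Does persist column state on drop — not only in-memory reordering"
```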

…s to skill validator

--scenario <substring>: filter scenarios by name substring (case-insensitive,
repeatable). Useful for iterating on a specific scenario without running the
full eval suite.

--keep-work-dirs: preserve temporary working directories after the run instead
of cleaning them up. Enables post-run inspection of baseline vs skilled output.

--work-dir <path>: set a custom base directory for working directories instead
of the system temp folder. Helps keep artifacts organized.

--readable-work-dirs: use human-readable directory names
(<timestamp>/<scenario-slug>/run-N/<variant>) instead of GUIDs. Makes it easy
to find and compare baseline/isolated/plugin outputs.

A skill that teaches agents how to write eval.yaml files for the skill
validator. Covers eval design principles (goal-oriented prompts, non-overlapping
domains, baseline calibration), eval.yaml structure (setup, assertions, rubric
items), and an iteration workflow for fast feedback and result analysis.
Copilot AI review requested due to automatic review settings March 16, 2026 10:57
@github-actions
Contributor

Note

This PR is from a fork and modifies infrastructure files (eng/ or .github/).

Changes to infrastructure typically need to be submitted from a branch in dotnet/skills (not a fork) so that CI workflows run with the correct permissions and secrets.

Please consider recreating this PR from an upstream branch. If you don't have push access to dotnet/skills, ask a maintainer to push your branch for you.

Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

```csharp
var noiseMaxScenarioDegradationOpt = new Option<double>("--noise-max-scenario-degradation") { Description = "Maximum acceptable quality degradation (0-1) for any single noise-test scenario", DefaultValueFactory = _ => 0.4 };
var keepWorkDirsOpt = new Option<bool>("--keep-work-dirs") { Description = "Preserve temporary working directories after the run (paths printed when --verbose is set)" };
var workDirBaseOpt = new Option<string?>("--work-dir") { Description = "Base directory for temporary working directories (defaults to system temp)" };
var readableWorkDirsOpt = new Option<bool>("--readable-work-dirs") { Description = "Use human-readable directory names (<scenario>/<run-N>/<variant>) instead of GUIDs" };
```
Member


The other options make sense but I don't see much value in this one. Is this for some kind of deterministic runs?

We intentionally chose a GUID not just for randomness of the dir name but because the LLM draws meaning from the folder name. So anything that is not a GUID impacts the decision tree.

Member Author


This is so that while you are iterating over the scenarios you can quickly and easily inspect the outputs per scenario. I think the LLM inferring things from the path is a bit of a stretch, and this is mainly for development purposes. On CI this flag is never passed, so the run still uses the temp folder and the hashed folder names.

Note that you could add an explicit instruction to not make assumptions based on the path.

Member


> Note that you could add an explicit instruction to not make assumptions based on the path.

We saw the path being inferred in validation runs; this isn't theoretical. You can give it any kind of instructions, but it can always decide to just ignore them. Avoiding encoding any meaning in the path is the best we can do here, even for dev purposes. The results between a local and a CI run shouldn't differ.

Member Author


It makes it super hard to see the actual output from the tool. I don't understand why this is such a big problem, given that it's not a default.

I'm not saying it's not a real thing; I'm suggesting that its impact is limited. During development people need to look at the output from skills, and they can still do a final run without the flag locally. But being able to easily review the different outputs, and to have the LLM separately compare the outputs against the skill contents (independent of the scores), is an important quality for being able to iterate quickly.

So, my point is:

  • It's not on by default
  • You can run without it during development and observe the delta.
  • The CI runs without it and won't have this potential bias built in.

I don't see an obvious drawback here, but without it, the development experience is not good. You either don't look at the actual outputs, or you spend a significant amount of time trying to dig them out.

Having the outputs be so complex to look at and compare during development encourages people not to review the actual outputs and to just take the judgement results, which in my experience are not reliable.

The judgement results help when they point out issues, but they don't help when the issues don't surface, and the only way to make the issues surface is to be able to look at the outputs.

Ultimately, people need to review the output from skills to ensure some level of quality in the actual results. People, not machines, are ultimately responsible for the code we ship, so we should create an environment where it's easy for them to do the right thing. After all, a doctor would not perform surgery just because a machine says there is an issue, nor would we ship features without actually testing them.

We are paid to ship working software, so we should have a process that makes it easy for people to do the right thing.

Scenarios in eval.yaml can now declare an 'environment' field (e.g. 'ci',
'local', 'daily'). When --environment is passed, only scenarios whose
environment matches (or that have no environment set) will run. Scenarios
without an environment always run regardless of the flag.

This enables limiting expensive scenarios on CI while keeping them
available for local development.
@ViktorHofer
Member

ViktorHofer commented Mar 23, 2026

@JanKrivanek would you mind taking a look when you have some time? These switches are definitely helpful. Just want another pair of eyes (especially on the eval.yaml schema changes).

