Add skill validator CLI improvements and create-skill-evaluation meta-skill #376

Open
javiercn wants to merge 3 commits into dotnet:main from javiercn:javiercn/skill-validator-improvements

Conversation


@javiercn javiercn commented Mar 16, 2026

Summary

Three improvements to the skill development workflow:

  1. Skill validator CLI enhancements — five new flags for faster iteration and easier debugging
  2. --environment scenario filtering — control which scenarios run in different environments (CI, local, daily)
  3. create-skill-evaluation meta-skill — teaches agents to write high-quality eval.yaml files

Skill validator CLI improvements

--environment <name>

Filter scenarios by environment. Scenarios in eval.yaml can declare an environment field. When --environment is passed, only scenarios whose environment matches will run; scenarios with no environment set always run, regardless of the flag.

```yaml
scenarios:
  - name: Quick smoke test        # runs everywhere (no environment set)
    prompt: ...
  - name: Full integration test
    environment: local             # skipped with --environment ci
    prompt: ...
  - name: Expensive stress test
    environment: daily             # only runs with --environment daily
    prompt: ...
```

```shell
# CI: skip local-only and daily scenarios
skill-validator --environment ci path/to/skill

# Local development: run everything (no --environment flag)
skill-validator path/to/skill
```

This enables limiting expensive scenarios on CI while keeping them available for local development. Future use: daily or weekly environments for periodic expensive evaluations.
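The filtering rule is small enough to sketch directly. This is an illustrative Python rendering of the semantics described above; the real logic lives in the C# ValidateCommand.cs and may differ in details:

```python
def should_run(scenario_env, requested_env):
    """Environment filter: a scenario with no environment always runs;
    otherwise it runs only when its environment matches the requested one.
    With no --environment flag (requested_env=None), everything runs."""
    if scenario_env is None or requested_env is None:
        return True
    return scenario_env == requested_env
```

So `--environment ci` skips `local` and `daily` scenarios but keeps environment-less smoke tests, matching the CI example above.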

--scenario <name> (repeatable)

Filter scenarios by name substring (case-insensitive). Run a subset of scenarios without modifying the eval.yaml:

```shell
dotnet run -- path/to/skill --tests-dir tests/plugin --scenario "Kanban" --scenario "catalog"
# Runs only matching scenarios, skips the rest
```

Useful for iterating on a specific failing scenario without waiting for all 5 to complete.
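The matching behavior (case-insensitive substring, repeatable flag, with a scenario running if it matches any filter) amounts to the following; this is an illustrative Python sketch, not the C# implementation:

```python
def scenario_matches(name, filters):
    """--scenario filter: with no filters every scenario runs; otherwise a
    scenario runs if its name contains any filter substring, ignoring case."""
    if not filters:
        return True
    lowered = name.lower()
    return any(f.lower() in lowered for f in filters)
```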

--keep-work-dirs

Preserve temporary working directories after the run instead of cleaning them up. Enables post-run inspection of what the baseline vs skilled agent actually produced — essential for understanding why a scenario scored low.

--work-dir <path>

Set a custom base directory for working directories instead of the system temp folder. Keeps artifacts organized and easy to find:

```shell
dotnet run -- path/to/skill --keep-work-dirs --work-dir artifacts/work-dirs
```

--readable-work-dirs

Use human-readable directory names instead of GUIDs:

```text
artifacts/work-dirs/2026-03-16_11-42-12/
  kanban-board/
    run-1/
      baseline/
      isolated/
      plugin/
  e-commerce-catalog/
    run-1/
      baseline/
      isolated/
      plugin/
```

Instead of sv-a1b2c3d4... GUIDs scattered in temp.
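A plausible sketch of the naming scheme in Python; the actual Slugify() in AgentRunner.cs may behave differently, so the normalization rules here are assumptions:

```python
import re

def slugify(name):
    # Lowercase, collapse runs of non-alphanumeric characters to single
    # hyphens, and trim leading/trailing hyphens (assumed behavior).
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")

def readable_work_dir(base, timestamp, scenario, run, variant):
    # Layout: <base>/<timestamp>/<scenario-slug>/run-N/<variant>
    return f"{base}/{timestamp}/{slugify(scenario)}/run-{run}/{variant}"
```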

Files changed

  • eng/skill-validator/src/Commands/ValidateCommand.cs — option definitions, wiring, scenario filtering logic, environment filtering
  • eng/skill-validator/src/Models/Models.cs — KeepWorkDirs, WorkDirBase, ReadableWorkDirs, ScenarioFilters, Environment config properties; Environment field on EvalScenario
  • eng/skill-validator/src/Services/AgentRunner.cs — readable directory naming, Slugify(), SetWorkDirBase(), SetReadableWorkDirs(), WorkDirCount, variant labeling (baseline/isolated/plugin/noise-all)
  • eng/skill-validator/src/Services/EvalSchema.cs — Environment field on RawScenario, parsing into EvalScenario

create-skill-evaluation meta-skill

A skill at .agents/skills/create-skill-evaluation/SKILL.md that teaches agents how to write eval.yaml files for the skill validator. Covers:

  • Eval design principles: goal-oriented prompts (describe what, not how), non-overlapping domains with SKILL.md examples, baseline calibration
  • eval.yaml structure: setup approaches (commands, copy_test_files, inline files), YAML anchors, assertion types, rubric item format ("Does [correct thing] — not [common mistake]")
  • Rubric guidelines: 10–13 items per scenario, one behavior per item, name specific APIs, include anti-pattern phrasing
  • Iteration workflow: fast 1-run feedback, baseline score analysis, structural diversity comparison, judge reasoning inspection
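A minimal eval.yaml fragment putting those conventions together. Only name, environment, prompt, and the rubric phrasing pattern are taken from this PR; the exact setup and assertion field names are assumptions for illustration:

```yaml
scenarios:
  - name: Kanban board
    environment: ci
    # Goal-oriented prompt: describe what to build, not how to build it
    prompt: >
      Build a Kanban board page with draggable cards across three columns.
    rubric:
      # One behavior per item, naming the specific API and the common mistake
      - "Does use the drag-and-drop API — not hand-rolled mouse-event wiring"
      - "Does persist column state on drop — not only in-memory reordering"
```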

…s to skill validator

--scenario <substring>: filter scenarios by name substring (case-insensitive,
repeatable). Useful for iterating on a specific scenario without running the
full eval suite.

--keep-work-dirs: preserve temporary working directories after the run instead
of cleaning them up. Enables post-run inspection of baseline vs skilled output.

--work-dir <path>: set a custom base directory for working directories instead
of the system temp folder. Helps keep artifacts organized.

--readable-work-dirs: use human-readable directory names
(<timestamp>/<scenario-slug>/run-N/<variant>) instead of GUIDs. Makes it easy
to find and compare baseline/isolated/plugin outputs.

A skill that teaches agents how to write eval.yaml files for the skill
validator. Covers eval design principles (goal-oriented prompts, non-overlapping
domains, baseline calibration), eval.yaml structure (setup, assertions, rubric
items), and an iteration workflow for fast feedback and result analysis.
Copilot AI review requested due to automatic review settings March 16, 2026 10:57
@github-actions
Contributor

Note

This PR is from a fork and modifies infrastructure files (eng/ or .github/).

Changes to infrastructure typically need to be submitted from a branch in dotnet/skills (not a fork) so that CI workflows run with the correct permissions and secrets.

Please consider recreating this PR from an upstream branch. If you don't have push access to dotnet/skills, ask a maintainer to push your branch for you.

Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

```csharp
var noiseMaxScenarioDegradationOpt = new Option<double>("--noise-max-scenario-degradation") { Description = "Maximum acceptable quality degradation (0-1) for any single noise-test scenario", DefaultValueFactory = _ => 0.4 };
var keepWorkDirsOpt = new Option<bool>("--keep-work-dirs") { Description = "Preserve temporary working directories after the run (paths printed when --verbose is set)" };
var workDirBaseOpt = new Option<string?>("--work-dir") { Description = "Base directory for temporary working directories (defaults to system temp)" };
var readableWorkDirsOpt = new Option<bool>("--readable-work-dirs") { Description = "Use human-readable directory names (<scenario>/<run-N>/<variant>) instead of GUIDs" };
```
Member


The other options make sense but I don't see much value in this one. Is this for some kind of deterministic runs?

We intentionally chose a GUID not just for randomness of the dir name but because the LLM draws meaning from the folder name. So anything that is not a GUID impacts the decision tree.

Member Author


This is so that while you are iterating over the scenarios you can quickly and easily inspect the outputs per scenario. I think the LLM inferring things from the path is a bit of a stretch, and this is mainly for development purposes. On CI this flag is never passed, so the run still uses the temp folder and the hashed folder names.

Note that you could add an explicit instruction to not make assumptions based on the path.

Member


> Note that you could add an explicit instruction to not make assumptions based on the path.

We saw the path being inferred in validation runs; this isn't theoretical. You can give it any kind of instructions, but it can always decide to just ignore them. Avoiding encoding any meaning in the path is the best we can do here, even for dev purposes. The results between a local and a CI run shouldn't differ.

Member Author


It makes it super hard to see the actual output from the tool. I don't understand why this is such a big problem, given that it's not a default.

I'm not saying it's not a real thing; I'm suggesting that its impact is limited. During development people need to look at the output from skills, and they can still do a final run without the flag locally. But being able to easily review the different outputs, and to have the LLM separately compare the outputs against the skill contents (independent of the scores), is an important quality for being able to iterate quickly.

So, my point is:

  • It's not on by default
  • You can run without it during development and observe the delta.
  • The CI runs without it and won't have this potential bias built in.

I don't see an obvious drawback here, but without it, the development experience is not good. You either don't look at the actual outputs, or you spend a significant amount of time trying to dig them out.

Having the outputs be so complex to look at and compare during development encourages people not to review the actual outputs and to just take the judgement results, which in my experience are not reliable.

The judgement results help when they point out issues, but they don't help when the issues don't surface, and the only way to make the issues surface is to be able to look at the outputs.

Ultimately, people need to review the output from skills to ensure some level of quality in the actual results. People, not machines, are ultimately responsible for the code we ship, so we should create an environment where it's easy for them to do the right thing. After all, a doctor would not perform surgery just because a machine says there is an issue, nor would we ship features without actually testing them.

We are paid to ship working software, so we should have a process that makes it easy for people to do the right thing.

Scenarios in eval.yaml can now declare an 'environment' field (e.g. 'ci',
'local', 'daily'). When --environment is passed, only scenarios whose
environment matches (or that have no environment set) will run. Scenarios
without an environment always run regardless of the flag.

This enables limiting expensive scenarios on CI while keeping them
available for local development.
@ViktorHofer
Member

ViktorHofer commented Mar 23, 2026

@JanKrivanek would you mind taking a look when you have some time? These switches are definitely helpful. Just want another pair of eyes (especially on the eval.yaml schema changes).

