Add skill validator CLI improvements and create-skill-evaluation meta-skill #376
javiercn wants to merge 3 commits into dotnet:main
Conversation
Adds four options to the skill validator:

- `--scenario <substring>`: filter scenarios by name substring (case-insensitive, repeatable). Useful for iterating on a specific scenario without running the full eval suite.
- `--keep-work-dirs`: preserve temporary working directories after the run instead of cleaning them up. Enables post-run inspection of baseline vs. skilled output.
- `--work-dir <path>`: set a custom base directory for working directories instead of the system temp folder. Helps keep artifacts organized.
- `--readable-work-dirs`: use human-readable directory names (`<timestamp>/<scenario-slug>/run-N/<variant>`) instead of GUIDs. Makes it easy to find and compare baseline/isolated/plugin outputs.
A skill that teaches agents how to write eval.yaml files for the skill validator. Covers eval design principles (goal-oriented prompts, non-overlapping domains, baseline calibration), eval.yaml structure (setup, assertions, rubric items), and an iteration workflow for fast feedback and result analysis.
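As a rough sketch of what such an eval.yaml might look like (the field names below are assumptions for illustration, not the validator's actual schema):

```yaml
# Hypothetical eval.yaml sketch — field names are illustrative only.
scenarios:
  - name: csv-import
    environment: local        # optional; scenario always runs when omitted
    setup:
      files:
        - path: data/input.csv
          content: "id,name\n1,alice\n"
    prompt: >
      Import the CSV file and report how many rows were loaded.
    assertions:
      - output_contains: "1 row"
    rubric:
      - "States the row count explicitly"
      - "Does not invent columns that are not in the file"
```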
Note: This PR is from a fork and modifies infrastructure files. Changes to infrastructure typically need to be submitted from a branch in the upstream repository. Please consider recreating this PR from an upstream branch.
```csharp
var noiseMaxScenarioDegradationOpt = new Option<double>("--noise-max-scenario-degradation")
{
    Description = "Maximum acceptable quality degradation (0-1) for any single noise-test scenario",
    DefaultValueFactory = _ => 0.4
};
var keepWorkDirsOpt = new Option<bool>("--keep-work-dirs")
{
    Description = "Preserve temporary working directories after the run (paths printed when --verbose is set)"
};
var workDirBaseOpt = new Option<string?>("--work-dir")
{
    Description = "Base directory for temporary working directories (defaults to system temp)"
};
var readableWorkDirsOpt = new Option<bool>("--readable-work-dirs")
{
    Description = "Use human-readable directory names (<scenario>/<run-N>/<variant>) instead of GUIDs"
};
```
The other options make sense, but I don't see much value in this one. Is this for some kind of deterministic run?
We intentionally chose a GUID not just for randomness of the dir name but because the LLM draws meaning from the folder name. So anything that is not a GUID impacts the decision tree.
This is so that, while you are iterating over the scenarios, you can quickly and easily inspect the outputs per scenario. I think the LLM inferring things from the path is a bit of a stretch, and this is mainly for development purposes. When running on CI this flag is never passed, so CI still uses the temp folder and the hashed folder names.
Note that you could add an explicit instruction to not make assumptions based on the path.
> Note that you could add an explicit instruction to not make assumptions based on the path.
We saw the model infer from the path in validation runs; this isn't theoretical. You can give it any kind of instruction, but it can always decide to just ignore it. Not encoding any meaning in the path is the best we can do here, even for dev purposes. The results between a local and a CI run shouldn't differ.
It makes it super hard to see the actual output from the tool. I don't understand why this is such a big problem, given that it's not the default.

I'm not saying it's not a real thing; I'm suggesting that its impact is limited. During development people need to look at the output from skills, and they can still run without the flag at the end locally. But being able to easily review the different outputs, and to have the LLM separately compare the outputs against the skill contents (independently of the scores), is important for being able to iterate quickly.
So, my point is:
- It's not on by default
- You can run without it during development and observe the delta.
- The CI runs without it and won't have this potential bias built in.
I don't see an obvious drawback here, but without it the development experience is not good: you either don't look at the actual outputs, or you spend a significant amount of time digging them out.

Having the outputs be so hard to find and compare during development encourages people to skip reviewing the actual outputs and just take the judgement results, which in my experience are not reliable.

The judgement results help when they point out issues, but they don't help when issues fail to surface, and the only way to make issues surface is to be able to look at the outputs.
Ultimately, people need to review the output from skills to ensure some level of quality in the actual results. People are ultimately responsible for the code we ship, not machines, so we should create an environment where it's easy for people to do the right thing. After all, a doctor would not perform surgery just because a machine says there is an issue, nor would we ship features without actually testing them.

We are paid to ship working software, so our process should make doing the right thing easy.
Scenarios in eval.yaml can now declare an 'environment' field (e.g. 'ci', 'local', 'daily'). When --environment is passed, only scenarios whose environment matches (or that have no environment set) will run. Scenarios without an environment always run regardless of the flag. This enables limiting expensive scenarios on CI while keeping them available for local development.
@JanKrivanek would you mind taking a look when you have some time? These switches are definitely helpful. Just want another pair of eyes (especially on the eval.yaml schema changes).
## Summary

Three improvements to the skill development workflow:

- `--environment` scenario filtering — control which scenarios run in different environments (CI, local, daily)
- `create-skill-evaluation` meta-skill — teaches agents to write high-quality eval.yaml files

### Skill validator CLI improvements
#### `--environment <name>`

Filter scenarios by environment. Scenarios in eval.yaml can declare an `environment` field. When `--environment` is passed, only scenarios whose environment matches (or that have no environment set) will run. Scenarios without an environment always run regardless.

This enables limiting expensive scenarios on CI while keeping them available for local development. Future use: `daily` or `weekly` environments for periodic expensive evaluations.

#### `--scenario <name>` (repeatable)

Filter scenarios by name substring (case-insensitive). Run a subset of scenarios without modifying the eval.yaml.

Useful for iterating on a specific failing scenario without waiting for all 5 to complete.
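The substring filter described above amounts to something like this (a Python sketch, not the actual C# code):

```python
def filter_scenarios(scenarios: list[str], name_filters: list[str]) -> list[str]:
    """Keep scenarios whose name contains any of the given substrings,
    case-insensitively; an empty filter list keeps everything."""
    if not name_filters:
        return list(scenarios)
    needles = [f.lower() for f in name_filters]
    return [s for s in scenarios if any(n in s.lower() for n in needles)]
```

Because the flag is repeatable, each occurrence adds another substring, and a scenario runs if it matches any of them.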
#### `--keep-work-dirs`

Preserve temporary working directories after the run instead of cleaning them up. Enables post-run inspection of what the baseline vs. skilled agent actually produced — essential for understanding why a scenario scored low.
#### `--work-dir <path>`

Set a custom base directory for working directories instead of the system temp folder. Keeps artifacts organized and easy to find.
#### `--readable-work-dirs`

Use human-readable directory names instead of `sv-a1b2c3d4...` GUIDs scattered in temp.

### Files changed
- `eng/skill-validator/src/Commands/ValidateCommand.cs` — option definitions, wiring, scenario filtering logic, environment filtering
- `eng/skill-validator/src/Models/Models.cs` — `KeepWorkDirs`, `WorkDirBase`, `ReadableWorkDirs`, `ScenarioFilters`, `Environment` config properties; `Environment` field on `EvalScenario`
- `eng/skill-validator/src/Services/AgentRunner.cs` — readable directory naming, `Slugify()`, `SetWorkDirBase()`, `SetReadableWorkDirs()`, `WorkDirCount`, variant labeling (baseline/isolated/plugin/noise-all)
- `eng/skill-validator/src/Services/EvalSchema.cs` — `Environment` field on `RawScenario`, parsing into `EvalScenario`

### `create-skill-evaluation` meta-skill

A skill at `.agents/skills/create-skill-evaluation/SKILL.md` that teaches agents how to write eval.yaml files for the skill validator. Covers: