spec-coding-skills demo benchmark

This project includes a small 3-prompt demo benchmark comparing:

  • baseline generic agent output
  • output guided by spec-coding-skills

Result summary

Eval                            Baseline    With skills
Planning an existing feature    33.3%       100.0%
Correcting a failing test       0.0%        100.0%
Saving a reusable root cause    0.0%        100.0%
Overall mean                    11.1%       100.0%
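
As a quick check on the summary row, the overall mean can be recomputed from the three per-eval scores. The snippet below is a minimal sketch: the `results` dictionary and variable names are illustrative and simply restate the table values; they are not taken from the project's code.

```python
from statistics import mean

# Per-eval scores in percent, copied from the result summary table.
# Each value is (baseline, with skills). The dict itself is illustrative.
results = {
    "Planning an existing feature": (33.3, 100.0),
    "Correcting a failing test": (0.0, 100.0),
    "Saving a reusable root cause": (0.0, 100.0),
}

# Unweighted mean across the three evals for each condition.
baseline_mean = mean([baseline for baseline, _ in results.values()])
skills_mean = mean([skills for _, skills in results.values()])

print(f"Baseline overall mean:    {baseline_mean:.1f}%")   # 11.1%
print(f"With skills overall mean: {skills_mean:.1f}%")     # 100.0%
```

With only three evals, a single changed result moves the overall mean by a third of that eval's score change, so the summary row is sensitive to each individual prompt.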

What this measures

This demo does not claim that final code quality improved by the same amount.

It measures whether the agent produced the workflow artifacts that make real development safer:

  • clear scope
  • testable acceptance criteria
  • explicit validation steps
  • reusable project memory
  • structured correction outputs
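
One way to picture an artifact-presence check like this is a simple checklist score. The sketch below is hypothetical: the `REQUIRED_MARKERS` mapping, the keyword matching, and the `artifact_score` function are assumptions made for illustration and are not the checks this project actually runs.

```python
# Hypothetical rubric: each expected artifact maps to a keyword that must
# appear in the agent's output. These markers are illustrative only.
REQUIRED_MARKERS = {
    "clear scope": "scope",
    "testable acceptance criteria": "acceptance criteria",
    "explicit validation steps": "validation",
}


def artifact_score(output: str, markers: dict[str, str] = REQUIRED_MARKERS) -> float:
    """Return the percentage of expected artifacts found in the output."""
    text = output.lower()
    found = sum(1 for marker in markers.values() if marker in text)
    return 100.0 * found / len(markers)


# An output that states scope but omits criteria and validation steps.
sample = "Scope: add retry logic to the uploader."
print(f"{artifact_score(sample):.1f}%")  # 33.3%
```

A check of this kind only confirms that an artifact is present; it says nothing about whether the scope is correct or the criteria are good.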

Use it as a demo signal, not a statistically rigorous benchmark.