spec-coding-skills demo benchmark

This project includes a small 3-prompt demo benchmark comparing two configurations: baseline and with skills.

Result summary

Eval	Baseline	With skills
Planning an existing feature	33.3%	100.0%
Correcting a failing test	0.0%	100.0%
Saving a reusable root cause	0.0%	100.0%
Overall mean	11.1%	100.0%
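
The "Overall mean" row is consistent with an unweighted average of the three per-eval scores. The sketch below reproduces that arithmetic from the numbers in the table. It assumes equal weighting per eval, and the names `results` and `overall_mean` are illustrative, not part of the project.

```python
from statistics import mean

# Per-eval scores in percent, copied from the table above:
# (baseline, with skills). Equal weighting per eval is an assumption.
results = {
    "Planning an existing feature": (33.3, 100.0),
    "Correcting a failing test": (0.0, 100.0),
    "Saving a reusable root cause": (0.0, 100.0),
}


def overall_mean(scores):
    """Return the unweighted mean of the scores, rounded to one decimal."""
    return round(mean(scores), 1)


baseline_mean = overall_mean([b for b, _ in results.values()])
skills_mean = overall_mean([s for _, s in results.values()])

print(f"Overall mean\t{baseline_mean}%\t{skills_mean}%")
```

With only three prompts, a single changed outcome on one eval moves the overall mean by a large margin, so the gap is best read as directional.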

This demo does not claim that final code quality improved by the same amount.

It measures whether the agent produced the workflow artifacts that make real development safer.

Use it as a demo signal, not a statistically rigorous benchmark.