Discussion about this post

Rohan Jaiswal

The roughly 15-point gap between Claude Haiku's 42.9% and Opus's 58.2% on Meta's Harvest benchmark matters less than your finding that a richer tooling harness outperformed model swaps. Builders have been asking 'which model?' when the actual variable is 'which harness?' The production language distribution (47.7% Hack, 18.9% Python) also quietly rebukes the Python-centric research community that designed the benchmarks everyone uses for procurement decisions. What would need to change about benchmark design for enterprise teams to actually revise model contracts based on production-derived evals, rather than just noting the paper and moving on? Covering this from the builder side at theaifounder.substack.com.

Eddy Bogomolov

The funnel chart is the part that should keep eng leads up at night. Going from 50K to the low hundreds is a gap of up to 500x between what runs and what’s actually production-ready. Public benchmarks select for the top of that funnel because the bottom is expensive to measure. Most procurement decisions live in the gap.
