
Is this speculation or do you work at Anthropic? It would be cool to see the prompts used for this.


What he is describing has become the 'standard' way to run that kind of benchmark, so he is almost certainly correct. SWE-bench [1] is the best open-source benchmark.

[1] https://www.swebench.com/



