
I found the shifts in certain areas between versions interesting. Claude Sonnet 3.7 to 4.5, for example.


Because the question bank is so small, it's very easy for a model to go from 0% to 100% in some category between versions just by flipping its answer to 1 or 2 questions, especially if it refuses to answer yes/no to one or more questions in that category.

It's hard to take away much from this without a large, diverse question bank.
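
To make the arithmetic concrete, here's a rough sketch of how a tiny category can swing from 0% to 100%. Everything in it is an assumption for illustration: the category size, the answers, and the scoring rule (refusals dropped from the denominator) are made up, not taken from the actual benchmark.

```python
from typing import Optional, Sequence


def category_score(answers: Sequence[Optional[bool]]) -> Optional[float]:
    """Percent of 'yes' answers among the questions actually answered.

    None marks a refusal. Refusals are dropped from the denominator,
    which is an assumed scoring rule, not one confirmed by the benchmark.
    """
    answered = [a for a in answers if a is not None]
    if not answered:
        return None  # every question refused: no score for this category
    return 100.0 * sum(answered) / len(answered)


# Hypothetical 3-question category; both versions refuse the third question.
old_version = [False, False, None]
new_version = [True, True, None]

old_score = category_score(old_version)
new_score = category_score(new_version)
print(old_score, new_score)
```

With one refusal, only two answers count, so flipping both moves the category from one extreme to the other. With two refusals, a single flipped answer would do the same.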



