
Do you have benchmarks demonstrating this? In my own personal/team benchmarks, I've seen 4o consistently outperform the original gpt-4.


I'm building a product that requires complex LLM flows, and out of OpenAI's "cheap" tier models, the older versions of GPT-3.5 Turbo are far better than its latest versions and 4o-mini. I have a number of tasks that the former consistently succeed at and the latter consistently fail at, regardless of prompting.
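
To make that kind of claim concrete, here's a minimal sketch of the comparison workflow being described: run the same prompts against pinned model snapshots and count how often each passes a task-specific check. The model names, tasks, and pass/fail predicates below are illustrative assumptions, not the commenter's actual setup.

    # pip install openai  (v1 SDK)
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Pinned snapshots: an older 3.5-Turbo vs. a newer cheap-tier model
    # (model IDs are assumptions for illustration).
    MODELS = ["gpt-3.5-turbo-0613", "gpt-4o-mini"]

    # Hypothetical tasks: each pairs a prompt with a predicate on the output.
    TASKS = [
        ("Extract the ISO date from: 'Invoice issued March 5, 2021.' "
         "Reply with the date only.",
         lambda out: out.strip() == "2021-03-05"),
        ('Reply with exactly this JSON and nothing else: {"ok": true}',
         lambda out: out.strip() == '{"ok": true}'),
    ]

    for model in MODELS:
        passed = 0
        for prompt, check in TASKS:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0,  # reduce variance so failures are consistent
            )
            if check(resp.choices[0].message.content):
                passed += 1
        print(f"{model}: {passed}/{len(TASKS)} tasks passed")

Running each task several times per model would give a pass rate rather than a single sample, which matters when failures are intermittent rather than consistent.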

Leaderboards and benchmarks are very misleading, as OpenAI optimizes for them, much like certain CPU manufacturers once optimized for synthetic benchmarks.

FWIW, these aren't chat use cases, for which the newer models may well be better.



