Previous discussions about "pelican on a bicycle" always mention this, but it's not something they can do without being blatantly obvious. You can always do other x riding y tests. A juggler riding a barrel. A bear riding a unicycle. An anteater riding a horse, etc.
A couple of days ago, inspired by Simon and those discussions, I had Claude create 30 such tests. I posted a Show HN with the results from six models, but it didn’t get any traction. Here it is again:
Oh man, that’s hilarious. I dunno what Qwen is doing most of the time. Gemini seems to be either a perfect win or complete nonsense. Claude seems to lean towards “good enough”.
Simon has said multiple times he has hidden tests he runs for precisely this eventuality (because of course it will happen someday, and he'll write a banger article calling them out for it).
Under a different account? They could special-case his account and even give him more computing power, because they know he'll not rest until every dog on this planet has a subscription to at least ChatGPT and Claude Code.
> [...] but it's not something they can do without being blatantly obvious.
Thing is, it is being done by certain labs and it is blatantly obvious [0]. For months now it has been very easy to tell which labs either "train to the test" or (if you wanna give them the benefit of the doubt they have most certainly not earned) simply fail to keep their datasets sanitized.
I still see value in prompting for SVGs of odd scenarios and will continue to do so, since it works as a spatial-reasoning-focused benchmark. But I will never use the same scenario a second time, and I have a hard time taking seriously anyone who does, again because I have seen consistent evidence that certain labs "bench-max" to a fault.
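If you want to try this yourself, here is a minimal sketch of the idea, assuming the official anthropic Python SDK with ANTHROPIC_API_KEY set in the environment; the word lists, model name and output file are just placeholders I made up, the point is only that the scenario is freshly generated and never reused:

    import random
    import anthropic

    # Placeholder word lists; swap in whatever nouns you like.
    riders = ["juggler", "bear", "anteater", "octopus", "flamingo"]
    mounts = ["barrel", "unicycle", "horse", "pogo stick", "zamboni"]

    # Build a fresh scenario each run so it is never reused.
    scenario = f"a {random.choice(riders)} riding a {random.choice(mounts)}"

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder; substitute the model under test
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": f"Generate an SVG of {scenario}. Reply with only the SVG markup.",
        }],
    )

    # Save whatever the model returned for visual inspection.
    with open("test.svg", "w") as f:
        f.write(response.content[0].text)
    print("scenario:", scenario)

The only rule I hold myself to is retiring a scenario once it has appeared anywhere public.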
You are right, it is obvious and easily proven, yet despite that, the fact remains that a lot of hype masters simply do not care.
Say what you want about Anthropic (I have, and will continue to do so), but they appear to take more steps than any other industry member to avoid training solely with a focus on beating benchmarks. In my experience their models usually perform better than established, public benchmarks would make one think. From what I have seen, they also take the most precautions to ensure, to the best of their abilities, that e.g. their own research papers on model "misaligned behavior"/unprompted agentic output do not find their way into the training corpus, using canary strings [1], etc.
Overall, if I were asked whether any lab is doing everything it truly can to avoid unintentionally training with a focus on popular, eye-catching benchmarks, I'd say none, partly because that is likely impossible to avoid when using the open web as a source.
On the other hand, if I were asked whether any labs are intentionally and clearly training specifically with a focus on popular, eye-catching benchmarks, I'd have a few names at the top of my mind right away. Just do what you suggested: try other out-there scenarios as SVGs and see for yourself the discrepancy with e.g. the panda burger or cycle pelican. It is blatant and shameless, and I ask any person with an audience in the LLM space to do the same. The fact that few, if any, seem to is annoying to say the least.
[1] In fairness, without knowing their data acquisition pipeline, etc., it is hard to tell how effective such measures can truly be, considering that open-web reporting on their papers is unlikely to include the canary strings.