Nah, it's all pattern matching. This is how automated theorem provers like Isabelle are built: they apply operations to lemmas/expressions until they reach a proof.
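To make that concrete, here is a toy Lean 4 sketch (my own illustration, not Isabelle's internals): the rw tactic pattern-matches the goal against the statement of a known lemma and rewrites with it until the goal closes.

    -- Toy illustration of proof-by-rewriting: rw unifies the pattern
    -- from Nat.add_comm : ∀ n m, n + m = m + n against the left-hand
    -- side of the goal, rewrites a + b to b + a, and the resulting
    -- b + a = b + a closes by reflexivity.
    example (a b : Nat) : a + b = b + a := by
      rw [Nat.add_comm]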
I'm sure if you pick a sufficiently broad definition of pattern matching your argument is true by definition!
Unfortunately that has nothing to do with the topic of discussion, which is the capabilities of LLMs, and which may require a narrower definition of pattern matching.
This is our earlier work. Since May we've made it really easy for the community to build their own agents: you can now hook up your terminal and have Claude Code play the game.
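For anyone wondering what that hookup looks like, here is a minimal sketch of a stdio MCP server using the MCP Python SDK's FastMCP; the server name and the single stubbed tool below are simplified placeholders rather than the actual FLE server and its tool set.

    # toy_fle_server.py - simplified sketch, not the real FLE MCP server.
    # Claude Code can attach to a stdio MCP server like this one and then
    # call its tools to drive the game from your terminal.
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("toy-factorio")

    @mcp.tool()
    def run_console_command(command: str) -> str:
        """Pretend to send a console command to a running Factorio server."""
        # A real implementation would forward the command (e.g. over RCON)
        # and return the game's output to the model.
        return f"(stub) would execute: {command}"

    if __name__ == "__main__":
        mcp.run()  # stdio transport, so an MCP client such as Claude Code can attach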
(just for clarity: links to past threads in no way imply that the new post isn't welcome! They're just because some readers enjoy poking back through past related discussions as well)
I am really keen on plugging into Age of Empires 2 - although practically I think we need a couple of years of improvements before LLMs are smart and fast enough to react to the game in real time. Currently they can't react fast enough, although specially trained networks could be viable.
I'm pretty sure that AI did take at least a few games off the pros. IIRC the professional team only had one win, in the last match.
I do agree that the game was terribly dumbed down to make it tractable. I keep hoping they'll revisit Dota 2 to see if they can find meaningful improvements and tackle the full game.
Yes, the OpenAI Five bots won a best of three in their custom format back in 2019. The bots won the first two games, then a third game was played which the humans won; that was the point I was trying to make (I'm not the GP).
Unless you know of another time the bots were deployed formally against a pro team more recently, which I'd love to hear about.
As I recently commented on Bluesky, I want to write a contemporary choral setting of Spem in Alium (hope in another) but write the title Spem in Allium (hope in garlic) and see if it can make it to publication before anyone notices.
1. These are additions to our existing Factorio Learning Environment, an extensive environment for evaluating pre-trained LLM agents in an unbounded/open-ended setting in the game of Factorio (see the sketch below this list). I don't agree that it is trivial; there is significant infrastructure in place to support Factorio as an LLM eval.
2. Factorio is an unsolved game in multi-agent research.
3. This is a research environment. You can read our paper on arXiv if you're interested! Nobody will make any money off this.
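To give a feel for what evaluating pre-trained LLM agents in an open-ended setting means mechanically, here is a minimal gym-style loop; the class and method names (StubEnv, reset, step) are illustrative stand-ins rather than the actual FLE API.

    # Illustrative agent/environment loop (simplified names, not the FLE API).
    # An "agent" here is any callable mapping an observation string to an
    # action string, e.g. a thin wrapper around an LLM chat completion.

    class StubEnv:
        """Stand-in environment so the loop runs; FLE exposes far richer state."""
        def reset(self) -> str:
            self.steps = 0
            return "inventory: empty; goal: craft an iron gear wheel"

        def step(self, action: str):
            self.steps += 1
            done = "craft" in action or self.steps >= 3
            score = 1.0 if "craft" in action else 0.0
            return f"after '{action}': step {self.steps}", score, done

    def evaluate(agent, env, max_steps: int = 100) -> float:
        """Run one episode and return the last score the environment reports."""
        observation = env.reset()
        score = 0.0
        for _ in range(max_steps):
            action = agent(observation)   # the LLM picks the next action/program
            observation, score, done = env.step(action)
            if done:
                break
        return score

    if __name__ == "__main__":
        print(evaluate(lambda obs: "craft iron-gear-wheel", StubEnv()))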
It's Mart, Neel and Jack from the Factorio Learning Environment team.
Since our initial release, we have been working hard to expand the environment to support multi-agent scenarios, reasoning models and MCP for human-in-the-loop evals.
We have also spent time experimenting with different ways to elicit more performance out of agents in the game, namely tools for vision and reflection.
Today, we are proud to release v0.2.0, which includes several exciting new features and improvements.
This is true - there are simpler benchmarks that can saturate planning ability for these models. We were motivated to create a broader-spectrum eval that tests multiple capabilities at once and remains viable into the future.
That's fair enough, but you should test other frontier model types to see if the benchmark makes sense for them.
For example, the shortest-path benchmark is largely useless when you look at reasoning models - since they have the equivalent of scratch paper to work through their answers, the limitation becomes their context length rather than any innate ability to reason.
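To make that concrete, here is a sketch of what a shortest-path eval item might look like: build a small random weighted graph, compute the ground truth with Dijkstra, and grade the model's numeric answer against it. The generator and prompt format are my own illustration, not any particular published benchmark.

    # Sketch of a shortest-path eval item (illustrative, not a specific benchmark).
    import heapq
    import random

    def random_graph(n=8, extra_p=0.3, seed=0):
        rng = random.Random(seed)
        adj = {v: [] for v in range(n)}
        def add_edge(u, v, w):
            adj[u].append((v, w))
            adj[v].append((u, w))
        for u in range(n - 1):               # spanning path keeps every node reachable
            add_edge(u, u + 1, rng.randint(1, 9))
        for u in range(n):                   # extra edges make the shortest path non-trivial
            for v in range(u + 2, n):
                if rng.random() < extra_p:
                    add_edge(u, v, rng.randint(1, 9))
        return adj

    def dijkstra(adj, src, dst):
        dist = {src: 0}
        heap = [(0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if u == dst:
                return d
            if d > dist.get(u, float("inf")):
                continue                     # stale heap entry
            for v, w in adj[u]:
                if d + w < dist.get(v, float("inf")):
                    dist[v] = d + w
                    heapq.heappush(heap, (d + w, v))
        return float("inf")

    adj = random_graph()
    edges = ", ".join(f"({u}, {v}, {w})" for u in adj for v, w in adj[u] if u < v)
    prompt = f"Undirected weighted edges (u, v, w): {edges}. What is the length of the shortest path from node 0 to node 7?"
    expected = dijkstra(adj, 0, 7)
    # A reasoning model with a scratchpad can simply run Dijkstra step by step,
    # so the binding constraint becomes context length, not 'innate' reasoning.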