For me, Claude Code was the most impressive innovation this year. Cursor was a good proof of concept but Claude Code is the tool that actually got me to use LLMs for coding.
The kind of code that Claude produces looks almost exactly like the code I would write myself. It's like it's reading my mind. This is a game changer because I can maintain the code that Claude produces.
With Claude Code, there are no surprises. I can pretty much guess what its code will look like 90% to 95% of the time but it writes it a lot faster than I could. This is an amazing innovation.
Gemini is quite impressive as well. Nano banana in particular is very useful for graphic design.
I haven't tried Gemini with coding yet but TBH, Claude Code does such a great job; if I could code any faster, I would get decision fatigue. I don't like rushing into architecture or UX decisions. I like to sit on certain decisions for a day or two before starting implementation. Once you start in a particular direction, it's hard to undo and you may try to double down on the mistake due to sunk cost fallacy. I try hard to avoid that.
I don't even see much reason to use Cursor. I am used to IntelliJ IDEA, so I just downloaded the Claude Code plugin and basically now I use the IDE only for navigating in the code, finding references and reviewing the code. I can't even remember the last time I wrote more than 2 lines of code. Claude Code has catapulted my performance at least 5x if not more. And now that the cost of writing test is so minimal I am also able to achieve much better (and meaningful!) test coverage too. The AI agents is where the most productivity is.
I just create a plan with Claude, iterate over, ask questions, then let it implement the plan, review, ask to do some adjustments. No manual writing of code at all. Zero.
I first got into agentic properly with GLM coding plan (it's like $2/month), but I found myself very consistently asking Claude to make the code more elegant and readable. At which point I realized I was being silly and just switched to Claude code.
(GLM etc. get surprisingly close with good prompting but... $0.60/day to not worry about that is a no brainer.)
You are missing an entire agentic experience. And I wouldn't call it vibe coding for an engineer; you're more or less empowered to truly orchestrate the development of your system.
Cursor has agent, but that's like whoever else tried to copy the Model T while Ford was developing it.
I have only compared Claude Code with Crush and a tool of my own design. In my experience, Claude code is optimized for giant codebases and long tasks. It loves launching dozens of agents in parallel. So it's a bit heavy for smaller, surgical stuff, though it works decent for that too.
If you mostly have small codebases that fit in context, or make many small changes interactively, it's not really great for that (though it can handle it too). It'll just be spending most of its time poking around the codebase, when the whole thing should have just been loaded... (Too bad there's no small repo mode. I made startup hook that just dumps cat dir into context, but yeah, should be a toggle.)
You're not missing much. You can generally use Cursor like Claude Code for normal day to day use. I prefer Cursor because I like reviewing changes in an IDE, and I like being able to switch to the current SOTA model.
Though for more automated work, one thing you miss with Cursor is sub agents. And then to a lesser extent skills (these are pretty easy to emulate in other tools). I'm sure it's only a matter of time though.
The big limitation is that you have to approve/disapprove at every step. With Cursor you can iterate on changes and it updates the diffs until you approve the whole batch.
If you switch to Codex you will get a lot of tokens for $200, enough to more consistently use high reasoning as well. Cursor is simply far more expensive so you end up using less or using dumber models.
Claude Code is overrated as it uses many of its features and modalities to compensate for model shortcomings that are not as necessary for steering state of the art models like GPT 5.2
I think this is a total misunderstanding of Anthropic’s place in the AI race. Opus 4.5 is absolutely a state of the art model. I won’t knock anyone for preferring Codex, but I think you’re ignoring official and unofficial benchmarks.
The field moves fast. Per artificialanalysis, Opus 4.5 is currently behind GPT-5.2 (x-high) and Gemini 3 Pro. Even Google's cheaper Gemini 3 Flash model seems to be slightly ahead of Opus 4.5.
Totally, however OP's point was that Claude had to compensate for deficiencies versus a state of the art model like ChatGPT 5.2. I don't think that's correct. Whether or not Opus 4.5 is actually #1 on these benchmarks, it is clearly very competitive with the other top-tier models. I didn't take "state of the art" to here narrowly mean #1 on a given benchmark, but rather to mean near or at the frontier of current capabilities.
I wonder how model competence and/or user preference on web development (that leaderboard) carries over to more complex and larger projects, or more generally anything other than web development ?
In addition to whatever they are exposed to as part of pre-training, it'd be interesting to know what kind of coding tasks these models are being RL-trained for? Are things like web development and maybe Python/ML coding overemphasized, or are they also being trained on things like Linux/Windows/embedded development etc in different languages?
Yes, I personally feel that the "official" benchmarks are increasingly diverging from the everyday reality of using these models. My theory is that we are reaching a point where all the models are intelligent enough for day-to-day queries, so points like style/personality and proper use of web queries and other capabilities are better differentiators than intelligence alone.
I disagree, the claude models seem the best at tool calling, opus 4.5 seems the smartest, and claude code (+ claude model) seems to make good use of subagents and planning in a way that codex doesn't
I noticed that despite really liking Karpathy and the blog, I was am kind of wincing/involuntarily reacting to the LLM-like "It's not X, its Y"-phrases:
> it's not just a website you go to like Google, it's a little spirit/ghost that "lives" on your computer
> it's not just about the image generation itself, it's about the joint capability coming from text generation
There would be no reaction from me on this 3 years ago, but now this sentence structure is ruined for me
I hated these sentences way before LLMs, at least in the context of an explanation.
> it's not just a website you go like Google, it's a little spirit/ghost that "lives" on your computer
This type of sentence, I call rhetorical fat. Get rid of this fat and you obtain a boring sentence that repeats what has been said in the previous one.
Not all rhetorical fats are equal, and I must admit I find myself eyerolling on the "little spirit" part more than about the fatness.
I understand the author wants to decorate things and emphasize key elements, and the hate I feel is only caused by the incompatible projection of my ideals to a text that doesn't belong to me.
> it's not just about the image generation itself, it's about the joint capability coming from text generation.
That's unjustified conceptual stress.
That could be a legitimate answer to a question ("No, no, it's not just about that, it's more about this"), but it's a text. Maybe the text wants you to be focused, maybe the text wants to hype you; this is the shape of the hype without the hype.
"I find image generation is cooler when paired with text generation."
Karpathy should go back to what he does best: educating people about AI on a deep level. Running experiments and sharing how they work, that sort of stuff. It seems lately he is closer to an influencer who reviews AI-based products. Hopefully it is not too late to go back.
It is not a decoration. Karpathy juxtaposes ChatGPT (which feels like a "better google" to most people) to Claude Code, which, apparently, feels different to him. It's a comparison between the two.
You might find this statement non-informative, but without two parts there's no comparison. That's really the semantics of the statement which Karpathy is trying to express.
ChatGPT-ish "it's not just" is annoying because the first part is usually a strawman, something reader considers trite. But it's not the case here.
Indeed, I was probably grumpy at the time I wrote the comment. I do find some truth in it still.
You're right ! The strawman theory is based.
But I think there's more to it, I find dislikable the structure of these sentences (which I find a bit sensationnalist for nothing, I don't know, maybe I am still grumpy).
Same here, had to configure ChatGPT to stop making these statements. Also had to configure bunch of other stuff to make it bland when answering questions.
The way to make AI not sound like ChatGPT is to use Claude.
I realized that's what bothered me. It's not "oh my god, they used ChatGPT." But "oh my god, they couldn't even be bothered to use Claude."
It'll still sound like AI, but 90% of the cringe is gone.
If you're going to use AI for writing, it's just basic decency to use the one that isn't going to make your audience fly into a fit of rage every ten seconds.
That being said, I feel very self conscious using emdashes in current decade ;)
I appreciate Andrej’s optimistic spirit, and I am grateful that he dedicates so much of his time to educating the wider public about AI/LLMs. That said, it would be great to hear his perspective on how 2025 changed the concentration of power in the industry, what’s happening with open-source, local inference, hardware constraints, etc. For example, he characterizes Claude Code as “running on your computer”, but no, it’s just the TUI that runs locally, with inference in the cloud. The reader is left to wonder how that might evolve in 2026 and beyond.
The CC point is more about the data and environmental and general configuration context, not compute and where it happens to run today. The cloud setups are clunky because of context and UIUX user in the loop considerations, not because of compute considerations.
Agree with the GP, though -- you ought to make that clearer. It really reads like you're saying that CC runs locally, which is confusing since you obviously know better.
I think we need to shift our mindset on what an agent is. The LLM is a brain in a vat connected far away. The agent sits on your device, as a mech suit for that brain, and can pretty much do damn near anything on that machine. It's there, with you. The same way any desktop software is.
One of the most interesting coding agents to run locally is actually OpenAI Codex, since it has the ability to run against their gpt-oss models hosted by Ollama.
Think of it as the early years of UNIX & PC. Running inferences and tools locally and offline opens doors to new industries. We might not even need client/server paradigm locally. LLM is just a probabilistic library we can call.
What he meant was, agents will probably not be these web abstractions that run in deployed services (langchain, crew); agents meaning the Harnesses (software wrapper) specifically that call the LLM API.
It runs on your computer because of its tooling. It can call Bash. It can literally do anything on the operating system and file system. That's what makes it different. You should think of it like a mech suit. The model is just the brain in a vat connected far away.
The section on Claude Code is very ambiguously and confusingly written, I think he meant that the agent runs on your computer (not inference) and that this is in contrast to agents running "on a website" or in the cloud:
> I think OpenAI got this wrong because I think they focused their codex / agent efforts on cloud deployments in containers orchestrated from ChatGPT instead of localhost. [...] CC got this order of precedence correct and packaged it into a beautiful, minimal, compelling CLI form factor that changed what AI looks like - it's not just a website you go to like Google, it's a little spirit/ghost that "lives" on your computer. This is a new, distinct paradigm of interaction with an AI.
However, if so, this is definitely a distinction that needs to be made far more clearly.
The distinction Karpathy draws between "growing animals" and "summoning ghosts" via RLVR is the mental model I didn't know I needed to explain the current state of jagged intelligence. It perfectly articulates why trust in benchmarks is collapsing; we aren't creating generally adaptive survivors, but rather over-optimizing specific pockets of the embedding space against verifiable rewards.
I’m also sold on his take on "vibe coding" leading to ephemeral software; the idea of spinning up a custom, one-off tokenizer or app just to debug a single issue, and then deleting it, feels like a real shift.
> The distinction Karpathy draws between "growing animals" and "summoning ghosts" via RLVR
I don't see these descriptions as very insightful.
The difference between general/animal intelligence and jagged/LLM intelligence is simply that humans/animals really ARE intelligent (the word was created to describe this human capability), while LLMs are just echoing narrow portions of the intelligent output of humans (those portions that are amenable to RLVR capture).
For an artificial intelligence to be intelligent in it's own right, and therefore be generally intelligent, it would need to need - like an animal - to be embodied (even if only virtually), autonomous, predicting the outcomes of it's own actions (not auto-regressively trained), learning incrementally and continually, built with innate traits like curiosity and boredom to put and keep itself in learning situations, etc.
Of course not all animals are generally intelligent - many (insects, fish, reptiles, many birds) just have narrow "hard coded" instinctual behaviors, but others like humans are generalists who evolution have therefore honed for adaptive lifetime learning and general intelligence.
> In the same way, LLMs should speak to us in our favored format - in images, infographics, slides, whiteboards, animations/videos, web apps, etc.
You think every Electron app out there re-inventing application UX from scratch is bad, wait until LLMs are generating their own custom UX for every single action for every user for every device. What does command-W do in this app? It's literally impossible to predict, try it and see!
On the other side of the spectrum, I see some of the latest agents, like Codex, take care to get accessibility right -- something not even many humans bother to do.
It's an extension of how I've noticed that AIs will generally write very buttoned-down, cross-the-ts-and-dot-the-is code. Everything gets commented, every method has a try-catch with a log statement, every return type is checked, etc. I think it's a consequence of them not feeling fatigue. These things (accessibility included) are all things humans generally know they 'should' do, but there never seems to be enough time in the day; we'll get to it later when we're less tired. But the ghost in the machine doesn't care. It operates at the same level all the time
I would love Andrej's take on the fast models we got this year. Gemini 3 flash and Grok 4 fast have no business being as good + cheap + fast as they are. For Andrej's prediction about LLMs communicating with us via a visual interface we're going to need fast models, but I feel like AI twitter/HN has mostly ignored these.
Just guessing here, but these small models may well be essentially distillations of larger ones, with this being where their power comes from. e.g. Use a large model to generate synthetic reasoning traces, then train a small model on those.
The bit about o3 being the turning point is very interesting. I heard someone say that o3 (or perhaps the cheaper o4-mini) should have been called gpt-5, and that people would have been mind blown. Instead it kind of went under the radar as far as the mainstream goes.
Whereas we just got the incremental progress with gpt-5 instead and it was very underwhelming. (Plus like 5 other issues at launch, but that's a separate story ;)
I'm not sure if o4-mini would have made a good default gpt though. (Most use is conversational and its language is very awkward.) So they could have just called it gpt-5 pro or something, and put it on the $20 tier. I don't know.
Notable omission: 2025 is also when the ghosts started haunting the training data. Half of X replies are now LLMs responding to LLMs. The call is coming from inside the dataset.
> I like this version of the meme for pointing out that human intelligence is also jagged in its own different way.
The idea of jaggedicity seems useful to advancing epistemology. If we could identify the domains that have useful data that we fail to extract, we could fill those holes and eventually become a general intelligence ourselves. The task may be as hard as making a list of your blind spots. But now we have an alien intelligence with an outside perspective. While making AI less jagged it might return the favor.
If we keep inventing different kinds of intelligence the sum of the splats may eventually become well rounded.
I don't think it will become well rounded because that is not cost sensitive. Intelligence is sensitive to cost, it is the core constraint shaping it. Any action has a cost - energy, materials, time, opportunity or social. Intelligence is solving the cost equation, if we can't solve it we die. Cost is also why we specialize, in a group we can offload some intelligence to others. LLMs also have their own costs, and are shaped by it into some kind of jagged intelligence, they are no spherical cows either.
What's interesting about Nano Banana (and even more so video models like Veo 3) is that they act as a weird kind of world model when you consider that they accept images as input and return images as output.
Give it an image of a maze, it can output that same image with the maze completed (maybe).
> We demonstrate that Veo 3 can zero-shot solve a broad variety of tasks it wasn't explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and much more.
I think he is referring to capability, not architecture, and say that NB is at the point that it is suggestive of the near-future capability of using GenAI models to create their own UI as needed.
NB (Gemini 2.5 Flash Image) isn't the first major-vendor LLM-based image gen model, after all; GPT Image 1 was first.
I think one of the things that is missing from this post is engaging a bit in trying to answer: what are the highest priority AI-related problems that the industry should seek to tackle?
Karpathy hints at one major capability unlock being UI generation, so instead of interacting with text the AI can present different interfaces depending on the kind of problem. That seems like a severely underexplored problem domain so far. Who are the key figures innovating in this space so far?
In the most recent Demis interview, he suggests that one of the key problems that must be solved is online / continuous learning.
Aside from that, another major issues is probably reducing hallucinations and increasing reliability. Ideally you should be able to deploy an LLM to work on a problem domain, and if it encounters an unexpected scenario it reaches out to you in order to figure out what to do. But for standard problems it should function reliably 100% of the time.
Beyond graduating students, I see model labs as “accelerators/incubators” bundling, launching, and productizing observed ideas that gain traction. The sheer strength of their platforms, the number of eyes watching them, near-zero marginal costs, and seemingly unlimited budgets mean that only slow decision-making can prevent them from becoming the next Amazons of everything.
Something I’ve been thinking about is how as end stage users (eg building our own “thing” on top of an LLM) we can broadly verify it’s doing what we need without benchmarks. Does a set of custom evals built out over time solve this? Is there more we can do?
Friendly reminder: There is no ghost in the machine. It is a system executing code, not a being having thoughts. Let’s admire the tool without projecting a personality onto it.
Consciousness is weird and nobody understands it. There is no good reason to assume that these systems have it. But there is also no good reason to rule it out.
LLMs still need to bring clear added value to enterprise and corporate work; otherwise, they remain a geek’s toy.
Big media agencies that claim to use AI rely on strong creative teams who fine-tune prompts and spend weeks doing so. Even then, they don’t fully trust AI to slice long videos into shorter clips for social media.
Heavy administrative functions like HR or Finance still don’t get approval to expose any of their data to LLMs.
What I’m trying to say is that we are still in the early stages of LLM development, and as promising as this looks, it’s still far from delivering the real value that is often claimed.
Vibe coding is sufficient for job hoppers who never finish anything and leave when the last 20% have to be figured out. Much easier to promote oneself as an expert and leave the hard parts to other people.
I’ve found incredible productivity gains writing (vibe coding) tools for myself that will never need to be “productionised” or even used by another person. Heck even I will probably never use the latest log retrieval tool, which exists purely for Claude code to invoke it. There is a ton of useful software yet to be written for which there _is_ no “last 20%”.
These tools are so useful and make you so much more "productive" that you don't think anyone else would want to pay anything for them huh? Did your boss at least give you a big raise for your "productivity" increase, or maybe lay off some of your underperforming coworkers bc you are just so much better now?
Do you mean vibe coding as-in producing unreviewed code with LLMs and prompting at it until it appears to work, or vibe coding as a catch-all for any time someone uses AI-assistance to help them write code?
The kind of code that Claude produces looks almost exactly like the code I would write myself. It's like it's reading my mind. This is a game changer because I can maintain the code that Claude produces.
With Claude Code, there are no surprises. I can pretty much guess what its code will look like 90% to 95% of the time but it writes it a lot faster than I could. This is an amazing innovation.
Gemini is quite impressive as well. Nano banana in particular is very useful for graphic design.
I haven't tried Gemini with coding yet but TBH, Claude Code does such a great job; if I could code any faster, I would get decision fatigue. I don't like rushing into architecture or UX decisions. I like to sit on certain decisions for a day or two before starting implementation. Once you start in a particular direction, it's hard to undo and you may try to double down on the mistake due to sunk cost fallacy. I try hard to avoid that.
reply