I don’t have much time to evaluate tools every months and I have settled on Curs...

afro88 · 2025-12-20T10:22:24 1766226144

You're not missing much. You can generally use Cursor like Claude Code for normal day to day use. I prefer Cursor because I like reviewing changes in an IDE, and I like being able to switch to the current SOTA model.

Though for more automated work, one thing you miss with Cursor is sub agents. And then to a lesser extent skills (these are pretty easy to emulate in other tools). I'm sure it's only a matter of time though.

Ozzie_osman · 2025-12-20T13:12:07 1766236327

Claude Code's VS Code integration is very easy to set up and pretty helpful if you want to see/review changes in an IDE.

ollysb · 2025-12-20T13:29:59 1766237399

The big limitation is that you have to approve/disapprove at every step. With Cursor you can iterate on changes and it updates the diffs until you approve the whole batch.

fzzzy · 2025-12-20T16:08:00 1766246880

There is an auto accept diffs mode

ramoz · 2025-12-20T13:20:12 1766236812

You are missing an entire agentic experience. And I wouldn't call it vibe coding for an engineer; you're more or less empowered to truly orchestrate the development of your system.

Cursor has agent, but that's like whoever else tried to copy the Model T while Ford was developing it.

senordevnyc · 2025-12-20T16:35:30 1766248530

This hasn’t been my experience at all. I’m finding Cursor with Opus 4.5 and plan mode to be just as capable as CC. And I prefer the UI/UX.

andai · 2025-12-20T09:54:17 1766224457

I have only compared Claude Code with Crush and a tool of my own design. In my experience, Claude code is optimized for giant codebases and long tasks. It loves launching dozens of agents in parallel. So it's a bit heavy for smaller, surgical stuff, though it works decent for that too.

If you mostly have small codebases that fit in context, or make many small changes interactively, it's not really great for that (though it can handle it too). It'll just be spending most of its time poking around the codebase, when the whole thing should have just been loaded... (Too bad there's no small repo mode. I made startup hook that just dumps cat dir into context, but yeah, should be a toggle.)

wahnfrieden · 2025-12-20T08:04:22 1766217862

If you switch to Codex you will get a lot of tokens for $200, enough to more consistently use high reasoning as well. Cursor is simply far more expensive so you end up using less or using dumber models.

Claude Code is overrated as it uses many of its features and modalities to compensate for model shortcomings that are not as necessary for steering state of the art models like GPT 5.2

MrOrelliOReilly · 2025-12-20T09:46:15 1766223975

I think this is a total misunderstanding of Anthropic’s place in the AI race. Opus 4.5 is absolutely a state of the art model. I won’t knock anyone for preferring Codex, but I think you’re ignoring official and unofficial benchmarks.

See: https://artificialanalysis.ai

woadwarrior01 · 2025-12-20T11:20:29 1766229629

> Opus 4.5 is absolutely a state of the art model.

> See: https://artificialanalysis.ai

The field moves fast. Per artificialanalysis, Opus 4.5 is currently behind GPT-5.2 (x-high) and Gemini 3 Pro. Even Google's cheaper Gemini 3 Flash model seems to be slightly ahead of Opus 4.5.

MrOrelliOReilly · 2025-12-20T14:26:58 1766240818

Totally, however OP's point was that Claude had to compensate for deficiencies versus a state of the art model like ChatGPT 5.2. I don't think that's correct. Whether or not Opus 4.5 is actually #1 on these benchmarks, it is clearly very competitive with the other top-tier models. I didn't take "state of the art" to here narrowly mean #1 on a given benchmark, but rather to mean near or at the frontier of current capabilities.

gessha · 2025-12-20T15:30:03 1766244603

One thing to remember when comparing ML models of any kind is that single value metrics obscure a lot of nuance and you really have to go through the model results one by one to see how it performs. This is true for vision, NLP, and other modalities.

dr_dshiv · 2025-12-20T13:36:26 1766237786

https://lmarena.ai/leaderboard/webdev

LM Arena shows Claude Opus 4.5 on top

HarHarVeryFunny · 2025-12-20T13:57:58 1766239078

I wonder how model competence and/or user preference on web development (that leaderboard) carries over to more complex and larger projects, or more generally anything other than web development ?

In addition to whatever they are exposed to as part of pre-training, it'd be interesting to know what kind of coding tasks these models are being RL-trained for? Are things like web development and maybe Python/ML coding overemphasized, or are they also being trained on things like Linux/Windows/embedded development etc in different languages?

ramoz · 2025-12-20T14:06:28 1766239588

https://x.com/giansegato/status/2002203155262812529/photo/1

https://x.com/METR_Evals/status/2002203627377574113

> Even Google's cheaper Gemini 3 Flash model seems to be slightly ahead of Opus 4.5.

What an insane take for anybody uses these models daily.

MrOrelliOReilly · 2025-12-20T14:18:59 1766240339

Yes, I personally feel that the "official" benchmarks are increasingly diverging from the everyday reality of using these models. My theory is that we are reaching a point where all the models are intelligent enough for day-to-day queries, so points like style/personality and proper use of web queries and other capabilities are better differentiators than intelligence alone.

fzzzy · 2025-12-20T16:09:15 1766246955

is x-high fast enough to use as a coding agent?

wahnfrieden · 2025-12-20T16:14:36 1766247276

Yes, if you parallelize your work, which you must learn to do if you want the best quality

wahnfrieden · 2025-12-20T16:24:34 1766247874

What am I missing? As suspicious as benchmarks are, your link shows GPT 5.2 to be superior.

It is also out of date as it does not include 5.2 Codex.

Per my point about steerability compensated for by modalities and other harness features: Opus 4.5 scores 58% while GPT 5.2 scores 75% for the instruction following benchmark in your link! Thanks for the hard evidence - GPT 5.2 is 30% ahead of Opus 4.5 there. No wonder Claude Code needs those harness features for the user to manually reign in control over its instruction following capability.

ccmcarey · 2025-12-20T08:34:06 1766219646

I disagree, the claude models seem the best at tool calling, opus 4.5 seems the smartest, and claude code (+ claude model) seems to make good use of subagents and planning in a way that codex doesn't

wahnfrieden · 2025-12-20T16:33:32 1766248412

Opus 4.5 is so bad at instruction following (30% worse per benchmark shared above) that it requires a manual toggle for plan mode.

GPT 5.2 simply obeys instruction to assemble a plan and avoids the need to compensate for poor steerability that would require the user to manually manage modalities.

Opus has improved though so the plan mode is less necessary than it was before, but it is still far behind state of art steerability.