Claude 3 surpasses GPT-4 on Chatbot Arena for the first time (arstechnica.com)
101 points by rntn on March 27, 2024 | 63 comments



I use Claude Sonnet for coding and it's better than GPT-4 most of the time. Something I am realising is that LLMs don't have any moat. Today it's OpenAI, tomorrow someone else.


I agree. My personal experience is that 80% of the time Opus is better than GPT-4 on coding.

Honestly, the only thing that sometimes keeps me preferring GPT-4 now is the UI. I like being able to edit my messages, and to stop the model if I gave it the wrong prompt. Please improve Claude's UI!

The interoperability between LLMs right now is amazing. When I write a program I can quickly test it with each of GPT, Claude and Gemini to see which works better for what I'm doing. Here's hoping nobody figures out how to create a moat any time soon!
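
Since all the major APIs take a plain list of chat messages, swapping providers takes only a few lines. A minimal sketch with the OpenAI and Anthropic Python SDKs (the prompt is made up; model names are as of this writing):

  from openai import OpenAI
  import anthropic

  prompt = "Refactor this function to be tail-recursive: ..."

  # same prompt to two providers; compare the answers side by side
  gpt = OpenAI().chat.completions.create(
      model="gpt-4",
      messages=[{"role": "user", "content": prompt}],
  )
  claude = anthropic.Anthropic().messages.create(
      model="claude-3-opus-20240229",
      max_tokens=1024,
      messages=[{"role": "user", "content": prompt}],
  )
  print(gpt.choices[0].message.content)
  print(claude.content[0].text)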


Claude's UI for handling files, though, is far superior.

Each of them does some things better.


Now we just need an ensemble model.


Tomorrow my desktop computer hopefully


I really doubt it.

"Tomorrow" your desktop computer might be twice powerful but at the same time the "good model of tomorrow" will be four or ten times larger - I'd expect that the gap between what can be done locally versus what is offered as a service will grow, not shrink.


The diminishing returns from model scale mean that if your personal computer improves twofold in the same time a datacenter improves fivefold, you may still have narrowed the quality gap.

That doesn't mean you'll be able to run the best model, but I'm relatively optimistic about the gap not growing out of control.
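
As a toy illustration (the curve and constants are made up; the point is just that any saturating quality curve behaves this way):

  # toy saturating quality curve: q approaches 1 as compute c grows
  q = lambda c: c / (c + 1)

  gap_now = q(100) - q(1)    # datacenter vs. desktop today   -> ~0.490
  gap_later = q(500) - q(2)  # 5x datacenter vs. 2x desktop   -> ~0.331
  print(gap_now, gap_later)  # raw compute gap grew, quality gap shrank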


Well, sure. One thing is that the absolute numbers do increase, so for any given notion of "good enough", every device will at some point reach the level where it can run it.


I run a home media server and can't wait to add my own LLM service. It's just a matter of time before it's something I can install over a weekend with the proper hardware.


Have you tried https://ollama.com/ ? You may find you already can.


  # fetch and build llama.cpp
  git clone https://github.com/ggerganov/llama.cpp
  cd llama.cpp
  make
  # serve a local quantized model with a 2048-token context window
  ./server -m models/7B/ggml-model.gguf -c 2048
I don't think it'll take you the whole weekend :)
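
Once the server is up, anything on your network can query it over plain HTTP. A minimal sketch in Python (assuming the default port and llama.cpp's /completion endpoint as of this writing):

  import requests

  # llama.cpp's built-in server listens on localhost:8080 by default
  resp = requests.post(
      "http://localhost:8080/completion",
      json={"prompt": "Explain GGUF in one sentence:", "n_predict": 128},
  )
  print(resp.json()["content"])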


One moat might be access to recent training data, as evidenced by the NYT lawsuit and recent deals in that space.


Do you have an IDE integration? What I find so great about Copilot/GPT-4 is how it's integrated into VSCode/JetBrains and can use the context you're in, like knowing what line you highlighted, what documents you have open, etc. Do you copy-paste into whatever chatbot you're using?


I had the same problem of copying and pasting code into LLM web UIs, so I built a small tool to streamline the process and add source code to the prompt: https://prompt.16x.engineer/

You can't rely on the IDE's automatic context, since the entire codebase is too large to feed into an LLM (maybe Claude 3's 200k-token context could take it, but that's too expensive). And RAG is not smart enough to figure out which parts of the code are relevant.
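
The manual alternative is to pick the relevant files yourself and splice them into the prompt with a bit of structure. A minimal sketch (file names and question are hypothetical; this is not how the linked tool works internally):

  from pathlib import Path

  # hand-picked files that actually matter for this question
  files = ["db/schema.sql", "app/service.py", "app/controller.py"]

  sections = [f"### {name}\n{Path(name).read_text()}" for name in files]
  prompt = "\n\n".join(sections) + "\n\nQuestion: why does the controller 500 on empty input?"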


> You can't rely on IDE auto context since the entire codebase is too large to feed into LLM

It's not feeding in the entire codebase, but whatever you selected or whatever files you have open; at least that's what the UI suggests. So I ask it "what does this line do" and I get an answer that uses the whole file to explain what the line does.


Yeah, for context within a single file, GitHub Copilot is good; I use it all the time.

But if you use it to do something across multiple files (DB schema, service, controller, HTML, JavaScript), it becomes less accurate (a precision-vs-recall problem), since, like you said, it uses your open windows or some heuristics to decide what the context is.

With an IDE as the interface, it is just not intuitive, UX-wise, to "open tabs" to signal to GitHub Copilot which files should be included in the context.


That works, but for more complex questions taking into account different files and the whole architecture of the app, Copilot fails. I've been trying to RAG my repos to accomplish this, but the parent comment said that's not possible.


I believe https://aider.chat/ is working on RAG for codebases.


Is GitHub Copilot using GPT-4 or 3.5? I've tried to find out for sure, but I can't seem to find the information anywhere.


I think 3.5; that was the last official word.

Copilot Chat uses GPT-4, but there's suspiciously little confirmation that it's also used in the more contextual, non-chat Copilot.


GitHub Copilot uses OpenAI Codex, which is a much older model fine-tuned from GPT-3.

Definitely not GPT-4; otherwise it wouldn't cost less than $10 a month for constant usage.


The chat part (mostly) uses GPT-4; you can also see which model is called in the request logs. Here is the official announcement: https://github.blog/changelog/2023-11-30-github-copilot-nove...


Okay, thanks for pointing that out.

I figure if they do this, they have to throttle or nerf it somehow, since it is cheaper than ChatGPT Plus, which also gives access to GPT-4.


It won't answer questions that are not somehow related to code or computing. I usually don't need anything else, so I haven't really tested the limits of that so far.


I'm sure one can just ask Claude to code the integration; it has to be that good.


Do you have any tips? I find Copilot so much worse when trying to use it in VSCode, even with the integration.

It just seems to do a much worse job than pasting your code into the chat UI.

Like, its answers are just profoundly bad in comparison.


I use Copilot in both VSCode and VS2022. I see vast differences, but they usually come down to the language I'm working with.

I've noticed that in Visual Studio (the IDE), Copilot gives better answers if I physically view an interface or implementation; then I get "okay" results. But it struggles with larger, more abstract projects.

VSCode is better for sure, but usually I'm working on smaller projects or in interpreted languages.

The Vim Copilot extension is probably the better one still, but again, I'm not working with .NET in Vim.


Cursor (an "AI-first" VSCode derivative) gives you contextual IDE integration with the main LLMs, including OpenAI and Claude. I haven't tried it myself, but I've heard good things.


> Something I am realising is that LLMs don't have any moat. Today it's OpenAI, tomorrow someone else.

I think you are correct, for chat. But for audio, video, 3D stuff, it will never be that easy for a newcomer.


Is generative AI easier (which isn't to say easy) than I assumed it'd be, and is it more limited by training data and training hardware than by model complexity?


Yes

Edit: although on some level, the training only gets you the general capabilities of the model. How you fine-tune it to specifically be a useful bot is a very important element. That's not really model complexity so much as design thinking and experimentation.



The last paragraph gets to this, but the ways engineers and scientists imagine the mind works are nothing like how it actually works.


We spent >$10K last month with OAI.

Their moat right now is developer tooling. They allow fine-tuning plus an easy API to use their LLM.

No one else does that right now. By the time others do, so much tooling and infrastructure will have been built around OAI that the switching costs will be significant.

It will get to the point that if you want your LLM to beat OAI in the market, it won't be enough to be as good or even better; you need to be very, very significantly better than OAI. For an extreme example of this, see Windows. The network effects keeping it together are so strong that the platform becoming abandoned adware hasn't been enough to push users to significantly better platforms like the Mac.

Now, I've fine-tuned the hell out of GPT-3.5 and I'd love to see how my app would perform on a fine-tuned Opus. I went to their website and I can't seem to fine-tune their model yet. Meh. My guess is that by the time they make it available, I won't have a strong reason to even try anymore.
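
For reference, OpenAI's fine-tuning flow is just two API calls once you have training data in their chat JSONL format. A minimal sketch (file name hypothetical):

  from openai import OpenAI

  client = OpenAI()

  # upload the training data, then kick off a fine-tuning job
  f = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
  job = client.fine_tuning.jobs.create(training_file=f.id, model="gpt-3.5-turbo")
  print(job.id, job.status)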


Not in my experience. Sonnet gets to the point quicker, but hallucinates more than GPT4. I'm keen to try Opus.


How do you use it for coding? I'd like my own little app that doesn't share my stuff online.


For coding I've found ChatGPT4 a bit better than Claude 3 Opus because it tends to understand my intentions more and I trust it to make better suggestions for code changes.

I think Claude has a better writing style, and it's refreshing not having to fight with the thing to get it to give me full code snippets. I also hate how difficult it is to get the arrow keys to scroll up or down in ChatGPT.


It's interesting that we all have different experiences.

I've found Claude 3 massively better than ChatGPT4 for my use cases.

ChatGPT4 will just get the question entirely wrong the majority of the time at high levels of complexity.

For example: here are 1k lines of code; find this bug and fix it.

ChatGPT4 gets it totally wrong; Claude 3 gets it right, or at the very least finds the right area.

This is repeatable over and over all day.

Edit: The funniest part of my workflow is when one LLM gets an idea wrong, I then put it to the other LLM.

It's a great way to get unstuck. Both LLMs get stuck on certain things, but they are usually different things.


It might be language or task dependent. I'm writing C++, and Claude is more likely to suggest things which simply won't compile. It feels like Claude 3's knowledge of C++ is a bit lacking comparatively.


For me, GPT4 seems to suggest more generic unit tests in Python. They're much more "put tests here" or "path.to.dependency".

Claude 3 Opus (and often Sonnet) actually fills in the full dependency paths, actually writes the tests, and overall just seems to "know what I want from it".


This tends to be my experience, but mainly when it comes to areas that neither model knows about. Claude 3 Opus will happily make something up, but ChatGPT4 will point out where it's lacking.


I wonder how well the Elo score handles the edge case where your most important games are against yourself. There are 4 GPT-4s in the top 10 (holding both #2 and #3) and 3 Claudes.

(To their credit, they count anything where the 95% confidence intervals overlap as a tie)
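
For reference, the standard Elo update is tiny. A minimal sketch (the K-factor and ratings are illustrative, not LMSYS's actual parameters):

  # expected score of A against B, then A's new rating after one game
  def elo_update(r_a, r_b, score_a, k=32):
      expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
      return r_a + k * (score_a - expected_a)

  # a win over a near-equal sibling shifts the rating by roughly k/2
  print(elo_update(1250, 1245, 1.0))  # ~1265.8

Since Elo is zero-sum per game, close matches between sibling models mostly shuffle points within the same family.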


Does Chatbot Arena do anything to avoid being gamed? You can just type “what is your name” and they generally identify themselves.


From the Chatbot Arena[1]:

> Vote won't be counted if model identity is revealed during conversation.

[1] https://arena.lmsys.org/


Aha, thanks, I missed that.


Given the sample size, it would be difficult for one actor alone to game it.


GPT4 feels so much more confident when coding because it is so fast. Claude Opus is generally better for me, but takes a long time to reply. It also has much lower limits and sometimes rejects medium-size files (75k) as too large. I find the most power comes from bouncing between them: when one stalls, hand the bad code over to the other and ask for fixes.


Is there any information on how many GPUs OpenAI and Anthropic have? And maybe even how many they can use for training?


Is there something that can make transcripts of YouTube videos in Hungarian when there are no subtitles? Preferably without requiring a login. I tried to do it today, but all the bots either claim to have no access to YouTube or try to BS me.


It's enticing to switch to Claude; however, it doesn't look like it includes any image generation or internet browsing plugins, which makes it more difficult to switch, tbh.


I tried Claude and dumped it instantly because the results for programming were nowhere near as good as ChatGPT 3.5.


Most of the hype is for their paid model; I also didn't have a good experience with their free offering.


Not "all the time" but "most of the time".


There are rumors that Sama has been holding back new versions until companies catch up to the latest OpenAI version. The "the king is dead" quote feels like it's meant to bring forth GPT-5.


Long live the king!


Nothing beats free (Copilot GPT-4).



Talking about Microsoft Copilot, not GitHub Copilot.

(Also, though, free MS Copilot only allows off-peak use of GPT-4/GPT-4 Turbo; paid MS Copilot is required for peak-hours use of those models.)


Microsoft seems to call multiple things "Copilot" these days. One of them uses GPT-4 for a ChatGPT-like experience. I also find that really confusing.


Is that the same thing under the hood as Bing? Somehow I’ve found it insufferable compared to the “real” ChatGPT.


Now the question is why they all seem to converge more or less around the same level of quality.


Still region-locked. One of the most infuriating things ever. A VPN fixes it, of course, but it's quite lame, imho.



