Claude 3 surpasses GPT-4 on Chatbot Arena for the first time (arstechnica.com)
101 points by rntn on March 27, 2024 | 63 comments



I use Claude Sonnet for coding and it's better than GPT-4 most of the time. Something I am realising is that LLMs don't have any moat. Today it's OpenAI, tomorrow someone else.


I agree. My personal experience is that 80% of the time Opus is better than GPT-4 on coding.

Honestly, the only thing that sometimes keeps me preferring GPT-4 now is the UI. I like being able to edit my messages, and to stop the model if I gave it the wrong prompt. Please improve Claude's UI!

The interoperability between LLMs right now is amazing. When I write a program I can quickly test it with each of GPT, Claude and Gemini to see which works better for what I'm doing. Here's hoping nobody figures out how to create a moat any time soon!
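
Since all the major APIs take a plain list of chat messages, swapping providers takes only a few lines. A minimal sketch with the OpenAI and Anthropic Python SDKs (the prompt is made up; model names are as of this writing):

  from openai import OpenAI
  import anthropic

  prompt = "Refactor this function to be tail-recursive: ..."

  # same prompt to two providers; compare the answers side by side
  gpt = OpenAI().chat.completions.create(
      model="gpt-4",
      messages=[{"role": "user", "content": prompt}],
  )
  claude = anthropic.Anthropic().messages.create(
      model="claude-3-opus-20240229",
      max_tokens=1024,
      messages=[{"role": "user", "content": prompt}],
  )
  print(gpt.choices[0].message.content)
  print(claude.content[0].text)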


Claude's UI for handling files, though, is far superior.

Each of them does some things better.


Now we just need an ensemble model.


Tomorrow my desktop computer hopefully


I really doubt it.

"Tomorrow" your desktop computer might be twice powerful but at the same time the "good model of tomorrow" will be four or ten times larger - I'd expect that the gap between what can be done locally versus what is offered as a service will grow, not shrink.


The diminishing returns from model scale mean that if your personal computer improves twofold in the same time a datacenter improves fivefold, you may still have narrowed the quality gap.

That doesn't mean you'll be able to run the best model, but I'm relatively optimistic about the gap not growing out of control.
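
As a toy illustration (the curve and constants are made up; the point is just that any saturating quality curve behaves this way):

  # toy saturating quality curve: q approaches 1 as compute c grows
  q = lambda c: c / (c + 1)

  gap_now = q(100) - q(1)    # datacenter vs. desktop today   -> ~0.490
  gap_later = q(500) - q(2)  # 5x datacenter vs. 2x desktop   -> ~0.331
  print(gap_now, gap_later)  # raw compute gap grew, quality gap shrank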


Well, sure. One thing is that the absolute numbers do increase, so for any given notion of "good enough", every device will at some point reach the level where it can run it.


I run a home media server and can't wait to add my own LLM service. It's just a matter of time before it's something I can install over a weekend with the proper hardware.


Have you tried https://ollama.com/ ? You may find you already can.


  # fetch and build llama.cpp
  git clone https://github.com/ggerganov/llama.cpp
  cd llama.cpp
  make
  # serve a local quantized model with a 2048-token context window
  ./server -m models/7B/ggml-model.gguf -c 2048
I don't think it'll take you the whole weekend :)
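
Once the server is up, anything on your network can query it over plain HTTP. A minimal sketch in Python (assuming the default port and llama.cpp's /completion endpoint as of this writing):

  import requests

  # llama.cpp's built-in server listens on localhost:8080 by default
  resp = requests.post(
      "http://localhost:8080/completion",
      json={"prompt": "Explain GGUF in one sentence:", "n_predict": 128},
  )
  print(resp.json()["content"])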


One moat might be access to recent training data, as evidenced by the NYT lawsuit and recent deals in that space.


Do you have an IDE integration? What I find so great about Copilot/GPT-4 is how it's integrated into VSCode/JetBrains and can use the context you're in, like knowing what line you highlighted, what documents you have open, etc. Do you copy-paste into whatever chatbot you're using?


I had the same problem of copying and pasting code into LLM web UIs, so I built a small tool to streamline the process and add source code to the prompt: https://prompt.16x.engineer/

You can't rely on the IDE's automatic context, since the entire codebase is too large to feed into an LLM (maybe Claude 3's 200k-token context could take it, but that's too expensive). And RAG is not smart enough to figure out which parts of the code are relevant.
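
The manual alternative is to pick the relevant files yourself and splice them into the prompt with a bit of structure. A minimal sketch (file names and question are hypothetical; this is not how the linked tool works internally):

  from pathlib import Path

  # hand-picked files that actually matter for this question
  files = ["db/schema.sql", "app/service.py", "app/controller.py"]

  sections = [f"### {name}\n{Path(name).read_text()}" for name in files]
  prompt = "\n\n".join(sections) + "\n\nQuestion: why does the controller 500 on empty input?"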


> You can't rely on IDE auto context since the entire codebase is too large to feed into LLM

It's not feeding in the entire codebase, but whatever you selected or whatever files you have open; at least that's what the UI suggests. So I ask it "what does this line do" and I get an answer that uses the whole file to explain what the line does.


Yeah, for context within a single file, GitHub Copilot is good; I use it all the time.

But if you use it to do something across multiple files (DB schema, service, controller, HTML, JavaScript), it becomes less accurate (a precision-vs-recall problem), since, like you said, it uses your open windows or some heuristics to decide what the context is.

With an IDE as the interface, it is just not intuitive, UX-wise, to "open tabs" to signal to GitHub Copilot which files should be included in the context.


That works, but for more complex questions taking into account different files and the whole architecture of the app, Copilot fails. I've been trying to RAG my repos to accomplish this, but the parent comment said that's not possible.


I believe https://aider.chat/ is working on RAG for codebases.


Is GitHub Copilot using GPT-4 or 3.5? I've tried to find out for sure, but I can't seem to find the information anywhere.


I think 3.5; that was the last official word.

Copilot Chat uses GPT-4, but there's suspiciously little confirmation that it's also used in the more contextual, non-chat Copilot.


GitHub Copilot uses OpenAI Codex, which is a much older model fine-tuned from GPT-3.

Definitely not GPT-4; otherwise it wouldn't cost less than $10 a month for constant usage.


The chat part (mostly) uses GPT-4; you can also see which model is called in the request logs. Here is the official announcement: https://github.blog/changelog/2023-11-30-github-copilot-nove...


Okay, thanks for pointing that out.

I figure if they do this, they have to throttle or nerf it somehow, since it is cheaper than ChatGPT Plus, which also gives access to GPT-4.


It won't answer questions that are not somehow related to code or computing. I usually don't need anything else, so I haven't really tested the limits of that so far.


I'm sure one can just ask Claude to code the integration; it has to be that good.


Do you have any tips? I find Copilot so much worse when trying to use it in VSCode, even with the integration.

It just seems to do a much worse job than pasting your code into the chat UI.

Like, its answers are just profoundly bad in comparison.


I use Copilot in both VSCode and VS2022. I see vast differences, but they usually come down to the language I'm working with.

I've noticed that in Visual Studio (the IDE), Copilot gives better answers if I physically view an interface or implementation; then I get "okay" results. But it struggles with larger, more abstract projects.

VSCode is better for sure, but usually I'm working on smaller projects or in interpreted languages.

The Vim Copilot extension is probably the better one still, but again, I'm not working with .NET in Vim.


Cursor (an "AI-first" VSCode derivative) gives you contextual IDE integration with the main LLMs, including OpenAI and Claude. I haven't tried it myself, but I've heard good things.


> Something I am realising is that LLMs don't have any moat. Today it's OpenAI, tomorrow someone else.

I think you are correct, for chat. But for audio, video, 3D stuff, it will never be that easy for a newcomer.


Is generative AI easier (which isn't to say easy) than I assumed it'd be, and is it more limited by training data and training hardware than by model complexity?


Yes

Edit: although on some level, the training only gets you the general capabilities of the model. How you fine-tune it to specifically be a useful bot is a very important element. That's not really model complexity so much as design thinking and experimentation.



The last paragraph gets to this, but the ways engineers and scientists imagine the mind works are nothing like how it actually works.


We spent >$10K last month with OAI.

Their moat right now is developer tooling. They allow fine-tuning plus an easy API to use their LLM.

No one else does that right now. By the time others do, so much tooling and infrastructure will have been built around OAI that the switching costs will be significant.

It will get to the point that if you want your LLM to beat OAI in the market, it won't be enough to be as good or even better; you need to be very, very significantly better than OAI. For an extreme example of this, see Windows. The network effects keeping it together are so strong that the platform becoming abandoned adware hasn't been enough to push users to significantly better platforms like the Mac.

Now, I've fine-tuned the hell out of GPT-3.5 and I'd love to see how my app would perform on a fine-tuned Opus. I went to their website and I can't seem to fine-tune their model yet. Meh. My guess is that by the time they make it available, I won't have a strong reason to even try anymore.
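
For reference, OpenAI's fine-tuning flow is just two API calls once you have training data in their chat JSONL format. A minimal sketch (file name hypothetical):

  from openai import OpenAI

  client = OpenAI()

  # upload the training data, then kick off a fine-tuning job
  f = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
  job = client.fine_tuning.jobs.create(training_file=f.id, model="gpt-3.5-turbo")
  print(job.id, job.status)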


Not in my experience. Sonnet gets to the point quicker, but hallucinates more than GPT4. I'm keen to try Opus.


How do you use it for coding? I'd like my own little app that doesn't share my stuff online.


For coding I've found ChatGPT4 a bit better than Claude 3 Opus because it tends to understand my intentions more and I trust it to make better suggestions for code changes.

I think Claude has a better writing style, and it's refreshing not having to fight with the thing to get it to give me full code snippets. I also hate how difficult it is to get the arrow keys to scroll up or down in ChatGPT.


It's interesting that we all have different experiences.

I've found Claude 3 massively better than ChatGPT4 for my use cases.

ChatGPT4 will just get the question entirely wrong the majority of the time at high levels of complexity.

For example: here are 1k lines of code; find this bug and fix it.

ChatGPT4 gets it totally wrong; Claude 3 gets it right, or at the very least finds the right area.

This is repeatable over and over all day.

Edit: The funniest part of my workflow is when one LLM gets an idea wrong, I then put it to the other LLM.

It's a great way to get unstuck. Both LLMs get stuck on certain things, but they are usually different things.


It might be language or task dependent. I'm writing C++, and Claude is more likely to suggest things which simply won't compile. It feels like Claude 3's knowledge of C++ is a bit lacking comparatively.


For me, GPT4 seems to suggest more generic unit tests in Python. They're much more "put tests here" or "path.to.dependency".

Claude 3 Opus (and often Sonnet) actually fills in the full dependency paths, actually writes the tests, and overall just seems to "know what I want from it".


This tends to be my experience, but mainly when it comes to areas that neither model knows about. Claude 3 Opus will happily make something up, but ChatGPT4 will point out where it's lacking.


I wonder how well the Elo score handles the edge case where your most important games are against yourself. There are 4 GPT-4s in the top 10 (holding both #2 and #3) and 3 Claudes.

(To their credit, they count anything where the 95% confidence intervals overlap as a tie)
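
For reference, the standard Elo update is tiny. A minimal sketch (the K-factor and ratings are illustrative, not LMSYS's actual parameters):

  # expected score of A against B, then A's new rating after one game
  def elo_update(r_a, r_b, score_a, k=32):
      expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
      return r_a + k * (score_a - expected_a)

  # a win over a near-equal sibling shifts the rating by roughly k/2
  print(elo_update(1250, 1245, 1.0))  # ~1265.8

Since Elo is zero-sum per game, close matches between sibling models mostly shuffle points within the same family.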


Does Chatbot Arena do anything to avoid being gamed? You can just type “what is your name” and they generally identify themselves.


From the Chatbot Arena[1]:

> Vote won't be counted if model identity is revealed during conversation.

[1] https://arena.lmsys.org/


Aha, thanks, I missed that.


Given the sample size, it would be difficult for one actor alone to game it.


GPT4 feels so much more confident when coding because it is so fast. Claude Opus is generally better for me, but takes a long time to reply. It also has much lower limits and sometimes rejects medium-size files (75k) as too large. I find the most power comes from bouncing between them: when one stalls, hand the bad code over to the other and ask for fixes.


Is there any information on how many GPUs OpenAI and Anthropic have? And maybe even how many they can use for training?


Is there something that can make transcripts of YouTube videos in Hungarian when there are no subtitles? Preferably without requiring a login. I tried to do it today, but all the bots either claim to have no access to YouTube or try to BS me.


It's enticing to switch to Claude; however, it doesn't look like it includes any image generation or internet browsing plugins, which makes it more difficult to switch, tbh.


I tried Claude and dumped it instantly because the results for programming were nowhere near as good as ChatGPT 3.5.


Most of the hype is for their paid model; I also didn't have a good experience with their free offering.


Not "all the time" but "most of the time".


There are rumors that Sama has been holding back new versions until companies catch up to the latest OpenAI version. The "the king is dead" quote feels like it's meant to bring forth GPT-5.


Long live the king!


Nothing beats free (Copilot GPT-4).



Talking about Microsoft Copilot, not GitHub Copilot.

(Also, though, free MS Copilot only allows off-peak use of GPT-4/GPT-4 Turbo; paid MS Copilot is required for peak-hours use of those models.)


Microsoft seems to call multiple things "Copilot" these days. One of them uses GPT-4 for a ChatGPT-like experience. I also find that really confusing.


Is that the same thing under the hood as Bing? Somehow I’ve found it insufferable compared to the “real” ChatGPT.


Now the question is why they all seem to converge more or less around the same level of quality.


Still region-locked. One of the most infuriating things ever. A VPN fixes it, of course, but it's quite lame, imho.



