Hey, asking any experts here: what are your first thoughts on the significance of this?

I.e., is this comparable to any other model that's been released, or are there significant metric differences that make it better for certain use cases?

The only thing I see, off the top of my head, is that it's a very large model, and I don't think any models of similar size have been released.



Not an expert by any means, but I like learning about this stuff and I play with a lot of open weight models.

I’d say the significance is that it happened. It’s by far the largest open weight model I’ve seen. But I’m not sure why you’d use it over a model like Mixtral, which seems to perform about the same at like 1/6th the size.

But I welcome any contribution to the open weight LLM community. Hopefully people will learn something interesting with this model. And I hope they keep releasing new versions!


If I may ask, how do you load such big models? 300GB seems like a lot to play around with.


You're right, this model is going to be too big for most people to play around with. But to answer your question: I have 128GB of RAM in my M3 MacBook Pro, so I can use most of that for GPU inference. Still, this model is going to need to be heavily quantized for me to be able to use it. (FWIW, I probably won't try this one.)

In the next week or two I expect we'll see a GGUF version of the weights (might need to wait for a patch to llama.cpp first), and someone will release super small quantizations of it. I suspect my computer might be able to run a 3-bit quant, but it might need to go down to 2 bits to have any kind of reasonable context length. With quants that small, though, I'd expect the model's performance to degrade well below that of Mixtral, so it probably isn't even worth using. But we'll see; quantization is weird, and some models hold up better than others when quantized.
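For a rough sense of scale (weights only; real GGUF quants carry a little scale/offset overhead, and the KV cache for your context needs memory on top of this):

    # Back-of-the-envelope weight sizes for a 314B-parameter model
    # at different quantization widths. Weights only -- actual GGUF
    # files run a bit larger, and the KV cache is extra.
    params = 314e9
    for bits in (16, 8, 4, 3, 2):
        print(f"{bits:>2}-bit: ~{params * bits / 8 / 1e9:,.0f} GB")

That puts a 3-bit quant around 118GB, a tight squeeze on a 128GB machine once the OS and KV cache take their share, and a 2-bit quant a bit under 80GB.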


>In the next week or two I expect we'll see a GGUF version of the weights (might need to wait for a patch to llama.cpp first), and someone will release super small quantizations of it.

How quickly are new models available through Ollama?


Ollama is just a wrapper around llama.cpp, so when the GGUF model files come out it'll be able to run on Ollama (assuming no llama.cpp patch is needed; but even if one is, Ollama is usually pretty quick about pulling those updates in).
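Once a GGUF lands in the Ollama library it's basically one pull away. Here's a sketch with the ollama Python client; the "grok-1" model tag is hypothetical, so check the library for whatever name it actually ships under:

    # Sketch using the ollama Python client (pip install ollama).
    # The "grok-1" tag is hypothetical -- substitute the real tag
    # once the model actually appears in the Ollama library.
    import ollama

    response = ollama.chat(
        model="grok-1",
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(response["message"]["content"])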


Few days max.


Thanks a lot for the hint :)! It's awesome that it might run even on a MacBook; honestly, that alone is a reason to switch to Mac. It seems there's nothing similar for a PC laptop running Linux or Windows.


No problem. I hope more people try these things out, it's the best way to push the industry forward! We can't let the researchers have all the fun.

Apple had plenty of reasons to move forward with their Apple Silicon CPUs and GPUs in the Mac, but they really do seem to have gotten lucky with the unified memory architecture. It was kind of just an artifact of their design, but it ends up serving the needs of deep neural net models really well!


A top-of-the-line Mac Studio Ultra maxes out at 192GB currently. This is also a MoE model, so only a fraction of parameters have to be in RAM.


MoE doesn’t really help with the memory requirements, for the reason mentioned in the other comment. But it does help with reducing the compute needed per inference, which is good because the M3 Max and M2 Ultra don’t have the best GPUs. A 70B parameter model is pretty slow on my M3 Max, and this model activates ~86B parameters per token.
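Back of the envelope, using the usual ~2 FLOPs per active parameter per generated token approximation (a rough sketch that ignores attention and other overhead):

    # Rough per-token forward-pass compute, dense vs. MoE.
    # Assumption: ~2 FLOPs per *active* parameter per token.
    def flops_per_token(active_params: float) -> float:
        return 2 * active_params

    dense = flops_per_token(314e9)  # if all 314B params were active
    moe = flops_per_token(86e9)     # Grok-1 activates ~86B per token
    print(f"dense: {dense:.2e} FLOPs/token")
    print(f"MoE:   {moe:.2e} FLOPs/token (~{dense / moe:.1f}x less)")

So roughly 3.7x less compute per token than a dense model of the same total size.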


Each token generated may only use a subset of the parameters (86 billion instead of 314 billion), but the next generated token might use a different subset. If it's anything like Mixtral, it will switch between experts constantly. It helps with memory bandwidth, but all the parameters still need to be in RAM, or it would be unbearably slow.
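A toy top-2 router makes the point (just a sketch, not Grok-1's actual routing code): every token independently picks its experts, so over even a short sequence most experts get touched:

    import numpy as np

    # Toy top-2 MoE router. Each token picks its own 2 experts, so
    # across a sequence nearly all experts get used -- every expert's
    # weights must stay resident even though each token computes with few.
    rng = np.random.default_rng(0)
    n_experts, n_tokens, d = 8, 16, 32
    router = rng.normal(size=(d, n_experts))   # router projection
    tokens = rng.normal(size=(n_tokens, d))    # per-token hidden states

    top2 = np.argsort(tokens @ router, axis=-1)[:, -2:]
    print("experts per token:", top2.tolist())
    print("distinct experts touched:", np.unique(top2).size, "of", n_experts)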


Tests are not out yet, but:

- It's very large, yes.

- It's a base model, so it's not really practical to use without further finetuning.

- Based on Grok-1 API performance (which itself is probably a finetune), it's... not great at all.


seems like a large undertrained model, not that exciting IMO compared to Mixtral

it's also not the biggest open-source model; Switch Transformer was released years ago and is larger and similarly undertrained



