
At least for the earlier Llama 70B demo, they claimed to be running unquantized. https://twitter.com/lifebypixels/status/1757619926360096852

Update: This comment says "some data is stored as FP8 at rest" and I don't know what that means. https://news.ycombinator.com/item?id=39432025



The weights are quantized to FP8 when they're stored in memory, but all the activations are computed at full FP16 precision.
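
For anyone curious what that looks like mechanically, here is a minimal sketch (my own illustration in PyTorch, not Groq's implementation) of "FP8 weights at rest, FP16 compute": weights get a per-tensor scale and are cast to FP8 for storage, then dequantized back to FP16 right before the matmul, so the arithmetic itself stays in FP16. Assumes PyTorch >= 2.1 for the float8 dtypes.

    import torch

    def store_fp8(w_fp16):
        # Per-tensor scale so the weight range fits FP8 E4M3 (max representable ~448).
        scale = w_fp16.abs().max().clamp(min=1e-12) / 448.0
        w_fp8 = (w_fp16 / scale).to(torch.float8_e4m3fn)  # stored form: 1 byte per weight
        return w_fp8, scale

    def linear_w8a16(x_fp16, w_fp8, scale):
        # Dequantize on the fly; the matmul itself runs in FP16.
        w_fp16 = w_fp8.to(torch.float16) * scale
        return x_fp16 @ w_fp16.t()

    w8, s = store_fp8(torch.randn(1024, 1024, dtype=torch.float16))
    y = linear_w8a16(torch.randn(4, 1024, dtype=torch.float16), w8, s)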


Can you explain whether this affects quality relative to FP16? And is Mixtral quantized?


We don't think so, but you be the judge! I believe we quantize both Mixtral and Llama 2 in this way.


Is your confidence rooted in quantified testing, or just vibes? I'm sure you're right, just curious. (My reasoning: running inference at full fp16 is borderline wasteful. You can use q7 with almost no loss.)
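
One straightforward way to move past vibes is to measure perplexity on the same held-out text with the full-precision and the quantized checkpoint and compare the two numbers. A sketch using Hugging Face transformers; the model name and the 8-bit loading path are illustrative assumptions on my part, not a description of how Groq evaluates:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def perplexity(model, tokenizer, text, window=512):
        ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
        nlls, n_tokens = [], 0
        for start in range(0, ids.size(1) - 1, window):
            chunk = ids[:, start : start + window + 1]
            if chunk.size(1) < 2:
                break
            with torch.no_grad():
                # Labels equal to inputs: the model shifts them by one internally.
                out = model(chunk, labels=chunk)
            n = chunk.size(1) - 1
            nlls.append(out.loss * n)
            n_tokens += n
        return torch.exp(torch.stack(nlls).sum() / n_tokens).item()

    # Run the same evaluation twice and compare, e.g.:
    # fp16 = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1",
    #            torch_dtype=torch.float16, device_map="auto")
    # int8 = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1",
    #            load_in_8bit=True, device_map="auto")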


I know some fancy benchmark says "almost no loss", but subjectively there is a clear quality loss. You can try it for yourself: I run Mixtral at 5.8bpw, and there is an OBVIOUS difference between what I have seen from Groq and my local setup, quite apart from Groq's sound-barrier-shattering speed. I didn't know Mixtral could output such nice code, and I have used it A LOT locally.


Yes, but this gray-area underperformance, which lets them claim to be the cheapest and fastest, appeals to people for whom qualitative (i.e., real) performance doesn't matter.


What quantified testing would you like to see? We've had a lot of very good feedback from our users, particularly about Mixtral.


Nothing really wrong with FP8 IMO; it usually performs within about 98% of FP16 quality while significantly reducing memory usage.
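
For a sense of scale, a back-of-the-envelope calculation of the weight memory alone (rough parameter counts I'm supplying for illustration, ignoring KV cache and activations):

    params = {"Llama 2 70B": 70e9, "Mixtral 8x7B": 46.7e9}
    for name, n in params.items():
        # 2 bytes per weight at FP16, 1 byte at FP8
        print(f"{name}: FP16 ~{n * 2 / 1e9:.0f} GB, FP8 ~{n / 1e9:.0f} GB")

That works out to roughly 140 GB vs 70 GB for a 70B model, which is why storing weights at FP8 is attractive even when the compute stays FP16.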



