
At least for the earlier Llama 70B demo, they claimed to be running unquantized. https://twitter.com/lifebypixels/status/1757619926360096852

Update: This comment says "some data is stored as FP8 at rest" and I don't know what that means. https://news.ycombinator.com/item?id=39432025



The weights are quantized to FP8 when they're stored in memory, but all the activations are computed at full FP16 precision.
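
For anyone curious what that looks like mechanically, here is a minimal sketch (my own illustration in PyTorch, not Groq's implementation) of "FP8 weights at rest, FP16 compute": weights get a per-tensor scale and are cast to FP8 for storage, then dequantized back to FP16 right before the matmul, so the arithmetic itself stays in FP16. Assumes PyTorch >= 2.1 for the float8 dtypes.

    import torch

    def store_fp8(w_fp16):
        # Per-tensor scale so the weight range fits FP8 E4M3 (max representable ~448).
        scale = w_fp16.abs().max().clamp(min=1e-12) / 448.0
        w_fp8 = (w_fp16 / scale).to(torch.float8_e4m3fn)  # stored form: 1 byte per weight
        return w_fp8, scale

    def linear_w8a16(x_fp16, w_fp8, scale):
        # Dequantize on the fly; the matmul itself runs in FP16.
        w_fp16 = w_fp8.to(torch.float16) * scale
        return x_fp16 @ w_fp16.t()

    w8, s = store_fp8(torch.randn(1024, 1024, dtype=torch.float16))
    y = linear_w8a16(torch.randn(4, 1024, dtype=torch.float16), w8, s)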


Can you explain whether this affects quality relative to FP16? And is Mixtral quantized?


We don't think so, but you be the judge! I believe we quantize both Mixtral and Llama 2 in this way.


Is your confidence rooted in quantified testing, or just vibes? I'm sure you're right, just curious. (My reasoning: running inference at full fp16 is borderline wasteful. You can use q7 with almost no loss.)
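
One straightforward way to move past vibes is to measure perplexity on the same held-out text with the full-precision and the quantized checkpoint and compare the two numbers. A sketch using Hugging Face transformers; the model name and the 8-bit loading path are illustrative assumptions on my part, not a description of how Groq evaluates:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def perplexity(model, tokenizer, text, window=512):
        ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
        nlls, n_tokens = [], 0
        for start in range(0, ids.size(1) - 1, window):
            chunk = ids[:, start : start + window + 1]
            if chunk.size(1) < 2:
                break
            with torch.no_grad():
                # Labels equal to inputs: the model shifts them by one internally.
                out = model(chunk, labels=chunk)
            n = chunk.size(1) - 1
            nlls.append(out.loss * n)
            n_tokens += n
        return torch.exp(torch.stack(nlls).sum() / n_tokens).item()

    # Run the same evaluation twice and compare, e.g.:
    # fp16 = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1",
    #            torch_dtype=torch.float16, device_map="auto")
    # int8 = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1",
    #            load_in_8bit=True, device_map="auto")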


I know some fancy benchmark says "almost no loss", but subjectively there is a clear quality loss. You can try it for yourself: I run Mixtral at 5.8bpw, and there is an OBVIOUS difference between what I have seen from Groq and my local setup, quite apart from Groq's sound-barrier-shattering speed. I didn't know Mixtral could output such nice code, and I have used it A LOT locally.


Yes, but this gray-area underperformance, which lets them claim to be the cheapest and fastest, appeals to people for whom qualitative (i.e., real) performance doesn't matter.


What quantified testing would you like to see? We've had a lot of very good feedback from our users, particularly about Mixtral.


Nothing really wrong with FP8 IMO; it usually performs within about 98% of FP16 quality while significantly reducing memory usage.
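
For a sense of scale, a back-of-the-envelope calculation of the weight memory alone (rough parameter counts I'm supplying for illustration, ignoring KV cache and activations):

    params = {"Llama 2 70B": 70e9, "Mixtral 8x7B": 46.7e9}
    for name, n in params.items():
        # 2 bytes per weight at FP16, 1 byte at FP8
        print(f"{name}: FP16 ~{n * 2 / 1e9:.0f} GB, FP8 ~{n / 1e9:.0f} GB")

That works out to roughly 140 GB vs 70 GB for a 70B model, which is why storing weights at FP8 is attractive even when the compute stays FP16.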



