More

mbitsnbites · 2025-02-14T13:12:54 1739538774

The model is based on Qwen2.5-Coder-7b it seems. I currently run some quantized variant of Qwen2.5-Coder-7b locally with llama.cpp and it fits nicely in the 8GB VRAM of my Radeon 7600 (with excellent performance BTW), so it looks like it should be perfectly possible.

I would also only use Zeta locally.

jbk · 2025-02-14T13:27:59 1739539679

Are you happy with the speed with your 8GB GPU?

mbitsnbites · on Jan 2, 2025

The big cores do. They essentially pump division through something like an FMA (fused multiply-add) unit, possibly the same unit that is used for multiplication and addition. That's for the Newton-Raphson steps, or Goldschmidt steps.

In hardware it's much easier to do a LUT-based approximation for the initial estimate rather than the subtraction trick, though.

It's common for CPUs to give 6-8 accurate bits in the approximation. x86 gives 13 accurate bits. Back in 1975, the Cray 1 gave 30 (!) accurate bits in the first approximation, and it didn't even have a division instruction (everything about that machine was big and fast).

mbitsnbites · on Jan 2, 2025

We have memcpy behind a C++ template function that mimics the interface of std::bit_cast.

mbitsnbites · on Jan 2, 2025

Your suggestion got me intrigued. I have a program that does an exhaustive check for maximum and average error, so I'll give your numbers a spin.

mbitsnbites · on Jan 2, 2025

Given my search criteria, the optimal magic number turns out to be: 0x7ef311c2

  Initial approximation:
    Good bits min: 4
    Good bits avg: 5.242649912834
    Error max: 0.0505102872849 (4.30728 bits)
    Error avg: 0.0327344845327 (4.93304 bits)

  1 NR step:
    Good bits min: 8
    Good bits avg: 10.642581939697
    Error max: 0.00255139507338 (8.61450 bits)
    Error avg: 0.00132373889641 (9.56117 bits)

  2 NR steps:
    Good bits min: 17
    Good bits avg: 19.922843217850
    Error max: 6.62494557693e-06 (17.20366 bits)
    Error avg: 2.62858584054e-06 (18.53728 bits)

  3 NR steps:
    Good bits min: 23
    Good bits avg: 23.674004554749
    Error max: 1.19249960972e-07 (22.99951 bits)
    Error avg: 3.44158509521e-08 (24.79235 bits)

Here, "good bits" is 24 minus the number of trailing non-zero-bits in the integer difference between the approximation and the correct value, looking at the IEEE 754 binary representation (if that makes sense).

Also, for the NR steps I used double precision for the inner (2.0 - x * y) part, then rounded to single precision, to simulate FMA, but single precision for the outer multiplication.

dahart · on Jan 2, 2025

Ah very nice, I was close with using max error - 0.05051 is the same number I got. Pretty sure 0x7ef311c2 came up for me at least a few times as I was fiddling with parameters. Is this using minimum good bits as the deciding criteria, or is it the best overall number using one of the averages and also 1-3 NR steps? Did you limit the input range, or use all finite floats? Having the min/avg error in bits is nice, it’s more intuitive than relative error.

I like the FMA simulation, that’s smart; I didn’t think about it. I did my search in Python. I don’t have it in front of me right now, and off the top of my head I’m not even sure whether my NR steps are in Python precision or fp32. :P My posts in this thread were with NR turned off, I wanted to find the best raw approximation and noticed I got a different magic number when using refinement. It really is an amazing trick, right? Even knowing how it works it still looks like magic when plotting the result.

Thanks for the update!

BTW I was also fiddling with another possible trick that is specific to reciprocal. I suspect you can simply negate all the bits except the sign and get something that’s a decent starting point for Newton iters, though it’s a much worse approximation than the subtraction. So maybe (x ^ 0x7fffffff). Not sure if negating the mantissa helps or if it’s better to negate only the exponent. I haven’t had time to analyze it properly yet, and I don’t know of any cases where it would be preferred, but I still think it’s another interesting/cute observation about how fp32 bits are stored.

mbitsnbites · on Jan 2, 2025

When measuring the errors I exhaustively iterate over all possible floats in the range [1, 2), by enumerating all IEEE 754 single precision representations in that range. That's "only" 2^23 numbers, so perfectly doable.

My selection criteria was abit complex, but something like this:

1. Maximize number of accurate bits in the approximation.

2. Same in NR step 1, then NR step 2 etc.

3. Minimize the max error in the approximation, and then the avg ertor in the approximation.

4. Same for NR step 1, 2, ...

mbitsnbites · on Jan 1, 2025

If you're not constrained to software solutions you have a whole world of opportunities. E.g. if it's a graphics or neural net pipeline you can pour tricks like this (or better) onto it. If it's a CPU then you can add special instructions that do exponent manipulation and the likes.

mbitsnbites · on Jan 1, 2025

Excellent! Will have a look.

mbitsnbites · on Jan 1, 2025

There is also the case with machines that lack FP support, like some ARM Cortex M variants.

mbitsnbites · on Jan 1, 2025

The proper solution is to use std::bit_cast in modern C++ or otherwise use memcpy, and of course know what you're doing.

Some things that could mess with you:

* Floating-point endianity is not the same as integer endianity.

* Floating-point alignment requirements differ from integer alignment requirements.

* The compuler is configured to use something else than 32-bit binary32 IEEE 754 for the type "float".

* The computer does not use two's complement arithmetic for integers.

In practice, these are not real problems.

mbitsnbites · on Dec 30, 2024

Yep. Along the same lines. This one is even simpler, though, as it requires only a single integer CPU instruction (and the simplest of all instructions too).

If you want full precision, you need to do three Newton-Raphson iterations after the initial approximation. One iteration is:

    y = y * (2.0F - x * y);

magicalhippo · on Dec 30, 2024

It's a neat trick, and could be very useful on microcontrollers which doesn't have hardware division but does have hardware multiplication.

actionfromafar · on Dec 31, 2024

Hm, like 68000?

pechay · on Dec 31, 2024

The 68k has both signed and unsigned 32/16 bit divide instructions.

mbitsnbites · on Oct 27, 2024

I'd argue that x86 and IBM z/Arch are the ones that stick out among contemporary ISAs in that they need fairly complex fron-end translation into an internal instruction format.

ARM implementations that support both ARMv7 (both ARM mode and THUMB mode) and ARMv8 also need some kind of translation in the front-end, but recent ARMs (like Apple's implementations) don't support 32-bit mode and are simpler in that way.

Most other ISAs are much closer to the metal, although many implementations still do some level of translation in the front-end (mostly fusing/splitting certain instruction combinations for better efficiency). Examples: ARMv8+, RISC-V, MIPS, Loongson, TI 6x & 7x DSP, and of course most GPUs (and my own MRISC32 ISA).