>> Good article, but note that if the hardware supports the division instruction, it will be much faster than the described workarounds.
There are systems where the numbers going into a calculation are similar to those from the previous iteration. Physics engines or real-time measurement systems come to mind. In those cases it is often fastest, while still being very precise, to seed with the number from the previous time-step and run one or two iterations of an algorithm like this.
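A minimal sketch of that warm-start idea for a reciprocal (the function and its assumptions are made up for illustration; it only pays off if the input really does change slowly between steps):

    // Hypothetical sketch: refine 1/x using last time-step's reciprocal as the seed.
    // Each Newton-Raphson step roughly doubles the number of correct bits, so if x
    // barely moved since the previous step, one or two iterations are plenty.
    double refine_reciprocal(double x, double prev_recip) {
        double r = prev_recip;       // seed: 1/x from the previous time-step
        r = r * (2.0 - x * r);       // Newton-Raphson step for f(r) = 1/r - x
        r = r * (2.0 - x * r);       // optional second step for extra precision
        return r;
    }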
Another classic is computing 1/sqrt(x) for normalizing a vector (even ignoring that silly floating point trick). I've also used similar methods for calculating sqrt(x) quickly based on the (valid in my case) assumption that my x value didn't change too much from the previous one.
> Another classic is computing 1/sqrt(x) for normalizing a vector
All AMD64 processors support the RSQRTPS instruction; on modern CPUs the latency is ≤ 3 cycles. Similar on ARM with the FRSQRTE instruction, although I don't know where to look up latencies for ARM CPUs.
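For reference, the usual pattern is to take the coarse RSQRTPS estimate and polish it with one Newton-Raphson step; a sketch with SSE intrinsics (my own illustration, not from the article):

    #include <immintrin.h>

    // RSQRTPS gives roughly 12 bits; one Newton-Raphson step,
    // y' = y * (1.5 - 0.5 * x * y * y), brings it close to full FP32 precision.
    __m128 rsqrt_refined(__m128 x) {
        const __m128 half  = _mm_set1_ps(0.5f);
        const __m128 three = _mm_set1_ps(3.0f);
        __m128 y   = _mm_rsqrt_ps(x);                      // approximate 1/sqrt(x)
        __m128 xyy = _mm_mul_ps(_mm_mul_ps(x, y), y);      // x * y * y
        return _mm_mul_ps(_mm_mul_ps(half, y),             // 0.5 * y * (3 - x*y*y)
                          _mm_sub_ps(three, xyy));
    }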
> for calculating sqrt(x) quickly
Modern CPUs compute a non-approximated square root in 12-14 cycles for FP32, and 13-20 cycles for FP64. That's already pretty quick.
I doubt Newton-Raphson is going to help much, if at all. FMA instructions take 4 cycles; if you don't have FMA, the multiply and add need 3-4 cycles each. And then there are the costs of instruction fetch, decode, and dispatch.
> I doubt Newton-Raphson gonna help much, if at all.
FWIW, in some machines in the past, division was implemented via Newton-Raphson and there was no other division hardware available. The implementation was either done in software or in microcode. The same statement applies to square roots as well.
The foundational technical result (due I think to Peter Markstein, but I can't find the paper right now) is that a sequence of 4 (or 5?) fused multiply/add instructions produces a correctly rounded division (or square root) in double precision. The fused part is essential: you must implement ab+c with a single rounding at the end, because if you round after ab the final result is no longer exact in all cases.
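To show the shape of such a sequence (a generic Newton-Raphson sketch, not Markstein's exact correctly-rounded algorithm; it assumes a rough reciprocal estimate is already available from an approximation instruction):

    #include <cmath>

    // Sketch of FMA-based division a/b, given a rough estimate r ≈ 1/b.
    // The residuals 1 - b*r and a - b*q are only meaningful because
    // std::fma rounds once, after the multiply-add.
    double div_via_fma(double a, double b, double r) {
        r = std::fma(std::fma(-b, r, 1.0), r, r);   // r += r * (1 - b*r)
        r = std::fma(std::fma(-b, r, 1.0), r, r);   // repeat until r is accurate enough
        double q = a * r;                           // tentative quotient
        return std::fma(std::fma(-b, q, a), r, q);  // q += r * (a - b*q)
    }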
Fused multiply/add instructions (FMA) were pioneered by the IBM RS/6000 and its Power/PowerPC descendants, and were adopted by Itanium, MIPS, and later x86/ARM. IIRC, Itanium had no division instruction at all, and the compiler would generate the 4 (or 5?) FMAs for you, properly pipelined. PowerPC had a division instruction that would use the FMA unit multiple times.
The last time I looked into this problem (in 2005 or so) there wasn't much latency difference between FMA-based implementations and a dedicated hardware divider. I don't know what tradeoffs people are making these days.
> last time I looked into this problem (in 2005 or so) there wasn't much latency difference between FMA-based implementation and a dedicated hardware divider
In 2005 or so, the newest and greatest CPUs were probably Core 2 Duo, i.e. the Conroe microarchitecture.
Hardware dividers have become much faster since then. Conroe took up to 20 cycles for an FP32 divide (DIVPS), and up to 34 cycles for an FP64 divide (DIVPD). AMD Zen 3 does them in 10 and 13 cycles, respectively.
Additions and multiplications improved too, but not nearly as much. FP32 multiplication (MULPS) took 4 cycles on Conroe, 3 cycles on Zen 3. FP64 multiplication (MULPD) took 5 cycles on Conroe, 3 cycles on Zen 3. Both FP32 and FP64 additions (ADDPS, ADDPD) still take the same 3 cycles, despite 16 years of technical progress.
Do you happen to know how these algorithms are implemented? Intel used to implement some variant of long-division (likely SRT), which takes O(n) cycles to divide two n-bit numbers. The Conroe numbers that you are quoting are consistent with an O(n) algorithm that processes two or perhaps three bits at each step.
However, the Zen 3 numbers that you quote are not consistent with an O(n) algorithm, but they are consistent with O(log n) steps of Newton-Raphson. Basically, do whatever you want to compute the single-precision answer in 10 cycles, and spend 3 cycles on one FMA to extend the result to double precision. Could it be that AMD is in fact implementing Newton-Raphson exploiting its 3-cycle FMA?
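(For context on the O(log n) claim: each Newton-Raphson step for the reciprocal squares the relative error, i.e. doubles the number of correct bits, so going from a coarse estimate to 53 bits takes only a handful of steps:)

    x_{n+1} = x_n (2 - a\,x_n), \qquad
    x_n = \frac{1 - e_n}{a} \;\Rightarrow\; x_{n+1} = \frac{1 - e_n^2}{a},
    \qquad e_{n+1} = e_n^2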
> Do you happen to know how these algorithms are implemented?
I do not. I never designed CPUs, only optimized different code for different processors; that's why I'm aware of these timings. BTW, for AMD64, this is the best resource I'm aware of: https://www.uops.info/table.html
> Could it be that AMD is in fact implementing Newton-Raphson exploiting its 3-cycle FMA?
Maybe; I can only guess. One more interesting point, though:
For modern Intel CPUs with lake-based names, the latency is close to Zen 3: 11-12 cycles for an FP32 divide, 13-15 cycles for an FP64 divide.
AMD only has two kinds of ALUs (two of each kind per core): one for scalar stuff and very basic SIMD (bitwise ops and integer add), and another for real SIMD, which seems to be a superset (FP, integer multiply, AES, etc.).
Intel cores, however, have a heterogeneous set of ALUs, which they call “execution units” or “execution ports”. On these lakes, FMA instructions can run on port 0 or 1, while division instructions have to be scheduled on port 0 only. However, the “fast approximate reciprocal” FP32 instruction, RCPPS, which at ~11 bits of precision and 4 cycles of latency looks like a good first step for Newton-Raphson, has to be scheduled on the same port 0 as the divide.
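The corresponding pattern for division with RCPPS, sketched with SSE intrinsics (a generic illustration; whether it actually beats DIVPS depends on the port pressure described above):

    #include <immintrin.h>

    // Approximate a/b: RCPPS gives ~11-12 bits, and one Newton-Raphson step on
    // the reciprocal brings it close to full FP32 precision.
    __m128 div_via_rcp(__m128 a, __m128 b) {
        const __m128 two = _mm_set1_ps(2.0f);
        __m128 r = _mm_rcp_ps(b);                               // r ≈ 1/b
        r = _mm_mul_ps(r, _mm_sub_ps(two, _mm_mul_ps(b, r)));   // r = r * (2 - b*r)
        return _mm_mul_ps(a, r);
    }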
Personally, I recently did what's described here in two cases: FP32 division on ARMv7, and FP64 division on GPUs that don't support that instruction.
For ARMv7 CPUs, not only do they have FRECPE, they also have FRECPS for the iteration step. An example is here: https://github.com/microsoft/DirectXMath/blob/jan2021/Inc/Di...
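The core of that NEON path looks roughly like this (a simplified sketch, not the exact DirectXMath code):

    #include <arm_neon.h>

    // FRECPE (vrecpeq_f32) gives a coarse reciprocal estimate; FRECPS (vrecpsq_f32)
    // computes the Newton-Raphson correction factor (2 - b*r) in one instruction.
    float32x4_t div_neon(float32x4_t a, float32x4_t b) {
        float32x4_t r = vrecpeq_f32(b);          // initial estimate of 1/b
        r = vmulq_f32(r, vrecpsq_f32(b, r));     // first refinement step
        r = vmulq_f32(r, vrecpsq_f32(b, r));     // second step, ~full FP32 precision
        return vmulq_f32(a, r);
    }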
For GPUs, Microsoft classified FP64 division as “extended double shader instruction” and the support is optional. However, GPUs are guaranteed to support FP32 division. The result of FP32 division provides an awesome starting point for Newton-Raphson refinement in FP64 precision.
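Roughly what that looks like, written here in C++ rather than actual shader code (a simplified sketch that ignores exponent edge cases; a production version would add another refinement step and guard against overflow in the FP32 seed):

    #include <cmath>

    // FP64 division where only FP32 divide is available in hardware: the FP32
    // reciprocal is correct to ~24 bits; one double-precision Newton-Raphson step
    // plus a quotient correction extends that to nearly full FP64.
    double div_fp64_from_fp32(double a, double b) {
        double r = (double)(1.0f / (float)b);        // FP32 reciprocal as the seed
        r = std::fma(std::fma(-b, r, 1.0), r, r);    // refine 1/b in FP64
        double q = a * r;                            // tentative quotient
        return std::fma(std::fma(-b, q, a), r, q);   // correct the quotient
    }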