How do compilers optimize divisions? (zneak.github.io)
166 points by moyix on Feb 20, 2017 | 51 comments


On this subject, these three blog posts are hard to beat:

http://ridiculousfish.com/blog/posts/labor-of-division-episo...

http://ridiculousfish.com/blog/posts/labor-of-division-episo...

http://ridiculousfish.com/blog/posts/labor-of-division-episo...

The same author wrote a library for doing this (and more; it also uses bit shifts) at runtime: http://libdivide.com

Of course, that only makes sense if you know you will do lots of divisions by the same number.
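
For example (a minimal sketch against libdivide's C API; sum_div is just an illustrative name):

    #include <stddef.h>
    #include <stdint.h>
    #include "libdivide.h"

    /* Precompute the magic numbers once for a divisor that is only
       known at runtime, then reuse them for every division. */
    uint32_t sum_div(const uint32_t *xs, size_t n, uint32_t d) {
        struct libdivide_u32_t div = libdivide_u32_gen(d);
        uint32_t sum = 0;
        for (size_t i = 0; i < n; ++i)
            sum += libdivide_u32_do(xs[i], &div);
        return sum;
    }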


Hacker's Delight - an excellent book in general - has an extensive chapter on this, for those who are interested in this topic.

http://www.hackersdelight.org/



I noticed the Granlund and Montgomery paper[1] cited in a comment. Apparently, Warren also cites the same paper in Hacker's Delight.

[1] https://doi.org/10.1145/773473.178249


He is also known for Montgomery modular multiplication, which is a technique widely used to efficiently implement RSA in hardware.


Here's a trick I came across recently which I found quite neat: you can test for divisibility by 3 using (x * 0xaaaaaaab) < 0x55555556. Same concept works for 5 and 15, but sadly not other factors.

LLVM does not currently generate that code, but you can make a case it should.
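
For the curious, a self-contained C sketch of the trick (my own illustration, not compiler output), relying on u32 wrap-around arithmetic:

    #include <stdint.h>
    #include <stdio.h>

    /* 3 * 0xaaaaaaab == 1 (mod 2^32), so multiplication by 0xaaaaaaab
       maps the multiples of 3 bijectively onto 0 .. 0x55555555 and
       everything else onto 0x55555556 .. 0xffffffff. */
    int divisible_by_3(uint32_t x) {
        return x * 0xaaaaaaabu < 0x55555556u;
    }

    int main(void) {
        for (uint32_t n = 0; n < 1000; ++n)
            if (divisible_by_3(n) != (n % 3 == 0))
                printf("mismatch at %u\n", n);
    }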


That's pretty cool, although it ought to be mentioned that x must be a uint32. Here's a proof:

    λ> import Data.SBV
    λ> prove $ do x <- Data.SBV.sWord32 "x"; return $ (x * 0xaaaaaaab .< 0x55555556) .== (x `sMod` 3 .== 0)
    Q.E.D.
Are you sure it doesn't work for other factors? It seems it works for 7, 9, and 11.

    λ> sat $ do [a,b] <- sWord16s ["a", "b"]; forAll_ (\x -> (x * a .< b) <=> (x `sMod` 7 .== 0))
    Satisfiable. Model:
      a = 28087 :: Word16
      b =  9363 :: Word16

    λ> sat $ do [a,b] <- sWord16s ["a", "b"]; forAll_ (\x -> (x * a .< b) <=> (x `sMod` 9 .== 0))
    Satisfiable. Model:
      a = 36409 :: Word16
      b =  7282 :: Word16

    λ> sat $ do [a,b] <- sWord16s ["a", "b"]; forAll_ (\x -> (x * a .< b) <=> (x `sMod` 11 .== 0))
    Satisfiable. Model:
      a = 35747 :: Word16
      b =  5958 :: Word16
So a is equal to 7^{-1} mod 2^16, and b to ceil(2^16/7), i.e. the number of multiples of 7 that fit in a Word16, etc.


Duh, you're right on both counts. It should work for any odd divisor; I was just computing the inverse in a dumb way when I tried other divisors. And I should have stated that this is u32 wrapping arithmetic.


And for even divisors you can separate the divisor into a power of two and an odd part. The odd part is checked as before, and the power of two is checked with an AND mask.
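
For example, divisibility by 12 = 4 * 3 might look like this (a sketch, under the same u32 wrapping assumptions as above):

    #include <stdint.h>

    /* 12 divides x iff 4 divides x (checked with an AND mask) and
       3 divides x (checked with the modular-inverse trick). */
    int divisible_by_12(uint32_t x) {
        return (x & 3u) == 0 && x * 0xaaaaaaabu < 0x55555556u;
    }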


An even trickier way to do this is to rotate right by the power of two factor between doing the multiply and the compare.


That wouldn't work, because a number that is bigger than the threshold could become smaller than it after the rotation.


Wow, that's really cool! Looks like this is a Haskell library: https://hackage.haskell.org/package/sbv-5.15/docs/Data-SBV.h...


Yes, it's really good! Makes it feasible to "sat first, ask questions later", especially for bit trickery.


Ah, it took me some time to understand why this works:

Every uint32 k can be expressed as k = 3*N mod 2^32, and we can recover N by multiplying by the modular inverse. k is divisible by 3 if and only if the multiplication 3*N doesn't "wrap around".

So: multiply by the modular inverse to get N, then check that N is small enough that 3N < 2^32, or equivalently that N < 2^32 / 3, i.e. N < 0x55555556.


Those are very convenient numbers for fizz buzz. Have you ever pulled out that trick in an interview?


Hmm, I wonder if this is more efficient than calculating mod 15 directly. It's basically something like

    unsigned gcd15(unsigned x) {
         /* 0xaaaaaaab = 3^{-1} mod 2^32, 0xcccccccd = 5^{-1} mod 2^32 */
         return ((x * 0xaaaaaaab) < (0xffffffff / 3 + 1) ? 3 : 1)
              * ((x * 0xcccccccd) < (0xffffffff / 5 + 1) ? 5 : 1);
    }
versus something like

    unsigned gcd15(unsigned x) {
        static const unsigned LOOKUP_TABLE[16] =
            /* gcd(n,15) for n = 0,...,15 */
            { 15, 1, 1, 3, 1, 5, 3, 1, 1, 3, 5, 1, 3, 1, 1, 15 };
        unsigned q = (x * ((1u << 28) / 15)) >> 28;  /* approx x / 15 */
        return LOOKUP_TABLE[x - 15 * q];
    }
Well, you're probably better off using the method in the article for the division, but this should work for small numbers.


Yes, just last week. I'm applying for a batch at Recurse Center as a sabbatical this fall, and wanted to put in something to show off a bit.


From what I understand, inlining is the key optimization that compilers make to speed up code. Once inlined, the excess computations of a function body can be removed or shortened, based on the calling context. With that in mind, how can the decompiler identify such "magic multiplication" division given that the compiler might mutate the division code at the callsite?


Register allocation is the key to victory.


My rule of thumb is: If an optimization reduces memory accesses, it is important today. Register allocation fits that pattern, because the alternative is to spill values to memory.

Disclaimer: That rule is restricted to fast processors, not low-power embedded stuff.

Inlining is not that important for speed by itself. It is rather an optimization booster, which makes many other optimizations much more effective. If your other optimizations are crap, then inlining will not help much, because the call overhead is not that big.


Registers are also the nodes in the evaluation DAG. Superscalar execution relies on having multiple edges in flight at once. Thus, good register allocation avoids bottlenecks and doesn't make the evaluation DAG look like a series of linked lists.

There's a nice overview for people with a passing interest here: http://www.lighterra.com/papers/basicinstructionscheduling/


How did I go so far without hearing the path of execution described as a DAG? Damn, that's a useful concept.

I might start prototyping some things in DAG notation just to avoid ever forgetting this conceptual abstraction.


Expressions are DAGs after common subexpression elimination. DAGs are what you get when you symbolically evaluate expressions on a stack machine without backward edges in control flow. Leave out the backward edges and the phi nodes they create, and you get DAGs from SSA form as well (but sequencing has information that you lose).


You might be interested in "FIRM—A Graph-Based Intermediate Representation" http://pp.ipd.kit.edu/publication.php?id=braun11wir

(I'm one of the authors)


Inlining makes for better register allocation :-)


Because you don't need registers for the function call ceremony?


Yes, and because you don't have to spill the arguments to the stack. Also, you optimize across the function call boundary.


Because you can allocate registers in a way that is particular to a specific call's context, rather than in the generic context of a called function.


Probably I'm missing something here:

Why can't one just multiply by the inverse of 19 (which can be calculated at compile time)?


On most modern platforms, the cost of each operation, from smallest to largest, is roughly something like this[0]:

1. Bit shifting

2. Integer multiplication

3. Integer division

4. Floating point multiplication

The trick in the article works because the cost of 1 + 2 is still smaller than 3.

Multiplying with the inverse of 19, a floating point number, wouldn't work because 4 is more costly than 3.

[0] http://nicolas.limare.net/pro/notes/2014/12/12_arit_speed/


That page has a warning at the top:

> IMPORTANT: Useful feedback revealed that some of these measures are seriously flawed. A major update is on the way.

Looking over the results, some of the numbers are off.

On Intel CPUs, FP multiplication is faster than integer division. That might not be true on ARM CPUs, which generally have slower FPUs.

On Skylake, for example, 32-bit unsigned integer division has a 26 cycle latency with a throughput of 1 instruction / 6 cycles, while 32/64-bit floating point multiplication has a 4 cycle latency with a throughput of 2 instructions / cycle.

Source: http://agner.org/optimize/instruction_tables.pdf


The crucial point, however, is that while FP multiplication is faster than integer division, converting between floating-point and integer values is very slow: cvtsi2ss and cvtss2si have a latency of 6 cycles each, adding 12 cycles of latency to each of these multiplications.

For divisions by a constant value that don't easily decompose into shifts you can fall back to multiplication by a magic constant which is the integer reciprocal. (This is also something compilers do and is what's being explained in the article.)


Multiplying by the integer reciprocal only works if the dividend is an integer multiple of the divisor.

What's being explained in the article is multiplying by a fraction whose value is close to the rational reciprocal of the divisor, where the denominator of the fraction is a power of two (so dividing by the denominator can be done with a shift).

The fraction in this case is (2938661835 + 2^32) / 2^37.
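
Concretely, here is a C sketch of the resulting multiply-and-shift for n / 19 (my reconstruction of the usual codegen; the (n - hi) >> 1 step exists because n + hi can overflow 32 bits):

    #include <stdint.h>
    #include <stdio.h>

    /* n / 19 == (n * (2938661835 + 2^32)) >> 37 for all 32-bit n.
       Split the multiply: hi is the high word of n * 2938661835,
       and ((n - hi) >> 1) + hi computes (n + hi) >> 1 overflow-free. */
    uint32_t div19(uint32_t n) {
        uint32_t hi = (uint32_t)(((uint64_t)n * 2938661835u) >> 32);
        return (((n - hi) >> 1) + hi) >> 4;
    }

    int main(void) {
        for (uint32_t n = 0; n < 1000000; ++n)  /* spot check */
            if (div19(n) != n / 19)
                printf("mismatch at %u\n", n);
    }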


It does work, though not as well as using an integer multiply.

The approximate latencies for Skylake are:

    div --> 26 cycles
    cvtsi2sd + mulsd + cvttsd2siq --> 6 + 4 + 6 = 16 cycles
I did a quick (and imperfect) microbenchmark and got these results:

    Real integer division (-Os) --> 1.392s
    FPU Multiply (-Os)          --> 0.243s
    FPU Multiply (-O2)          --> 0.197s
    Integer Multiply (-O2)      --> 0.164s
The code:

    #include <stdio.h>

    int main() {
        volatile unsigned x;
        for (unsigned n = 0; n < 100000000; ++n) {
    #if 1 /* Change to 0 to use FPU. */
            /*
            Compile with -Os to get GCC to emit div instruction.
            -O2 to emit integer multiply.
            Clang emits integer multiply, even with -Os.
            */
            x = n / 19;
    #else
            /* Use the FPU. */
            x = (double)n * (1.0 / 19.0);
    #endif
        }
    }


Is it correct? What is x for n < 19 for example?


It works for 32-bit unsigned integers and double precision floats.

For n < 19, "(double)n * (1.0 / 19.0)" evaluates to a double between 0.0 and 1.0, then it is truncated to 0 when it is implicitly converted to unsigned int.

Since there are only 2^32 values for 32-bit integers, it is possible to test all values in under a minute:

    #include <stdio.h>
    #include <stdint.h>

    int main() {
        uint32_t n = 0;
        do {
            uint32_t a = n / 19;
            uint32_t b = (double)n * (1.0 / 19.0);
            if (a != b) {
                printf("Not equal for n = %u\n", n);
            }
            ++n;
        } while (n != 0);
    }


Note that this approach, done naively, will produce wrong answers for lots of divisors, because the reciprocal will not be exactly representable. For example, 49 * (1.0/49) is less than one.

If you round the FP division up (to the next largest representable value, that is), it should be correct, for 32 bit integer types at least.
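
That is easy to check exhaustively, in the style of the test upthread (a sketch; nextafter from math.h rounds 1.0/49 up to the next representable double, link with -lm):

    #include <stdio.h>
    #include <stdint.h>
    #include <math.h>

    int main(void) {
        double inv = nextafter(1.0 / 49.0, 1.0);  /* rounded up */
        uint32_t n = 0;
        do {
            if ((uint32_t)((double)n * inv) != n / 49)
                printf("Not equal for n = %u\n", n);
            ++n;
        } while (n != 0);
    }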


This is an unsigned int, so the expectation would be to do integer division, which would not get you the same result as multiplying by the inverse.


Couldn't that be addressed via a floor operation afterwards?


Because unless you are on a video card (e.g. doing SIMD) floating point operations are very very expensive.


That is not true; compare kr7's comment [1]:

> On Skylake [...] 32/64-bit floating point multiplication has a 4 cycle latency with a throughput of 2 instructions / cycle.

Of course, there are some operations that are very expensive (trigonometric functions, for example), but they're not necessary here, and they're also very expensive on the GPU.

[1] https://news.ycombinator.com/item?id=13693749


> The result of 7233629131 / 137438953472 is 18.999999997649866.

This is wrong; the divisor and dividend are the wrong way around.

...and in case anyone else is wondering how those numbers relate to the constant in the code, it's

  2^37 / (2938661835 + 2^32)


It's been fixed.


Is there a reason compilers don't just use modular inverses and let overflow take care of things?

19 * 678,152,731 = 1 (mod 2^32)

19 * 9,708,812,670,373,448,219 = 1 (mod 2^64)

As long as we already know the size of an int, Euclid's algorithm should be able to find these inverses if they exist.
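
(They exist exactly when the divisor is odd.) Besides Euclid, a compact way to compute them for power-of-two moduli is Newton's iteration; here is a sketch (my own illustration, not what any particular compiler does):

    #include <stdint.h>

    /* Inverse of an odd d modulo 2^32. x = d is already correct to
       3 bits, since d*d == 1 (mod 8) for odd d; each x *= 2 - d*x
       step doubles the precision: 3 -> 6 -> 12 -> 24 -> 48 bits. */
    uint32_t inverse_mod_2_32(uint32_t d) {
        uint32_t x = d;
        for (int i = 0; i < 4; ++i)
            x *= 2 - d * x;
        return x;  /* inverse_mod_2_32(19) == 678152731 */
    }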


That only works when the value is known to be a multiple of whatever you're dividing by, and they do use that when it is the case, such as for pointer subtraction.
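
A sketch of that exact-division case (illustrative name; the precondition is what makes it valid):

    #include <assert.h>
    #include <stdint.h>

    /* Exact division by 19 via the modular inverse. Correct only when
       19 divides x, e.g. when x is a byte difference between pointers
       into an array of 19-byte elements. */
    uint32_t exact_div19(uint32_t x) {
        assert(x % 19 == 0);
        return x * 678152731u;  /* 19 * 678152731 == 1 (mod 2^32) */
    }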


I don't see how finding the modular inverse helps with division.

What algorithm do you have in mind?


It could also be a neat way to work around the Intel Pentium's FDIV bug [1].

[1] https://en.wikipedia.org/wiki/Pentium_FDIV_bug


I'm not sure I follow your line of thought, given that the post is talking about integer division and FDIV is floating-point division.


Well, you can build a floating-point DIV out of an integer DIV, I suppose. Roughly speaking, you divide the (integer) mantissas, subtract the exponents, and normalize.


Cool. Time to dust off that old PC in the attic.


Because people are totally going to be concerned about working around a bug that was fixed over 20 years ago.



