How do compilers optimize divisions? (zneak.github.io)
166 points by moyix on Feb 20, 2017 | 51 comments


On this subject, these three blog posts are hard to beat:

http://ridiculousfish.com/blog/posts/labor-of-division-episo...

http://ridiculousfish.com/blog/posts/labor-of-division-episo...

http://ridiculousfish.com/blog/posts/labor-of-division-episo...

The same author wrote a library for doing this (and more; it also uses bit shifts) at runtime: http://libdivide.com

Of course, that only makes sense if you know you will do lots of divisions by the same number.
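
For example (a minimal sketch against libdivide's C API; sum_div is just an illustrative name):

    #include <stddef.h>
    #include <stdint.h>
    #include "libdivide.h"

    /* Precompute the magic numbers once for a divisor that is only
       known at runtime, then reuse them for every division. */
    uint32_t sum_div(const uint32_t *xs, size_t n, uint32_t d) {
        struct libdivide_u32_t div = libdivide_u32_gen(d);
        uint32_t sum = 0;
        for (size_t i = 0; i < n; ++i)
            sum += libdivide_u32_do(xs[i], &div);
        return sum;
    }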


Hacker's Delight - an excellent book in general - has an extensive chapter on this, for those who are interested in this topic.

http://www.hackersdelight.org/



I noticed the Granlund and Montgomery paper[1] cited in a comment. Apparently, Warren also cites the same paper in Hacker's Delight.

[1] https://doi.org/10.1145/773473.178249


He is also known for Montgomery modular multiplication, which is a technique widely used to efficiently implement RSA in hardware.


Here's a trick I came across recently which I found quite neat: you can test for divisibility by 3 using (x * 0xaaaaaaab) < 0x55555556. Same concept works for 5 and 15, but sadly not other factors.

LLVM does not currently generate that code, but you can make a case it should.
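
For the curious, a self-contained C sketch of the trick (my own illustration, not compiler output), relying on u32 wrap-around arithmetic:

    #include <stdint.h>
    #include <stdio.h>

    /* 3 * 0xaaaaaaab == 1 (mod 2^32), so multiplication by 0xaaaaaaab
       maps the multiples of 3 bijectively onto 0 .. 0x55555555 and
       everything else onto 0x55555556 .. 0xffffffff. */
    int divisible_by_3(uint32_t x) {
        return x * 0xaaaaaaabu < 0x55555556u;
    }

    int main(void) {
        for (uint32_t n = 0; n < 1000; ++n)
            if (divisible_by_3(n) != (n % 3 == 0))
                printf("mismatch at %u\n", n);
    }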


That's pretty cool, although it ought to be mentioned that x must be a uint32. Here's a proof:

    λ> import Data.SBV
    λ> prove $ do x <- Data.SBV.sWord32 "x"; return $ (x * 0xaaaaaaab .< 0x55555556) .== (x `sMod` 3 .== 0)
    Q.E.D.
Are you sure it doesn't work for other factors? It seems it works for 7, 9, and 11.

    λ> sat $ do [a,b] <- sWord16s ["a", "b"]; forAll_ (\x -> (x * a .< b) <=> (x `sMod` 7 .== 0))
    Satisfiable. Model:
      a = 28087 :: Word16
      b =  9363 :: Word16

    λ> sat $ do [a,b] <- sWord16s ["a", "b"]; forAll_ (\x -> (x * a .< b) <=> (x `sMod` 9 .== 0))
    Satisfiable. Model:
      a = 36409 :: Word16
      b =  7282 :: Word16

    λ> sat $ do [a,b] <- sWord16s ["a", "b"]; forAll_ (\x -> (x * a .< b) <=> (x `sMod` 11 .== 0))
    Satisfiable. Model:
      a = 35747 :: Word16
      b =  5958 :: Word16
So a is equal to 7^{-1} mod 2^16, and b to ceil(2^16/7), i.e. the number of multiples of 7 that fit in a Word16, etc.


Duh, you're right on both counts. It should work for any odd divisor; I was just computing the inverse in a dumb way when I tried other divisors. And I should have stated that this is u32 wrapping arithmetic.


And for even divisors you can separate the divisor into a power of two and an odd part. The odd part is checked as before, and the power of two is checked with an AND mask.
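
For example, divisibility by 12 = 4 * 3 might look like this (a sketch, under the same u32 wrapping assumptions as above):

    #include <stdint.h>

    /* 12 divides x iff 4 divides x (checked with an AND mask) and
       3 divides x (checked with the modular-inverse trick). */
    int divisible_by_12(uint32_t x) {
        return (x & 3u) == 0 && x * 0xaaaaaaabu < 0x55555556u;
    }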


An even trickier way to do this is to rotate right by the power of two factor between doing the multiply and the compare.


That wouldn't work, because a number that is bigger than the threshold could become smaller than it after the rotation.


Wow, that's really cool! Looks like this is a Haskell library: https://hackage.haskell.org/package/sbv-5.15/docs/Data-SBV.h...


Yes, it's really good! Makes it feasible to "sat first, ask questions later", especially for bit trickery.


Ah, it took me some time to understand why this works:

Every uint32 k can be expressed as k = 3*N mod 2^32, and we can recover N by multiplying by the modular inverse. k is divisible by 3 if and only if the multiplication 3*N doesn't "wrap around".

So: multiply by the modular inverse to get N, then check that N is small enough that 3N < 2^32, or equivalently that N < 2^32 / 3, i.e. N < 0x55555556.


Those are very convenient numbers for fizz buzz. Have you ever pulled out that trick in an interview?


Hmm, I wonder if this is more efficient than calculating mod 15 directly. It's basically something like

    unsigned gcd15(unsigned x) {
         /* 0xaaaaaaab = 3^{-1} mod 2^32, 0xcccccccd = 5^{-1} mod 2^32 */
         return ((x * 0xaaaaaaab) < (0xffffffff / 3 + 1) ? 3 : 1)
              * ((x * 0xcccccccd) < (0xffffffff / 5 + 1) ? 5 : 1);
    }
versus something like

    unsigned gcd15(unsigned x) {
        static const unsigned LOOKUP_TABLE[16] =
            /* gcd(n,15) for n = 0,...,15 */
            { 15, 1, 1, 3, 1, 5, 3, 1, 1, 3, 5, 1, 3, 1, 1, 15 };
        unsigned q = (x * ((1u << 28) / 15)) >> 28;  /* approx x / 15 */
        return LOOKUP_TABLE[x - 15 * q];
    }
Well, you're probably better off using the method in the article for the division, but this should work for small numbers.


Yes, just last week. I'm applying for a batch at Recurse Center as a sabbatical this fall, and wanted to put in something to show off a bit.


From what I understand, inlining is the key optimization that compilers make to speed up code. Once inlined, the excess computations of a function body can be removed or shortened, based on the calling context. With that in mind, how can the decompiler identify such "magic multiplication" division given that the compiler might mutate the division code at the callsite?


Register allocation is the key to victory.


My rule of thumb is: If an optimization reduces memory accesses, it is important today. Register allocation fits that pattern, because the alternative is to spill values to memory.

Disclaimer: That rule is restricted to fast processors, not low-power embedded stuff.

Inlining is not that important for speed by itself. It is rather an optimization booster, which makes many other optimizations much more effective. If your other optimizations are crap, then inlining will not help much, because the call overhead is not that big.


Registers are also the nodes in the evaluation DAG. Superscalar execution relies on having multiple edges in flight at once. Thus, good register allocation avoids bottlenecks and doesn't make the evaluation DAG look like a series of linked lists.

There's a nice overview for people with a passing interest here: http://www.lighterra.com/papers/basicinstructionscheduling/


How did I go so far without hearing the path of execution described as a DAG? Damn, that's a useful concept.

I might start prototyping some things in DAG notation just to avoid ever forgetting this conceptual abstraction.


Expressions are DAGs after common subexpression elimination. DAGs are what you get when you symbolically evaluate expressions on a stack machine without backward edges in control flow. Leave out the backward edges and the phi nodes they create, and you get DAGs from SSA form as well (but sequencing has information that you lose).


You might be interested in "FIRM—A Graph-Based Intermediate Representation" http://pp.ipd.kit.edu/publication.php?id=braun11wir

(I'm one of the authors)


Inlining makes for better register allocation :-)


Because you don't need registers for the function call ceremony?


Yes, and because you don't have to spill the arguments to the stack. Also, you optimize across the function call boundary.


Because you can allocate registers in a way that is particular to a specific call's context, rather than in the generic context of a called function.


Probably I'm missing something here:

Why can't one just multiply by the inverse of 19 (which can be calculated at compile time)?


On most modern platforms, the cost of each operation, from smallest to largest, is roughly something like this[0]:

1. Bit shifting

2. Integer multiplication

3. Integer division

4. Floating point multiplication

The trick in the article works because the cost of 1 + 2 is still smaller than 3.

Multiplying with the inverse of 19, a floating point number, wouldn't work because 4 is more costly than 3.

[0] http://nicolas.limare.net/pro/notes/2014/12/12_arit_speed/


That page has a warning at the top:

> IMPORTANT: Useful feedback revealed that some of these measures are seriously flawed. A major update is on the way.

Looking over the results, some of the numbers are off.

On Intel CPUs, FP multiplication is faster than integer division. That might not be true on ARM CPUs, which generally have slower FPUs.

On Skylake, for example, 32-bit unsigned integer division has a 26 cycle latency with a throughput of 1 instruction / 6 cycles, while 32/64-bit floating point multiplication has a 4 cycle latency with a throughput of 2 instructions / cycle.

Source: http://agner.org/optimize/instruction_tables.pdf


The crucial point, however, is that while FP multiplication is faster than integer division, converting between floating-point and integer values is very slow: cvtsi2ss and cvtss2si have a latency of 6 cycles each, adding 12 cycles of latency to each of these multiplications.

For divisions by a constant value that don't easily decompose into shifts you can fall back to multiplication by a magic constant which is the integer reciprocal. (This is also something compilers do and is what's being explained in the article.)


Multiplying by the integer reciprocal only works if the dividend is an integer multiple of the divisor.

What's being explained in the article is multiplying by a fraction whose value is close to the rational reciprocal of the divisor, where the denominator of the fraction is a power of two (so dividing by the denominator can be done with a shift).

The fraction in this case is (2938661835 + 2^32) / 2^37.
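
Concretely, here is a C sketch of the resulting multiply-and-shift for n / 19 (my reconstruction of the usual codegen; the (n - hi) >> 1 step exists because n + hi can overflow 32 bits):

    #include <stdint.h>
    #include <stdio.h>

    /* n / 19 == (n * (2938661835 + 2^32)) >> 37 for all 32-bit n.
       Split the multiply: hi is the high word of n * 2938661835,
       and ((n - hi) >> 1) + hi computes (n + hi) >> 1 overflow-free. */
    uint32_t div19(uint32_t n) {
        uint32_t hi = (uint32_t)(((uint64_t)n * 2938661835u) >> 32);
        return (((n - hi) >> 1) + hi) >> 4;
    }

    int main(void) {
        for (uint32_t n = 0; n < 1000000; ++n)  /* spot check */
            if (div19(n) != n / 19)
                printf("mismatch at %u\n", n);
    }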


It does work, though not as well as using an integer multiply.

The approximate latencies for Skylake are:

    div --> 26 cycles
    cvtsi2sd + mulsd + cvttsd2siq --> 6 + 4 + 6 = 16 cycles
I did a quick (and imperfect) microbenchmark and got these results:

    Real integer division (-Os) --> 1.392s
    FPU Multiply (-Os)          --> 0.243s
    FPU Multiply (-O2)          --> 0.197s
    Integer Multiply (-O2)      --> 0.164s
The code:

    #include <stdio.h>

    int main() {
        volatile unsigned x;
        for (unsigned n = 0; n < 100000000; ++n) {
    #if 1 /* Change to 0 to use FPU. */
            /*
            Compile with -Os to get GCC to emit div instruction.
            -O2 to emit integer multiply.
            Clang emits integer multiply, even with -Os.
            */
            x = n / 19;
    #else
            /* Use the FPU. */
            x = (double)n * (1.0 / 19.0);
    #endif
        }
    }


Is it correct? What is x for n < 19 for example?


It works for 32-bit unsigned integers and double precision floats.

For n < 19, "(double)n * (1.0 / 19.0)" evaluates to a double between 0.0 and 1.0, then it is truncated to 0 when it is implicitly converted to unsigned int.

Since there are only 2^32 values for 32-bit integers, it is possible to test all values in under a minute:

    #include <stdio.h>
    #include <stdint.h>

    int main() {
        uint32_t n = 0;
        do {
            uint32_t a = n / 19;
            uint32_t b = (double)n * (1.0 / 19.0);
            if (a != b) {
                printf("Not equal for n = %u\n", n);
            }
            ++n;
        } while (n != 0);
    }


Note that this approach, done naively, will produce wrong answers for lots of divisors, because the reciprocal will not be exactly representable. For example, 49 * (1.0/49) is less than one.

If you round the FP division up (to the next largest representable value, that is), it should be correct, for 32 bit integer types at least.
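
That is easy to check exhaustively, in the style of the test upthread (a sketch; nextafter from math.h rounds 1.0/49 up to the next representable double, link with -lm):

    #include <stdio.h>
    #include <stdint.h>
    #include <math.h>

    int main(void) {
        double inv = nextafter(1.0 / 49.0, 1.0);  /* rounded up */
        uint32_t n = 0;
        do {
            if ((uint32_t)((double)n * inv) != n / 49)
                printf("Not equal for n = %u\n", n);
            ++n;
        } while (n != 0);
    }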


This is an unsigned int, so the expectation would be to do integer division, which would not get you the same result as multiplying by the inverse.


Couldn't that be addressed via a floor operation afterwards?


Because unless you are on a video card (e.g. doing SIMD) floating point operations are very very expensive.


That is not true; compare kr7's comment [1]:

> On Skylake [...] 32/64-bit floating point multiplication has a 4 cycle latency with a throughput of 2 instructions / cycle.

Of course, there are some operations that are very expensive (trigonometric functions, for example), but they're not necessary here, and they're also very expensive on the GPU.

[1] https://news.ycombinator.com/item?id=13693749


> The result of 7233629131 / 137438953472 is 18.999999997649866.

This is wrong; the divisor and dividend are the wrong way around.

...and in case anyone else is wondering how those numbers relate to the constant in the code, it's

  2^37 / (2938661835 + 2^32)


It's been fixed.


Is there a reason compilers don't just use modular inverses and let overflow take care of things?

19 * 678,152,731 = 1 (mod 2^32)

19 * 9,708,812,670,373,448,219 = 1 (mod 2^64)

As long as we already know the size of an int, Euclid's algorithm should be able to find these inverses if they exist.
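
(They exist exactly when the divisor is odd.) Besides Euclid, a compact way to compute them for power-of-two moduli is Newton's iteration; here is a sketch (my own illustration, not what any particular compiler does):

    #include <stdint.h>

    /* Inverse of an odd d modulo 2^32. x = d is already correct to
       3 bits, since d*d == 1 (mod 8) for odd d; each x *= 2 - d*x
       step doubles the precision: 3 -> 6 -> 12 -> 24 -> 48 bits. */
    uint32_t inverse_mod_2_32(uint32_t d) {
        uint32_t x = d;
        for (int i = 0; i < 4; ++i)
            x *= 2 - d * x;
        return x;  /* inverse_mod_2_32(19) == 678152731 */
    }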


That only works when the value is known to be a multiple of whatever you're dividing by, and they do use that when it is the case, such as for pointer subtraction.
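
A sketch of that exact-division case (illustrative name; the precondition is what makes it valid):

    #include <assert.h>
    #include <stdint.h>

    /* Exact division by 19 via the modular inverse. Correct only when
       19 divides x, e.g. when x is a byte difference between pointers
       into an array of 19-byte elements. */
    uint32_t exact_div19(uint32_t x) {
        assert(x % 19 == 0);
        return x * 678152731u;  /* 19 * 678152731 == 1 (mod 2^32) */
    }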


I don't see how finding the modular inverse helps with division.

What algorithm do you have in mind?


It could also be a neat way to work around the Intel Pentium's FDIV bug [1].

[1] https://en.wikipedia.org/wiki/Pentium_FDIV_bug


I'm not sure I follow your line of thought, given that the post is talking about integer division and FDIV is floating-point division.


Well, you can build a floating-point DIV out of an integer DIV, I suppose. Roughly speaking, you divide the (integer) mantissas, subtract the exponents, and normalize.


Cool. Time to dust off that old PC in the attic.


Because people are totally going to be concerned about working around a bug that was fixed over 20 years ago.



