Here's a trick I came across recently which I found quite neat: you can test for divisibility by 3 using (x * 0xaaaaaaab) < 0x55555556. Same concept works for 5 and 15, but sadly not other factors.
LLVM does not currently generate that code, but you can make a case it should.
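In C, a minimal sketch of the check (assuming uint32_t wrapping arithmetic; 0xaaaaaaab is the inverse of 3 mod 2^32 and 0x55555556 = ceil(2^32 / 3)):

#include <stdint.h>

/* x is divisible by 3 iff the wrapped product lands below ceil(2^32 / 3). */
int divisible_by_3(uint32_t x) {
    return x * 0xaaaaaaabu < 0x55555556u;
}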
Duh, you're right on both counts. It should work for any odd divisor, I was just computing the inverse in a dumb way when I tried out other divisors. And I should have stated u32 wrapping arithmetic.
And for even divisors you can split the divisor into a power of two and an odd part. The odd part is checked as before, and the power of two is checked with an AND mask.
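For example, a sketch for 12 = 4 * 3 (the function name is just for illustration):

#include <stdint.h>

int divisible_by_12(uint32_t x) {
    return (x & 3u) == 0                    /* divisible by 4 */
        && x * 0xaaaaaaabu < 0x55555556u;   /* divisible by 3 */
}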
Ah, it took me some time to understand why this works:
Every uint32 k can be expressed as k = 3*N mod 2^32 for a unique N, since 3 is odd and therefore invertible mod 2^32. We can recover that N by multiplying k by the modular inverse. k is divisible by 3 if and only if the multiplication 3*N doesn't "wrap around".
So: multiply by the modular inverse to get N, then check that N is small enough that 3*N < 2^32, or equivalently that N < ceil(2^32 / 3) = 0x55555556.
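And the inverse itself is cheap to compute. A sketch using Newton's iteration (all arithmetic wraps mod 2^32; each step doubles the number of correct low bits, 3 -> 6 -> 12 -> 24 -> 48):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

uint32_t modinv32(uint32_t d) {   /* d must be odd */
    uint32_t x = d;               /* d*d == 1 (mod 8), so 3 bits correct */
    for (int i = 0; i < 4; ++i)
        x *= 2 - d * x;
    return x;                     /* d*x == 1 (mod 2^32) */
}

int main(void) {
    printf("0x%08" PRIx32 "\n", modinv32(3));  /* prints 0xaaaaaaab */
}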
From what I understand, inlining is the key optimization that compilers make to speed up code. Once inlined, the excess computations of a function body can be removed or shortened, based on the calling context. With that in mind, how can the decompiler identify such "magic multiplication" division given that the compiler might mutate the division code at the callsite?
My rule of thumb is: If an optimization reduces memory accesses, it is important today. Register allocation fits that pattern, because the alternative is to spill values to memory.
Disclaimer: That rule is restricted to fast processors, not low-power embedded stuff.
Inlining is not that important for speed itself. Rather, it is an optimization booster, which makes many other optimizations much more effective. If your other optimizations are crap, then inlining will not help much, because the call overhead is not that big.
Registers are also the nodes in the DAG of evaluation. Superscalar execution relies on having multiple edges active over time. Thus, good register allocation avoids bottlenecks and keeps the evaluation DAG from degenerating into a series of linked lists.
Expressions are DAGs after common subexpression elimination. DAGs are what you get when you symbolically evaluate expressions on a stack machine without backward edges in control flow. Leave out the backward edges and the phi nodes they create, and you get DAGs from SSA form as well (but sequencing has information that you lose).
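A tiny illustration: after CSE, the two uses of a + b share a single node, so the expression tree becomes a DAG:

int f(int a, int b) {
    int t = a + b;   /* one shared node */
    return t * t;    /* two edges into the same node */
}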
> IMPORTANT: Useful feedback revealed that some of these measures are seriously flawed. A major update is on the way.
Looking over the results, some of the numbers are off.
On Intel CPUs, FP multiplication is faster than integer division. That might not be true on ARM CPUs, which generally have slower FPUs.
On Skylake, for example, 32-bit unsigned integer division has a 26 cycle latency with a throughput of 1 instruction / 6 cycles, while 32/64-bit floating point multiplication has a 4 cycle latency with a throughput of 2 instructions / cycle.
The crucial point, however, is that while FP multiplication is faster than integer division, converting between a floating point and an integer is very slow: cvtsi2ss and cvtss2si have a latency of 6 cycles each. This adds a latency of 12 cycles for each of these multiplications.
For divisions by a constant value that don't easily decompose into shifts you can fall back to multiplication by a magic constant which is the integer reciprocal. (This is also something compilers do and is what's being explained in the article.)
Multiplying by the integer reciprocal only works if the dividend is an integer multiple of the divisor.
What's being explained in the article is multiplying by a fraction whose value is close to the rational reciprocal of the divisor, and whose denominator is an integer power of two (so dividing by the denominator can be done with a shift).
The fraction in this case is (2938661835 + 2^32) / 2^37.
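Concretely, here is a sketch of the kind of sequence a compiler emits for n / 19 (the full multiplier 2938661835 + 2^32 is 33 bits, so the high part of the product is combined with n via an overflow-free average before the final shift):

#include <stdint.h>

uint32_t div19(uint32_t n) {
    /* t = high 32 bits of n * 2938661835 (0xAF286BCB) */
    uint32_t t = (uint32_t)(((uint64_t)n * 0xAF286BCBu) >> 32);
    /* ((n - t) >> 1) + t == floor((n + t) / 2) without overflow;
       the final >> 4 completes the division by 2^37 / 2^32 = 2^5. */
    return (((n - t) >> 1) + t) >> 4;
}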
#include <stdio.h>

int main() {
    volatile unsigned x;
    for (unsigned n = 0; n < 100000000; ++n) {
#if 1 /* Change to 0 to use FPU. */
        /*
           Compile with -Os to get GCC to emit div instruction.
           -O2 to emit integer multiply.
           Clang emits integer multiply, even with -Os.
        */
        x = n / 19;
#else
        /* Use the FPU. */
        x = (double)n * (1.0 / 19.0);
#endif
    }
}
It works for 32-bit unsigned integers and double precision floats.
For n < 19, "(double)n * (1.0 / 19.0)" evaluates to a double between 0.0 and 1.0, then it is truncated to 0 when it is implicitly converted to unsigned int.
Since there are only 2^32 values for 32-bit integers, it is possible to test all values in under a minute:
#include <stdio.h>
#include <stdint.h>

int main() {
    uint32_t n = 0;
    do {
        uint32_t a = n / 19;
        uint32_t b = (double)n * (1.0 / 19.0);
        if (a != b) {
            printf("Not equal for n = %u\n", n);
        }
        ++n;
    } while (n != 0);
}
Note that this approach, done naively, will produce wrong answers for lots of divisors, because the reciprocal will not be exactly representable. For example, 49 * (1.0 / 49) is less than one.
If you round the FP reciprocal up (to the next larger representable value, that is), it should be correct, for 32-bit integer types at least.
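A sketch of that fix, using nextafter to bump the reciprocal up by one ulp and re-running the exhaustive test (this assumes the claim above; I haven't proven it):

#include <math.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    const uint32_t d = 49;                     /* 49 * (1.0 / 49) < 1.0 */
    const double r = nextafter(1.0 / d, 2.0);  /* reciprocal, rounded up one ulp */
    uint32_t n = 0;
    do {
        if (n / d != (uint32_t)((double)n * r))
            printf("Not equal for n = %u\n", n);
        ++n;
    } while (n != 0);
}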
> On Skylake [...] 32/64-bit floating point multiplication has a 4 cycle latency with a throughput of 2 instructions / cycle.
Of course, there are some operations that are very expensive (trigonometric functions, for example), but they're not necessary here, and they're also very expensive on the GPU.
That only works when the value is known to be a multiple of whatever you're dividing by, and they do use that when it is the case, such as for pointer subtraction.
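A small illustration with a hypothetical 12-byte element type: the byte difference between two pointers into the same array is always an exact multiple of 12, so the compiler can divide by 12 with just a shift (for the power-of-two part) and a multiply by the modular inverse (for the odd part), no correction steps needed:

#include <stddef.h>

typedef struct { char bytes[12]; } elem;   /* sizeof(elem) == 12 */

ptrdiff_t count(const elem *hi, const elem *lo) {
    return hi - lo;   /* (byte difference) / 12, known to be exact */
}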
Well, you can build a floating point DIV out of an integer DIV, I suppose. Roughly speaking, you divide the (integer) mantissas, subtract the exponents, and normalize.
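A very rough sketch of that structure for positive, normal doubles (it leans on the GCC/Clang __int128 extension for headroom and doesn't get the final rounding right):

#include <math.h>
#include <stdint.h>

double fdiv_sketch(double a, double b) {
    int ea, eb;
    /* frexp gives a = m * 2^ea with m in [0.5, 1); scale m to a 53-bit integer */
    uint64_t ma = (uint64_t)(frexp(a, &ea) * 9007199254740992.0);  /* * 2^53 */
    uint64_t mb = (uint64_t)(frexp(b, &eb) * 9007199254740992.0);
    /* divide the integer mantissas, with 53 extra bits so the quotient
       keeps full precision */
    uint64_t q = (uint64_t)(((unsigned __int128)ma << 53) / mb);
    /* subtract the exponents; converting q back to double renormalizes it */
    return ldexp((double)q, ea - eb - 53);
}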
http://ridiculousfish.com/blog/posts/labor-of-division-episo...
http://ridiculousfish.com/blog/posts/labor-of-division-episo...
http://ridiculousfish.com/blog/posts/labor-of-division-episo...
The same author wrote a library for doing this (and more; it also uses bit shifts) at runtime: http://libdivide.com
Of course, that only makes sense if you know you will do lots of divisions by the same number.
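From memory, the C interface looks roughly like this (check libdivide.h for the real API):

#include <stddef.h>
#include <stdint.h>
#include "libdivide.h"

uint32_t sum_of_quotients(const uint32_t *v, size_t len, uint32_t d) {
    struct libdivide_u32_t div = libdivide_u32_gen(d);   /* pay the setup cost once */
    uint32_t sum = 0;
    for (size_t i = 0; i < len; ++i)
        sum += libdivide_u32_do(v[i], &div);  /* each division is now multiply + shifts */
    return sum;
}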