> Maybe OpenMP and CUDA do more optimizations than I thought they did. Do they perform [...] range analysis to choose smaller data types?
CUDA compiler developer here. We definitely perform range analysis on values stored in registers. This is an important optimization.
Via scalar replacement of aggregates, we can sometimes also replace a struct with a set of scalars. Once we do that, we can again perform range analysis on those scalars.
We can't change structs that don't get SROA'ed because of limitations of the language. For one thing, essentially any memory we write to the GPU's global memory can be read from the CPU side, and it's basically impossible to tell what is and isn't read, so we have to keep the memory layout the same.
> We can't change structs that don't get SROA'ed because of limitations of the language. For one thing, essentially any memory we write to the GPU's global memory can be read from the CPU side, and it's basically impossible to tell what is and isn't read, so we have to keep the memory layout the same.
Which is precisely a language-semantics problem: there are plenty of languages where the compiler is free to change SoA to AoS (and back).
In fact, it's wildly common among high-performance Fortran compilers.