GPUCC – An Open-Source GPGPU Compiler (research.google.com)
195 points by haberman on April 25, 2016 | 53 comments


I don't know much about this (it's not my area of expertise), but I thought this G+ post was interesting: https://plus.google.com/u/0/+VincentVanhoucke/posts/6RQmgqcm...

It says that much of the reason TensorFlow initially lagged in performance is that many of the performance problems only manifested under NVCC, whereas Google had been using GPUCC internally.


This is part of llvm trunk (upcoming 3.9 release) now: http://llvm.org/docs/CompileCudaWithLLVM.html


Thanks for the link! Pretty exciting stuff.

Can anyone comment on the following quote:

The list below shows some of the more important optimizations for GPUs... A few of them have not been upstreamed due to lack of a customizable target-independent optimization pipeline.

So the LLVM version of gpucc will be incomplete? Will there be a release of the original stand-alone gpucc?


Thanks for your interest, and hope you like it!

Yes, it is currently incomplete, but I'd say at least 80% of the optimizations are upstreamed already. Also, folks in the LLVM community are actively working on that. For example, Justin Lebar recently pushed http://reviews.llvm.org/D18626 that added the speculative execution pass to -O3.

Regarding performance, one thing worth noting is that missing one optimization does not necessarily cause significant slowdown on the benchmarks you care about. For example, the memory-space alias analysis only noticeably affects one benchmark in the Rodinia benchmark suite.

Regarding your second question, the short answer is no. The Clang/LLVM version uses a different architecture (as mentioned in http://wujingyue.com/docs/gpucc-talk.pdf) from the internal version. The LLVM version offers better functionality and compilation time, and is much easier to maintain and improve in the future. It would cost even more effort to upstream the internal version than to make all optimizations work with the new architecture.


In fact I think at the moment almost everything, other than the memory-space alias analysis and a few pass tuning tweaks, is in. I know the former will be difficult to land, and I suspect the latter may be as well.

I don't have a lot of benchmarks at the moment, so I can't say how important they are. And it of course depends on what you're doing.

clang/llvm's CUDA implementation shares most of the backend with gpucc, but it's an entirely new front-end. The front-end works for tensorflow, eigen, and thrust, but I suspect if you try hard enough you'll be able to find something nvcc accepts that we can't compile. At the moment we're pretty focused on making it work well for Tensorflow.


Thanks for the clarification! It's always a pleasure to get a direct response from the first author on something as awesome as this.

I'm definitely subscribing to the llvm-dev list[1] in case any discussion on this continues there. There are also the llvm-commits, clang-dev, and clang-commits lists, but llvm-dev seems like the right place for this.

Gpucc in LLVM is definitely a breath of fresh air for all of us nvcc users. Getting to see some compiler internals for CUDA feels like Christmas. A big thanks from me for all the upstreaming effort!

1: http://lists.llvm.org/mailman/listinfo/llvm-dev


Looking forward to a CUDA Fortran frontend for this. Does it exist already?


No idea, but I do know that the PGI group has had a working CUDA Fortran compiler since 2013:

http://www.pgroup.com/doc/pgicudaforug.pdf

One could take one's Fortran code and simply recompile it with their compiler to run on Nvidia GPUs. The compiler would perform automatic parallelization. Wild stuff.


I'm aware of that; it's the main GPU compiler I'm using currently. But I have to say, PGI has only limited resources, and it would be very cool if there were a second player in town, especially one of the big five.

Btw. I'm working on something that's geared towards pretty much exactly what you're talking about. My stretch goal is fully automatic GPU parallelization for data parallel Fortran code [1].

[1] https://github.com/muellermichel/Hybrid-Fortran


If only it didn't still need the proprietary CUDA SDK.


That is a very valid concern and a key motivation for the proposed StreamExecutor project (http://lists.llvm.org/pipermail/llvm-dev/2016-March/096576.h...).


I see Eli Bendersky's name on this; his site ( http://eli.thegreenplace.net/ ) has a number of interesting C++ articles, some of which I've even carefully printed out and taped into my notebook of really useful things. If you're a C++ programmer, there are a lot of useful reads on there.

I don't see anything specifically about this in the archives, but maybe that's something to look forwards to.


One wonders why they didn't invest that effort in making an awesome OpenCL 2.1 compiler instead.


I'm looking at building a GPGPU program.

When I look at CUDA code, it seems to be standard C code: a kernel loop targeting GPU memory, memory allocated with familiar-looking functions, and simple qualifiers specifying where code lives.
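For concreteness, a complete CUDA vector add in roughly that style is only a couple of dozen lines. This is an illustrative sketch of our own (the name vec_add and the sizes are ours; error checking is omitted), not code from the thread:

    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    // The kernel: ordinary C-like code, one thread per element.
    __global__ void vec_add(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n);

        // Device memory is allocated and copied with a handful of runtime calls.
        float *da, *db, *dc;
        cudaMalloc((void **)&da, bytes);
        cudaMalloc((void **)&db, bytes);
        cudaMalloc((void **)&dc, bytes);
        cudaMemcpy(da, ha.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, hb.data(), bytes, cudaMemcpyHostToDevice);

        // Launch enough 256-thread blocks to cover all n elements.
        vec_add<<<(n + 255) / 256, 256>>>(da, db, dc, n);

        cudaMemcpy(hc.data(), dc, bytes, cudaMemcpyDeviceToHost);
        std::printf("c[0] = %f\n", hc[0]);  // expect 3.0

        cudaFree(da); cudaFree(db); cudaFree(dc);
        return 0;
    }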

When I look at OpenCL, it is... I don't know what it is. I haven't figured it out after considerable scanning. And that has cemented my decision to avoid it, because I don't have infinite time to scan obscurity.

For example, here is a standard "first OpenCL program" (linked below): ~200 lines of boilerplate and no simple example of many cores working together to do something brutally simple and useful, like adding two vectors. Just "hello world" from the GPU.

As far as I can tell, as the product of a multitude of vendors, all of which have different stuff, OpenCL is a monstrosity where a wide variety of functionalities is supported but none of them is guaranteed to be present - hence the 200 lines of boilerplate. Kind of like the umpteen Unix flavors back in the day: "open standards" that bridge only semi-compatible hardware have generally been doomed, eventually discarded in favor of a single best approach that all vendors are forced to adopt.

So it seems like the best thing is jettisoning the monstrosity and cloning CUDA for other hardware.

https://www.fixstars.com/en/opencl/book/OpenCLProgrammingBoo...


This was precisely my conclusion when getting into GPGPU programming two months ago. CUDA maps directly onto Nvidia hardware and its execution pipeline, leading to very tight expression of parallel algorithms. OpenCL, in attempting to map onto not just the GPUs of multiple vendors but DSPs and FPGAs, struck me as awash in code for navigating their architectural differences. So I'm developing in CUDA…


I don't completely understand the inclination to evaluate a technical stack by the brevity of its hello world.

Use the OpenCL C++ wrapper if brevity is important to you and using C++ is an option. The hello world example here is noticeably shorter:

http://simpleopencl.blogspot.com/2013/06/tutorial-simple-sta...


That is a much more useful example program, thank you.

The problem is that the "canonical example" pretty much remains what I showed. And what's bad about that example isn't simply its length but the way the creation and manipulation of kernels and threads remains entirely opaque (in contrast to your example, I think).


CUDA is a language that specifically targets NVIDIA GPUs. OpenCL is a general-purpose framework for heterogeneous compute tasks. You could be running against an FPGA, a DSP, or even a manycore CPU like Knights Landing. It's definitely tied less tightly to the hardware, and that's really my main objection to it too. But from AMD's perspective it's an abstract framework, and they are providing stub functions for others to implement.

What you are looking at is largely just code to compile a kernel and launch it on the device, basically a makefile in C. You can really just ignore most of the boilerplate until you need to modify it.

"Hello World" is not a particularly meaningful program on a device with no console or LEDs to blink - why don't you try something simple like summing every number from 1 to N? Or, a Monte-Carlo simulator to calculate Pi? Easy to code, easy to verify...

CUDA is also quite verbose if you are playing by all the rules. You are still supposed to specify your target device, verify your capabilities, free your buffers, check your async return values, etc - and if you don't then you will get weird behavior down the road until you figure out what's going on. OpenCL might try to helpfully ignore misbehavior too, I don't know for sure.

Personally I find both CUDA and OpenCL to be quite verbose, so when I write for them I use a library like Thrust or Bolt, which takes most of the boilerplate out of it and feels much like writing C++ STL code. It lets you work efficiently at a much higher level and then come back after the fact to optimize the parts that are actually slowing you down, and it provides things like automatic occupancy tuning. You can trivially switch between Thrust and native kernels using raw_pointer_cast, so it's great for the sorts, scans, etc. that "glue" things together. Use it like you would Python, bearing in mind that round trips to host memory are slow (but perhaps not fatally so!)

http://docs.nvidia.com/cuda/thrust/

http://developer.amd.com/tools-and-sdks/opencl-zone/bolt-c-t...
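As an illustration of that workflow (our own sketch, not the commenter's code): the STL-style Thrust calls do the bulk of the work, and thrust::raw_pointer_cast hands a container's storage to a hand-written kernel (here a hypothetical negate kernel) when you want to drop down a level.

    #include <thrust/device_vector.h>
    #include <thrust/device_ptr.h>
    #include <thrust/transform.h>
    #include <thrust/sort.h>
    #include <thrust/functional.h>
    #include <cstdio>

    // A trivial hand-written kernel, to show mixing Thrust containers with raw CUDA.
    __global__ void negate(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = -x[i];
    }

    int main() {
        const int n = 1 << 20;
        thrust::device_vector<float> a(n, 1.0f), b(n, 2.0f), c(n);

        // STL-style: c = a + b elementwise, then sort c, all on the device.
        thrust::transform(a.begin(), a.end(), b.begin(), c.begin(), thrust::plus<float>());
        thrust::sort(c.begin(), c.end());

        // Drop down to a native kernel via raw_pointer_cast when needed.
        negate<<<(n + 255) / 256, 256>>>(thrust::raw_pointer_cast(c.data()), n);

        std::printf("c[0] = %f\n", (float)c[0]);  // one element's round trip to the host
        return 0;
    }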

One useful trick is to write your CUDA functions inside a Thrust functor that iterates over your data elements. You use a counting_iterator, which represents the index of the data element (or grid processing element) in question, and write a functor that uses the counter value to load a data element from a pointer stored inside the functor and does some work on it. This gives you a "kernel simulator" as an intermediate step between array-wide functional programming and native __device__ functions.

This also provides a great place for indirection so you can easily swap between a GPU backend and Thrust's OpenMP CPU backend.
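A minimal sketch of that trick, under our own naming (the saxpy_at functor is hypothetical, not from the thread): thrust::counting_iterator supplies the element index and thrust::for_each plays the role of the kernel launch.

    #include <thrust/device_vector.h>
    #include <thrust/device_ptr.h>
    #include <thrust/for_each.h>
    #include <thrust/iterator/counting_iterator.h>
    #include <thrust/execution_policy.h>
    #include <cstdio>

    // The functor is the "kernel": it receives an element index from the
    // counting_iterator and works on raw pointers captured inside it.
    struct saxpy_at {
        float a;
        const float *x;
        float *y;
        __host__ __device__ void operator()(int i) const { y[i] = a * x[i] + y[i]; }
    };

    int main() {
        const int n = 1 << 20;
        thrust::device_vector<float> x(n, 1.0f), y(n, 2.0f);

        saxpy_at op = { 3.0f,
                        thrust::raw_pointer_cast(x.data()),
                        thrust::raw_pointer_cast(y.data()) };

        // Iterate the indices 0..n-1; Thrust picks the launch configuration.
        thrust::for_each(thrust::device,
                         thrust::counting_iterator<int>(0),
                         thrust::counting_iterator<int>(n),
                         op);

        std::printf("y[0] = %f\n", (float)y[0]);  // expect 5.0
        return 0;
    }

Since nothing here is a raw __global__ kernel, rebuilding against Thrust's OpenMP backend (e.g. -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP) should, in principle, run the same functor on the CPU, which is the indirection being described.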


"Hello World" is not a particularly meaningful program on a device with no console or LEDs to blink - why don't you try something simple like summing every number from 1 to N? Easy to code, easy to verify...

I'm not trying to run "hello world"; I'm looking at the easily accessible OpenCL code and seeing "hello world" as the most common example program.


OpenCL is very low level (sort of like Vulkan, which was actually inspired by it). Start with some easy wrappers, like PyOpenCL.


Except I'm a C/C++ programmer.

Any "easy wrappers" for C?


There is an official Khronos C++ wrapper API that is significantly more dense than the underlying C API.


I was thinking of easy and usable rather than dense.

I think the Khronos wrapper is just a translation of the original API into C++, which shares the basic problem: a zillion options for a zillion distinct sorts of hardware, in contrast to CUDA, which has one main, functional, supported approach.


Unlike the C API, it has a lot of sane defaults.


No console? Why couldn't one use the built-in HDMI / DVI ports as the console output?


That's a problem of the training materials, not of the language itself.


I think they still need NVIDIA's libraries (cuDNN specifically) alongside this compiler, which AFAIK don't have good OpenCL equivalents yet.


GPUCC - An Open-Source GPGPU Compiler - A Preview: http://images.nvidia.com/events/sc15/SC5105-open-source-cuda...


Not a compiler guy but a GPU programmer. This is exciting! I attended one of the authors' lectures a while ago. Although at this point I assume gpucc would be super-optimized for deep learning (by which I mean dense matrix multiplication), this is very good for the community, so that people can work on various versions that focus on better general performance or on different feature sets for specific applications in the future.


So, uh, if it's an open-source GPGPU compiler, where's the source code?



The code will be submitted to Clang.


So where's the pull request?


Announcing that they'll throw a patchbomb at Clang at some indeterminate point in the future seems to satisfy neither the "you can get source now" nor the "this is developed in a participatory way" definitions of Open Source.


Except, we didn't. Instead, what's happened is a discussion was started on the clang and llvm mailing lists about the best way to upstream this stuff, and as those discussions have reached consensus, patches have started flowing.

See, for example, the StreamExecutor thread, etc.

Also, outside of that, they've been upstreaming the non-controversial smaller stuff that is part of this for many months now.

(Seriously, of all the companies you could complain about, you may want to look at Google's interactions with clang and llvm before throwing stones; we are actually one of the only folks who work completely upstream at all times.)


You're right, I'm not familiar with the culture of who best contributes to clang and llvm.

But I know some things about what words mean, and publishing a paper describing an open source project in March and not having any code available for download in April is just kind of weird, no?

It's good to talk about working out the precise mechanics of upstreaming code. But in an open source project, you'd expect to publish your fork so that other people can play an informed part in that conversation.


As a meta-note, your comments are coming across in a very hostile way, in case you didn't intend them that way.

In an open source project, the best expectation is to play by the rules of the existing project and try to integrate your changes in the way that works well with it. There's no One True Open Source way - there are a lot of projects, each with their own cultures.

What Google's done is take an internally developed thingy and transition it to LLVM. That's a pretty non-trivial effort for any company. I don't see why having it be open source has any requirement for a dump of the internal version. Earlier commenters already noted that a large fraction of the code is present in LLVM and has been streaming in for some time now, so why the hostility?

Collectively, I don't think "our" (the wider community) goal is necessarily to have a bunch of junk forks out there that can't be compiled or used. Working, thoughtfully contributed code is much more likely to be widely used and have a big impact, and that's a standard we should be happy if companies meet. LLVM isn't a quick hack project - it's a foundational bit of tech that millions of people depend on directly or indirectly, and that millions of people benefit from improvements to.


"But I know some things about what words mean, and publishing a paper describing an open source project in March and not having any code available for download in April is just kind of weird, no? "

This is a lot of assumptions. For starters, it's all pretty much part of llvm 3.9 already.

"It's good to talk about working out the precise mechanics of upstreaming code. But in an open source project, you'd expect to publish your fork so that other people can play an informed part in that conversation. "

Actually, no, you wouldn't, at least not in these communities. In fact, in most of the projects I've been in, that's the last thing I'd want or the community would want, because it encourages people to play with these weird hybrid versions, when that's not what anyone really wants. Instead, we'd want people to come to us with design ideas and use cases, not a fait accompli already written that they expect us to do something with.

I'm not sure what projects you are part of that work the opposite way, where it's "fork everything and whatever happens happens", but it's not the majority of projects I've belonged to, nor have I personally found it to be a very valuable mechanism. It makes people attached to their current implementations. When it comes to things like this, a lot of them are written the way they are not because it's a good design, but because they needed to get stuff done. Thus: discussing the use case and design, and providing example code for people to play with if they want, great. Dumping forks on the world: not a particularly useful thing in most cases except for boosting egos, IMHO (there are times it is useful, for sure, but those are mostly rare cases).

Look at the history of GCC and LLVM forks. Of those who published forks and started discussion about merging those forks as-is, the number of successful merges is near zero.

Of those who said "hey, we did some cool stuff, here are our use cases and our initial thoughts on design; we have code that follows this design we can show, but are otherwise happy to figure out the right design and build that", the number of successful merges is near 100%.

So that's what was done.


Just as a point of interest, is there any limitation to supporting CUDA on AMD hardware (were this to be compiled with the AMDGPU backend), aside from the obvious lack of libraries, etc.?


AMD's new Boltzmann initiative includes an LLVM-based compiler which has been posted online. I'm not sure what the plans are around an OpenCL frontend, but the backend should be there, so I think OpenCL support in LLVM for AMD GPUs could be a realistic goal.

http://gpuopen.com/compute-product/hcc-heterogeneous-compute...

https://github.com/RadeonOpenCompute/hcc


The Tensorflow code mentions "GCUDACC" in several places, and from the surrounding comments it seems to be targeted at OpenCL as well as CUDA. So it seems that this has been at least considered.


I suspect that this compiler is generating ptx and not true native binaries for nvidia's architectures. Nvidia's proprietary compiler stack is still heavily involved in the conversion of PTX IR to native binaries. Essentially, this isn't a full open source stack.


> I suspect that this compiler is generating ptx and not true native binaries for nvidia's architectures

It would take all of getting to page 2 of the article to confirm this instead of speculating...

OTOH, there is an intriguing footnote that

> We are also experimenting compiling [virtual ISA] PTX to [Nvidia's proprietary Shader ASSembler] SASS before program execution and embedding the SASS directly into the resultant binary

but the paper mentions in the conclusion that a SASS spec is not publicly available. It would be interesting for someone involved to comment more on that. Experiments on reverse engineering the compiled PTX results?

If implementing a replacement for nvcc gave these gains, I would imagine being able to control an offline version of the (normally JIT) compilation to SASS would also yield large benefits. It would likely be incredibly architecture dependent, but for the big machine learning projects that still might be worth the expense.


In addition to open source drivers, there has been work to reverse engineer the binary formats and write open source assemblers for recent versions of SASS (e.g., https://github.com/NervanaSystems/maxas).


Open source graphics driver implementations (nouveau) have been able to reverse engineer some of the encodings of Fermi, Kepler, Maxwell, etc. by fuzzing the SASS disassembler provided with the CUDA distribution.


> the paper mentions in the conclusion that a SASS spec is not publicly available. It would be interesting for someone involved to comment more on that. Experiments on reverse engineering the compiled PTX results?

I'm one of the people implementing this stuff in upstream clang/llvm.

I'm not aware of any experiments, internal or otherwise, where we've tried to compile to SASS. We believe that there would be some performance gains to be had, but it's a really big project, even if we had an ISA spec, which at the moment we do not.


You are right that gpucc still depends on NVIDIA's ptxas tool that translates PTX to native binaries. NVIDIA does not publish the specification of their native binaries. Besides that, it is fully open-source.


What are the target GPUs for this? Will it run only on NVIDIA cards? What about mobile GPUs?


I presume it will run everywhere CUDA is supported. Draw your own conclusions.


It currently generates NVIDIA's PTX only.


Clang crashed right away trying to compile some of my CUDA code, as in on the very first .cu file. Not a good start IMO.


Can this LLVM back end be used with Rust?




