When I look at CUDA code, it seems to be a big loop targeting GPU memory with standard C code, allocating memory with malloc-style functions and specifying where code lives with simple qualifiers.
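(For concreteness, here's roughly what I mean - a minimal, untuned sketch of adding two vectors in CUDA, error checking omitted:)

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // __global__ marks code that lives on the device; everything else
    // is ordinary C. Each thread handles one element of the "loop".
    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main(void) {
        const int n = 1024;
        size_t bytes = n * sizeof(float);
        float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes),
              *hc = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) { ha[i] = i; hb[i] = 2.0f * i; }

        float *da, *db, *dc;  // device buffers, allocated malloc-style
        cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
        cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

        vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);  // launch the big loop

        cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
        printf("c[10] = %f\n", hc[10]);  // expect 30.0
        cudaFree(da); cudaFree(db); cudaFree(dc);
        free(ha); free(hb); free(hc);
        return 0;
    }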
When I look at OpenCL, it is... I don't know what it is. I haven't figured it out after considerable scanning. And that has cemented my decision to avoid it, because I don't have infinite time to scan obscurity.
For example, here is a standard "first OpenCL program" - ~200 lines of boilerplate and no simple example of the many cores working together to do something brutally simple and useful, like adding two vectors. Just "hello world" from the GPU.
As far as I can tell, as the product of a multitude of vendors, all of which have different stuff, OpenCL is a monstrosity where a wide variety of functionality is supported but none of it is guaranteed to be present - hence the 200 lines of boilerplate. Kind of like the umpteen Unix flavors back in the day. "Open standards" that bridge only semi-compatible hardware have generally been doomed efforts, discarded in favor of a single best approach that all vendors are forced to adopt.
So it seems like the best thing is jettisoning the monstrosity and cloning CUDA for other hardware.
This was precisely my conclusion when getting into GPGPU programming two months ago. CUDA maps directly onto Nvidia hardware and its execution pipeline, leading to very tight expression of parallel algorithms. OpenCL, in attempting to map onto not just the GPUs of multiple vendors but DSPs and FPGAs, struck me as awash in code for navigating their architectural differences. So I'm developing in CUDA…
That is a much more useful example program, thank you.
The problem is that the "canonical example" pretty much remains what I showed. And what's bad about that example isn't simply its length but the way the creation and manipulation of kernels and threads remains entirely opaque (in contrast to your example, I think).
CUDA is a language that specifically targets NVIDIA GPUs. OpenCL is a general-purpose framework for heterogeneous compute tasks. You could be running against an FPGA, a DSP, or even a manycore CPU like Knights Landing. It's definitely tied less tightly to the hardware, and that's really my main objection to it too. But from AMD's perspective it's an abstract framework, and they are providing stub functions for others to implement.
What you are looking at is largely just code to compile a kernel and launch it on the device, basically a makefile in C. You can really just ignore most of the boilerplate until you need to modify it.
"Hello World" is not a particularly meaningful program on a device with no console or LEDs to blink - why don't you try something simple like summing every number from 1 to N? Or, a Monte-Carlo simulator to calculate Pi? Easy to code, easy to verify...
CUDA is also quite verbose if you are playing by all the rules. You are still supposed to specify your target device, verify your capabilities, free your buffers, check your async return values, etc - and if you don't then you will get weird behavior down the road until you figure out what's going on. OpenCL might try to helpfully ignore misbehavior too, I don't know for sure.
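To illustrate what "playing by the rules" looks like, here's a sketch - the CUDA_CHECK macro is my own convention, not part of the API:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Wrap every runtime call so silent failures surface immediately.
    #define CUDA_CHECK(call)                                           \
        do {                                                           \
            cudaError_t err = (call);                                  \
            if (err != cudaSuccess) {                                  \
                fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                        cudaGetErrorString(err));                      \
                exit(EXIT_FAILURE);                                    \
            }                                                          \
        } while (0)

    int main(void) {
        int dev = 0;
        cudaDeviceProp prop;
        CUDA_CHECK(cudaSetDevice(dev));                  // specify target device
        CUDA_CHECK(cudaGetDeviceProperties(&prop, dev)); // verify capabilities
        printf("%s, compute %d.%d\n", prop.name, prop.major, prop.minor);

        float *buf;
        CUDA_CHECK(cudaMalloc(&buf, 1024 * sizeof(float)));
        // ... launch kernels, then check the async error state ...
        CUDA_CHECK(cudaGetLastError());
        CUDA_CHECK(cudaDeviceSynchronize());
        CUDA_CHECK(cudaFree(buf));                       // free your buffers
        return 0;
    }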
Personally I find both CUDA and OpenCL quite verbose, and when I write for them I use a library like Thrust or Bolt, which takes most of the boilerplate out of it and feels much like writing C++ STL code. It also lets you work efficiently at a much higher level and then come through after the fact and optimize the parts that are actually slowing you down, and it provides things like automatic occupancy tuning. You can trivially switch between Thrust and native kernels using raw_pointer_cast, so it's great for the sorts, scans, etc. that "glue" things together. Use it like you would Python, bearing in mind that round trips to host memory are slow (but perhaps not fatally so!)
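A sketch of the flavor, assuming CUDA's bundled Thrust headers:

    #include <thrust/device_vector.h>
    #include <thrust/sequence.h>
    #include <thrust/sort.h>
    #include <thrust/reduce.h>
    #include <thrust/functional.h>
    #include <cstdio>

    int main(void) {
        // STL-style container that lives in device memory
        thrust::device_vector<int> v(1 << 20);
        thrust::sequence(v.begin(), v.end());  // fill with 0, 1, 2, ...
        thrust::sort(v.begin(), v.end(), thrust::greater<int>());
        int total = thrust::reduce(v.begin(), v.end());

        // raw_pointer_cast hands the same buffer to a native kernel:
        int *raw = thrust::raw_pointer_cast(v.data());
        // myKernel<<<blocks, threads>>>(raw, v.size());
        (void)raw;

        printf("sum = %d\n", total);
        return 0;
    }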
One useful trick is that you can write your CUDA functions inside a Thrust functor which iterates over your data elements. You use a counting_iterator that represents the index of the data element (or grid processing element) in question, and write a functor that uses the counter value to load a data element from a pointer stored inside the functor and does some work on it. This gives you a "kernel simulator" as an intermediate step between array-wide functional programming and native __device__ functions.
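Roughly like this (my sketch; scale_op is just an illustrative name):

    #include <thrust/device_vector.h>
    #include <thrust/iterator/counting_iterator.h>
    #include <thrust/for_each.h>
    #include <cstdio>

    // The functor carries a raw device pointer; the counting_iterator
    // feeds it the element index, so operator() reads like a kernel body.
    struct scale_op {
        float *data;
        float factor;
        __host__ __device__ void operator()(int i) const {
            data[i] *= factor;  // load, work, store - one element per "thread"
        }
    };

    int main(void) {
        thrust::device_vector<float> v(1024, 2.0f);
        scale_op op{thrust::raw_pointer_cast(v.data()), 3.0f};
        thrust::for_each(thrust::counting_iterator<int>(0),
                         thrust::counting_iterator<int>((int)v.size()),
                         op);
        printf("v[0] = %f\n", (float)v[0]);  // expect 6.0
        return 0;
    }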
This also provides a great place for indirection, so you can easily swap between the GPU backend and Thrust's OpenMP CPU backend.
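For example (saxpy.cu is just a placeholder filename), Thrust's device backend can be retargeted at compile time without touching the source:

    # Same source, CPU backend: point Thrust's device system at OpenMP
    nvcc -O2 -Xcompiler -fopenmp \
         -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP \
         saxpy.cu -o saxpy_omp -lgomp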
"Hello World" is not a particularly meaningful program on a device with no console or LEDs to blink - why don't you try something simple like summing every number from 1 to N? Easy to code, easy to verify...
I'm not trying to run "hello world"; I'm looking at the easily accessible OpenCL code and seeing "hello world" as the most common example program.
I was thinking of easy and usable rather than dense.
I think the Khronos wrapper is just a translation of the original API into C++, which partakes of the basic problem - a zillion options for a zillion distinct sorts of hardware, in contrast to CUDA, with one main, functional, supported approach.