I’ve been experimenting a bit with parallel programming, using a bunch of different interfaces – MPI, PVM, OpenMP, POSIX threads, parallel Haskell, Occam-π, and most recently OpenCL. I’ve also been looking at a few others, including XC and Spin. Of them all, OpenCL is by far the most promising when it comes to number crunching, for one simple reason – GPUs. It also has the advantages of being vendor neutral, C based, and openly published. The main downside would seem to be a lack of implementations, but that is changing rapidly. OpenCL doesn’t by itself cover distribution over hosts (although nothing in the API prevents it), but it can be combined with MPI or PVM, which do. If you only need CPU support, though, it’s likely easier to use OpenMP, as it’s a more direct extension of C – and an OpenMP program that sticks to the pragmas reduces, without modification, to a single-threaded one when compiled without OpenMP support.
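To illustrate the OpenMP point, here is a minimal sketch of a loop that builds either way. The function name `sum_squares` is made up for this example. With a compiler flag such as GCC’s `-fopenmp` the loop is shared between threads; without it the pragma is simply ignored and the same source runs as an ordinary serial loop.

```c
#include <stddef.h>

/* Hypothetical helper: sum of the squares of the first n elements of v.
 * The pragma only takes effect when OpenMP is enabled at compile time;
 * otherwise this is a plain single-threaded loop. */
double sum_squares(const double *v, size_t n)
{
    double total = 0.0;
    long i; /* signed loop index, as older OpenMP versions require */

    #pragma omp parallel for reduction(+:total)
    for (i = 0; i < (long)n; i++) {
        total += v[i] * v[i];
    }
    return total;
}
```

Calling `sum_squares(v, 4)` on an array holding 1, 2, 3, 4 gives 30 in both builds. For general floating point input the threaded build may add the terms in a different order, so the last digits can differ slightly.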
As for implementations, there are three big ones out for public use right now – Apple (in Mac OS X 10.6), AMD/ATI Stream, and nVidia (via CUDA). There’s mention of some others, of which the Gallium one interests me most, as I am a free software enthusiast. The reason I’m writing this post is that I’ve finally been able to use nVidia’s implementation.
When I first looked into OpenCL, it was primarily to avoid the proprietary CUDA. I found nVidia did have OpenCL code in their GPU Computing SDK, but to my dismay, it was specific to an old driver that was known to be buggy. I picked it up again because the most recent nVidia driver beta – 195.36.15 – contained new OpenCL libraries. With a bit of fiddling, this version actually works on both of my computers that have a modern enough graphics card. There was just one snag while testing: OpenCL contexts must be created with a CL_CONTEXT_PLATFORM property. Not a big deal, as I can just extract the platform from whatever device I find.
Here’s my simple OpenCL Hello World. It’s an excellent example of the sort of task you don’t leave for the GPU to do, as the dataset is ridiculously small and the code is full of conditionals while doing very little actual processing. However, it does work, and it has no extra dependencies. For some reason, an example without extra dependencies was the one thing I couldn’t find when looking around. If you’re going to use OpenCL seriously, I suggest you check for errors and use something that can display them, for instance CLCC.