Before I get into the technical details, here's a YouTube video where you can see the OpenCL implementation of my detector in action:
Pretty neat, right? :) What you just saw was a detector based on the WaldBoost algorithm (a variant of AdaBoost) running on a GPU. Its input was a classifier trained to detect frontal faces (and an awesome video, of course).
If you know anything about boosting algorithms, you'll know that a strong classifier is usually composed of lots of weak classifiers, which are typically very simple and computationally inexpensive functions. In my case there are 1000 weak classifiers, each of which uses Local Binary Patterns to extract a feature from the input texture. Unfortunately, such a strong classifier is resolution dependent, so to detect objects of various sizes in the input image we need a pre-processing step.
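To make the "sum of weak classifiers" idea a bit more concrete, here's a minimal Python sketch. It assumes a plain 3x3 pixel LBP and weak classifiers that are lookup tables indexed by the LBP code; the real detector may well compare larger cells or store its classifiers differently, and the names `lbp_code` and `strong_response` are made up for this illustration.

```python
def lbp_code(img, x, y):
    """8-bit Local Binary Pattern of the 3x3 neighbourhood centred at (x, y).

    img is a greyscale image stored as a list of rows.
    """
    center = img[y][x]
    # Neighbours are visited clockwise, starting at the top-left pixel.
    offsets = [(-1, -1), (0, -1), (1, -1), (1, 0),
               (1, 1), (0, 1), (-1, 1), (-1, 0)]
    code = 0
    for bit, (dx, dy) in enumerate(offsets):
        if img[y + dy][x + dx] >= center:
            code |= 1 << bit
    return code


def strong_response(img, wx, wy, weak_classifiers):
    """Sum the weak responses for the window whose top-left corner is (wx, wy).

    weak_classifiers is a list of (fx, fy, table) triples (an assumed layout):
    (fx, fy) is the feature position inside the window and table maps each of
    the 256 possible LBP codes to a real-valued response.
    """
    total = 0.0
    for fx, fy, table in weak_classifiers:
        total += table[lbp_code(img, wx + fx, wy + fy)]
    return total
```

Each weak classifier here is just one table lookup, which is why a thousand of them per window is still affordable.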
During pre-processing we create a pyramid of images by gradually down-scaling the input (oh, and we don't need colors, so we also convert it to greyscale). The detector itself can still only detect faces at a resolution of 24x24, but a mapping function tells us where a detection in any of the downscaled versions lies in the original image, and there we have a resolution-independent detector. Interesting tidbit: it turned out that creating the pyramid texture by putting the downscaled images horizontally instead of vertically (which you can see in the image below) slightly improved the detector's performance, simply because the texture cache unit had a higher hit ratio in that setup. The catch is that the long side of the pyramid texture is approximately 3.6 times the corresponding side of the original image, and the maximum texture size for an OpenCL image is 4096 pixels. With the horizontal layout the detector therefore couldn't process HD (1280x720) or Full-HD (1920x1080) videos, since 1280 x 3.6 is already about 4600. With the vertical layout, 1080 x 3.6 ~= 3900, so even Full-HD videos can be processed.
|Left - original image, right - pyramid of downscaled images (real pyramid texture also has the original on top)|
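The layout arithmetic is easy to play with. The sketch below is only an illustration: the down-scaling `step` of 1.5 is a hypothetical value (the detector's real step isn't stated here), and `pyramid_levels`, `layout` and `fits` are names invented for the example. Only the 24-pixel window and the 4096-pixel limit come from the text above.

```python
WINDOW = 24          # classifier window size, from the post
MAX_TEXTURE = 4096   # texture size limit quoted in the post


def pyramid_levels(width, height, step=1.5):
    """Return the (w, h) of every pyramid level, original size first.

    step is a hypothetical down-scaling factor, not the detector's real one.
    Levels smaller than the classifier window are not generated.
    """
    levels = []
    w, h = width, height
    while w >= WINDOW and h >= WINDOW:
        levels.append((w, h))
        w, h = int(w / step), int(h / step)
    return levels


def layout(levels, vertical=True):
    """Place the levels one after another.

    Returns (texture_w, texture_h, offsets), where offsets[i] is the
    top-left corner of level i inside the pyramid texture.
    """
    offsets = []
    pos = 0
    for w, h in levels:
        offsets.append((0, pos) if vertical else (pos, 0))
        pos += h if vertical else w
    if vertical:
        return levels[0][0], pos, offsets
    return pos, levels[0][1], offsets


def fits(texture_w, texture_h):
    """True if the pyramid texture respects the size limit on both sides."""
    return texture_w <= MAX_TEXTURE and texture_h <= MAX_TEXTURE
```

With this made-up step, a 1920x1080 frame fits when the levels are stacked vertically but not when they are laid out horizontally, which matches the trade-off described above.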
Once we have our pyramid image, it's divided into small blocks, which are processed by the GPU cores. Each work item (or thread, if you wish) in a block evaluates the strong classifier at one particular window position of the pyramid image, so overall we evaluate every window position - think of every pixel. In reality it's more complicated than that: the detector uses multiple kernels and each evaluates only a part of the strong classifier. That's because WaldBoost can reject a window early, without evaluating all the weak classifiers. When a kernel finishes, the set of candidate windows shrinks, and the next kernel continues with only the windows that survived the previous steps. This also ensures that we keep most of the work items in the work groups busy.
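Here's a rough CPU-side sketch of that multi-kernel idea, written sequentially in Python rather than as OpenCL kernels. Everything in it is an assumption for illustration: the `weak_response` helper, the per-classifier rejection thresholds and the way stages are represented are not taken from the actual code.

```python
def run_stage(candidates, weak_slice, weak_response):
    """Evaluate one slice of the strong classifier on the surviving windows.

    candidates:    list of (window, partial_sum) pairs
    weak_slice:    list of (weak_id, reject_threshold) pairs for this stage
    weak_response: callable(weak_id, window) -> float (hypothetical helper)
    """
    survivors = []
    for window, total in candidates:
        rejected = False
        for weak_id, reject_threshold in weak_slice:
            total += weak_response(weak_id, window)
            if total < reject_threshold:
                # Early rejection: skip the remaining weak classifiers.
                rejected = True
                break
        if not rejected:
            survivors.append((window, total))
    # Only survivors are returned, so the next stage works on a compact list.
    return survivors


def detect(windows, stages, weak_response):
    """Run all stages in order, shrinking the candidate list after each one."""
    candidates = [(w, 0.0) for w in windows]
    for weak_slice in stages:
        candidates = run_stage(candidates, weak_slice, weak_response)
    return candidates
```

On a GPU, each call to `run_stage` would correspond to one kernel launch over the compacted candidate list, so work items aren't wasted on windows that were already thrown away.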
Once the detector finishes, we have a handful of window positions in the pyramid image, along with the response value of the strong classifier in each of them, and these are sent back to the host. The CPU can then finish the detection (by simply thresholding the response values) and map the coordinates back to the input image. If you watched the video carefully, you'll have noticed that there are multiple positive responses around a face, so this would also be a good place to do some post-processing and merge them. There's also a false detection from time to time, so again it's a good place to get rid of those.
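For the curious, the host-side part could look roughly like the Python sketch below. It assumes a vertically stacked pyramid described by hypothetical `(y_offset, scale)` pairs, and it uses greedy non-maximum suppression as one simple way to collapse overlapping responses; that's a technique I'm swapping in for illustration, not what the detector currently does (it doesn't merge anything yet).

```python
WINDOW = 24  # classifier window size, from the post


def map_to_input(x, y, levels):
    """Map a window position in the pyramid to a square box in the input frame.

    levels: list of (y_offset, scale) pairs in increasing y_offset order,
    hypothetical bookkeeping from pre-processing; scale says how many input
    pixels one pixel of that level covers.
    """
    for y_offset, scale in reversed(levels):
        if y >= y_offset:
            return (x * scale, (y - y_offset) * scale, WINDOW * scale)
    raise ValueError("position lies outside the pyramid")


def overlap(a, b):
    """Intersection-over-union of two square boxes given as (x, y, size)."""
    ax, ay, a_size = a
    bx, by, b_size = b
    iw = min(ax + a_size, bx + b_size) - max(ax, bx)
    ih = min(ay + a_size, by + b_size) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (a_size * a_size + b_size * b_size - inter)


def postprocess(responses, levels, threshold, min_overlap=0.5):
    """Threshold, map back and de-duplicate the (x, y, response) triples."""
    boxes = [(map_to_input(x, y, levels), r)
             for x, y, r in responses if r >= threshold]
    # Strongest responses first, so they win over weaker overlapping ones.
    boxes.sort(key=lambda item: item[1], reverse=True)
    kept = []
    for box, r in boxes:
        if all(overlap(box, other) < min_overlap for other, _ in kept):
            kept.append((box, r))
    return kept
```

Raising `threshold` is the crude way to get rid of false detections, at the cost of missing weaker faces.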
You're surely asking how this compares to a pure CPU implementation. As you can imagine, having to evaluate every window position in the pyramid image is very costly, and even optimized SSE implementations can't get close to the performance of a GPU (even though you need to copy a lot of data between the host and the GPU). So here's a simple graph to answer that (note the logarithmic scale):
|Processed video frames per second (CPU: Core2 Duo E8200 @ 2.66GHz; GPU: GeForce GTX 285 - driver ver 270)|
Grab the code branch from Launchpad (`bzr branch lp:~mhr3/+junk/ocl-detector`), or get the tarball (the only dependencies are glib and opencv, plus libOpenCL.so somewhere the linker can find it). Run it with `./oclDetector -s CAM` (and if that doesn't seem to detect anything, try `./oclDetector -r -20 -s CAM`).