Red-eye Removal with CUDA

This code demonstrates basic steps for red-eye correction in pictures. It requires a picture of someone with red eyes and a small template file which is a picture of a red eye to help us find eyes in the original picture. While the sample works with the picture that I tested it with, it requires refinement to work with any image. The code also suffers when you try to pass large in size pictures. But that is a good start if you want to learn how red-eye correction is done. Below screenshot displays an output from the program.


Here are the original picture and the eye template:



And this is the image of the girl after running the code:


The flowchart diagram below shows the main steps in the code execution:


Main function (program entry point)

Solution consists of RedEyeFilter helper class to get the code running and a file containing CUDA kernels. To launch the program, we need to create the helper class and pass to it locations of the picture of a person with red eyes, name of the fixed picture after we remove redness, and the location of a template file which the code will use to identify pixels in the original picture that must be corrected.


To simplify writing the code we will add the following string to the Command Arguments under the project settings:


We also need to ensure that CUDA compute version is set to at least 2.0. It is necessary because we will be using some features that are not available to earlier CUDA versions.


And of interest to us are also post-build event where we will add script to copy media files into the output folder:


Once the project is configured, we can instantiate RedEyeFilter class and launch the code.


Load picture

First, we have to load both the image requiring correction and red-eye template.


We use OpenCV library to open the files and populate in-memory vector structures:


Here’s the code that opens the images and populates vector structure:


Process image

Once images are loaded and in the host memory, we can execute our CUDA kernels to do their magic. We need to:

1. Separate RGB channels;

2. Compute template mean for each channel;

3. Compute cross-correlation between the template and the picture;

4. Sort correlated values in the ascending order where read pixel positions will be at the top of the sorted list;

5. Replace color of the arbitrary number of top pixels in the red channel with an averaged color from blue and green channels.

Here’s the function that orchestrates all steps:


Separate channels

Before we run kernels, we have to copy data to the GPU. RedEyeData structure helps organize containers that will be loaded into the device memory.


To help separate channels, we will create a SplitChannels structure.


We first copy both picture and re-eye template to the device and then invoke thrust library transform function which will take image input for each pixel and copy individual channels to the red, green, and blue containers.


Compute template mean

For each channel we have to compute average color value. We use thrust library to help with that task: we total all color values and divide by number of pixels.


Compute cross correlation

That’s where important logic starts. First, we create a container to store correlated values which will be of type float. We need to compute cross correlation between each image pixel in every channel and the red-eye template box, then multiply all channel values to arrive to the final array of correlated values.

In signal processing, cross-correlation is a measure of similarity of two waveforms as a function of a time-lag applied to one of them. This is also known as a sliding dot product or sliding inner-product. It is commonly used for searching a long-signal for a shorter, known feature. It also has applications in pattern recognition, single particle analysis, electron tomographic averaging, cryptanalysis, and neurophysiology.


In the following screenshot we show instantiation of the kernel for red channel. Important is the size of the block: 256 pixels. My GPU card supports blocks of 1024 elements (threads) but that would not provide best performance and may not even work at all. Please read my article explaining GPU architecture to understand how to select optimal thread sizes per execution block.


Here’s a naïve, simplified cross-correlation kernel.



Once correlation values have been computed, we can multiply pixels from each channel. For that we will create another helper structure: CombineResponses.


Now we can pass the structure to thrust library function transform. In order to simplify sorting, we will remove all negative values from the array by adding to each value the smallest number in the array.


Sort cross-correlated pixels

We need to sort pixels in ascending order to identify those we need to replace. Here you should experiment with writing radix sort algorithm on the GPU. The basic idea behind radix sort is to construct a histogram on each pass of how many of each “digit” there are. Then we need to scan the histogram to tell us where to put the output of each digit. For example, the first 1 must come after all the 0s, so we have to know how many 0s there are.

1) Histogram of the number of occurrences of each digit

2) Exclusive Prefix Sum of Histogram

3) Determine relative offset of each digit

For example [0 0 1 1 0 0 1]

         -> [0 1 0 1 2 3 2]

4) Combine results of steps 2 & 3 to determine the final output location for each element and move it there.

LSB Radix sort is an out-of-place sort and you will need to ping-pong values between the input and output buffers. Final sorted results are placed into the output buffer.

Important, we are not interested in the correlated values, rather in their positions. That is why we are creating sortedCoords container which will later be used to remove red pixels from the original picture.


I’ll cheat here and use thrust library to sort correlated values and their coordinates. But first we need to cast floats to integers in such a way that we are not interested in the exact cast, but rather in the relative order of the values. So, we will do an imprecise and strange conversion by doing memcopy from float type to integer type resulting in strange values but fulfilling our condition.


Remove redness

We only need to replace small number of pixels in the original picture, so here’s where we do some magic. I mean, this code is something you would want to seriously improve if you want it to work for every image. Here, for this example, we’ll pic first 40 pixels and even though our template size is 17-pixel box, we’ll only use 9 of them on each side. As a result of the kernel call we’ll get back modified red channel after which we can combine all channels back into the final image.


Where CombineChannels looks like this:


Here’s the final kernel:


Save image copy

Now we can take the final image and pass it back to OpenCV for saving it to the disk.



Download code

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: