Note that the code for the article is available here.
In this example, for the sake of learning, we are going to butcher some great images. Let’s describe the problem first.
We are going to take some HDR images and modify their luminosity to lighten them. We could also darken the images, or apply changes only to parts of them, but for now we will make the entire image lighter.
HDR is short for High Dynamic Range. It is a post-processing method of taking either one image or a series of images, combining them, and adjusting the contrast ratios to do things that are virtually impossible with a single aperture and shutter speed.
If you are familiar with Photoshop, you can change the image mode from the normal 8 bits per channel to 32 bits. Experiment with it and see how you can make a color redder than red, for example:

See that 3.4 in the red channel below?! You would think that a color value could not be greater than 1, but by setting values greater than 1 you can achieve stunning effects in Photoshop.
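To see why values above 1.0 matter, here is a minimal sketch of what happens when such a value is forced back into a conventional 8-bit channel. The `to8bit` helper is hypothetical, not part of the article's code:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: convert one float HDR channel value to 8 bits.
// Anything above 1.0, such as the 3.4 red value mentioned above, is
// clipped to 255. This is exactly the information an HDR format like
// EXR preserves and an 8-bit format throws away.
std::uint8_t to8bit(float channel)
{
    float clamped = std::min(std::max(channel, 0.0f), 1.0f);
    return static_cast<std::uint8_t>(clamped * 255.0f + 0.5f);
}
```

Both 1.0 and 3.4 end up as the same 8-bit value of 255, so the "redder than red" information survives only in the 32-bit representation.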

Here are a couple of examples. First, a simple experiment; notice the glow:

And a much more complex design where the words ‘Carmina Burana’ are created using these brighter-than-the-visible-limit colors:

All right, this brings us to a new image format, EXR. I’ll just quote what it is from its website:
OpenEXR is a high dynamic-range (HDR) image file format developed by Industrial Light & Magic for use in computer imaging applications.
OpenEXR is used by ILM on all motion pictures currently in production. The first movies to employ OpenEXR were Harry Potter and the Sorcerer’s Stone, Men in Black II, Gangs of New York, and Signs. Since then, OpenEXR has become ILM’s main image file format.
Some amazing stuff!
Now, about butchering the images for the sake of learning… After we write the code we should be able to take an existing EXR image and change its luminosity. The image at the top is an original taken from the OpenEXR SDK, and the one at the bottom is the result of running our code on it:

Note that I had to convert the 32-bit images to 8 bits per channel in order to show them here, but you can still see the difference. What we have done is take the histogram of luminance values in the original image, remove all the extremes from it, and flatten it by distributing values evenly (or butcher it!):

That’s OK for learning; from here we can improve our code dramatically to modify only the regions of the image, or only the tones (bright, dark, or middle), that we want to!
Here’s an example where our inputs into the algorithm produced a much better result, transforming a very dark image into one with much more detail, although even here we probably overdid it:

References:
In addition, find as many tutorials as possible describing the following topics:
Parallel sum (prefix)
GPU thread communication
Parallel scan algorithms, for example Hillis-Steele or Blelloch
Parallel reduction algorithms
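Of these topics, the scan (prefix-sum) pattern is the one the CDF computation later in the article relies on. As a quick orientation, here is a sequential C++ simulation of the Hillis-Steele inclusive scan; on the GPU, each element of `next` would be computed by its own thread instead of by the inner loop:

```cpp
#include <vector>

// Sequential simulation of a Hillis-Steele inclusive scan.
// At each pass the stride doubles; after the final pass,
// data[i] holds the sum of elements 0..i.
std::vector<unsigned int> hillisSteeleScan(std::vector<unsigned int> data)
{
    for (std::size_t stride = 1; stride < data.size(); stride *= 2)
    {
        // Double-buffer: read from `data`, write to `next`,
        // just as a GPU kernel would ping-pong between two arrays.
        std::vector<unsigned int> next(data);
        for (std::size_t i = stride; i < data.size(); ++i)
            next[i] = data[i] + data[i - stride];
        data.swap(next);
    }
    return data;
}
```

For n elements this finishes in ceil(log2(n)) passes, which is why it maps well onto thousands of GPU threads even though it does more total additions than the serial loop.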
Important
A very important note: the GPU on my laptop supports 1024 threads per block, so I can schedule kernels 1024 threads wide:
```
Using device: 0
Name: Quadro 5000M
Compute version: 2.0
Global memory: 2048 mb
Const memory: 64 kb
L2 cache size: 512 kb
Clock rate: 810 mhz
Timeout enabled: true
Multiprocessors: 10
Max grid size: 65535 : 65535 : 65535
Max threads per SM: 1536
Max threads per block: 1024
Registers per block: 32768
Shared memory per block: 48 kb
Memory bus width: 256 bits
Memory clock rate: 1200 mhz
```
Your GPU may only support 512 threads per block. In that case you will have to modify the kernel invocation code by replacing the BLOCK_WIDTH variable below with separate variables for block width in the x and y dimensions. Commonly you would select 32 threads in x and 16 in y, for a total of 512:
```cpp
// Compute luminance histogram
static const int BLOCK_WIDTH = 32; // threads per block; because we are setting a
                                   // 2-dimensional block, the total number of
                                   // threads is 32^2, or 1024.
                                   // 1024 is the maximum number of threads per
                                   // block for modern GPUs.

int width  = static_cast<int>(ceilf(static_cast<float>(cols) / BLOCK_WIDTH));
int height = static_cast<int>(ceilf(static_cast<float>(rows) / BLOCK_WIDTH));

const dim3 grid (width, height, 1);            // number of blocks
const dim3 block(BLOCK_WIDTH, BLOCK_WIDTH, 1); // block width: number of threads per block

// Compute histogram
hdr_compute_histogram<<<grid, block>>>(
    histogram,
    luminance,
    lumMin,
    lumRange,
    static_cast<int>(rows),
    static_cast<int>(cols),
    bins);

hr = cudaDeviceSynchronize();
```
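For the 512-thread case, a sketch of the adjusted launch math might look like this. The names `BLOCK_WIDTH_X`, `BLOCK_WIDTH_Y`, and `blocksNeeded` are mine, invented for illustration:

```cpp
#include <cmath>

// Sketch of the 512-thread variant: separate block widths in x and y,
// 32 * 16 = 512 threads per block.
static const int BLOCK_WIDTH_X = 32;
static const int BLOCK_WIDTH_Y = 16;

// Number of blocks needed to cover `pixels` positions with
// `threadsPerDim` threads in that dimension (rounding up).
int blocksNeeded(int pixels, int threadsPerDim)
{
    return static_cast<int>(std::ceil(static_cast<float>(pixels) / threadsPerDim));
}

// The kernel launch configuration then becomes:
//   const dim3 grid (blocksNeeded(cols, BLOCK_WIDTH_X),
//                    blocksNeeded(rows, BLOCK_WIDTH_Y), 1);
//   const dim3 block(BLOCK_WIDTH_X, BLOCK_WIDTH_Y, 1);
```

The rounding up matters: an image whose width is not a multiple of the block width still needs one extra partial block, which is why the kernel must also bounds-check its pixel coordinates.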
Examples of calling the program and the time it took to execute:
```
C:\Projects\CUDA\My\HDR Tone Mapping\bin>hdrToneMap "Media\desk.exr" "Media\desk-tm.exr"
Loaded HDR image in 156003us (156ms)
rgb to xyY in 0us (0ms); gpu time: 5.12362ms
GPU luminance CDF in 15614us (15ms); gpu time: 5.53718ms
Post-Processed in 0us (0ms)
SaveImage in 389987us (389ms)
Done!

C:\Projects\CUDA\My\HDR Tone Mapping\bin>hdrToneMap "Media\GoldenGate.exr" "Media\GoldenGate-tm.exr"
Loaded HDR image in 140427us (140ms)
rgb to xyY in 0us (0ms); gpu time: 7.07667ms
GPU luminance CDF in 15569us (15ms); gpu time: 10.0554ms
Post-Processed in 0us (0ms)
SaveImage in 748803us (748ms)
Done!

C:\Projects\CUDA\My\HDR Tone Mapping\bin>hdrToneMap "Media\memorial.exr" "Media\memorial-tm.exr"
Loaded HDR image in 31227us (31ms)
rgb to xyY in 15570us (15ms); gpu time: 1.59734ms
GPU luminance CDF in 0us (0ms); gpu time: 1.67718ms
Post-Processed in 0us (0ms)
SaveImage in 62400us (62ms)
Done!

C:\Projects\CUDA\My\HDR Tone Mapping\bin>hdrToneMap "Media\memorial_large.exr" "Media\memorial_lar
Loaded HDR image in 46819us (46ms)
rgb to xyY in 0us (0ms); gpu time: 4.38285ms
GPU luminance CDF in 15583us (15ms); gpu time: 3.77363ms
Post-Processed in 0us (0ms)
SaveImage in 265219us (265ms)
Done!

C:\Projects\CUDA\My\HDR Tone Mapping\bin>hdrToneMap "Media\MtTamWest.exr" "Media\MtTamWest-tm.exr"
Loaded HDR image in 78029us (78ms)
rgb to xyY in 0us (0ms); gpu time: 6.59494ms
GPU luminance CDF in 0us (0ms); gpu time: 7.65834ms
Post-Processed in 15594us (15ms)
SaveImage in 639575us (639ms)
Done!

C:\Projects\CUDA\My\HDR Tone Mapping\bin>hdrToneMap "Media\Ocean.exr" "Media\Ocean-tm.exr"
Loaded HDR image in 93638us (93ms)
rgb to xyY in 15631us (15ms); gpu time: 7.19949ms
GPU luminance CDF in 0us (0ms); gpu time: 8.74659ms
Post-Processed in 15545us (15ms)
SaveImage in 780024us (780ms)
Done!

C:\Projects\CUDA\My\HDR Tone Mapping\bin>hdrToneMap "Media\Spirals.exr" "Media\Spirals-tm.exr"
Loaded HDR image in 109202us (109ms)
rgb to xyY in 15609us (15ms); gpu time: 7.96438ms
GPU luminance CDF in 0us (0ms); gpu time: 10.4417ms
Post-Processed in 0us (0ms)
SaveImage in 733220us (733ms)
Done!

C:\Projects\CUDA\My\HDR Tone Mapping\bin>hdrToneMap "Media\Tree.exr" "Media\Tree-tm.exr"
Loaded HDR image in 109194us (109ms)
rgb to xyY in 0us (0ms); gpu time: 6.80957ms
GPU luminance CDF in 0us (0ms); gpu time: 7.29443ms
Post-Processed in 15605us (15ms)
SaveImage in 577199us (577ms)
Done!
```
How does it work?
To simplify things, the project is a console application. In the main function we extract the path of the image we want to modify and the path where we want to save the processed copy, and then call the HdrTone::ApplyFilter function:
```cpp
int main(int argc, char** argv)
{
    . . .
    string imagePath;
    string outputPath;

    if (argc > 2)
    {
        imagePath  = string(argv[1]);
        outputPath = string(argv[2]);
    }
    . . .
    HdrTone hdr;
    try
    {
        hdr.ApplyFilter(imagePath, outputPath);
    }
    catch (exception& e)
    {
        cerr << endl << "ERROR: " << e.what() << endl;
        exit(EXIT_FAILURE);
    }
    . . .
}
```
The HdrTone class contains handles to both the variables we need on the host and those on the device:
```cpp
// class HdrTone
class HdrTone
{
public:
    . . .
    void ApplyFilter(const string& imagePath, const string& outputPath);

private:
    // struct HostData
    // Used on the CPU host.
    // RGB channels, number of rows and columns.
    // This is temporary data we need to initialize the kernel.
    struct HostData
    {
        vector<float> red;
        vector<float> green;
        vector<float> blue;
        size_t rows;
        size_t cols;
        size_t numPixels;
    };

    // struct DeviceData
    // Data used on the GPU device, stored as xyY:
    // Y = luminance (cd/m2); xy = chromaticity coordinates (spectral locus).
    // http://en.wikipedia.org/wiki/CIE_1931_color_space#CIE_xy_chromaticity_diagram_and_the_CIE_xyY_color_space
    struct DeviceData
    {
        float* red;              // input for the conversion from rgb to xyY
        float* green;            // input for the conversion from rgb to xyY
        float* blue;             // input for the conversion from rgb to xyY
        float* x;                // x component of xyY
        float* y;                // y component of xyY
        float* luminance;        // Y component of xyY
        unsigned int* histogram; // histogram of luminance values
        unsigned int* cdf;       // luminance cumulative distribution, calculated from histogram
        float* normalizedCdf;    // normalized cdf
        float* lumMin;           // luminance min value
        float* lumMax;           // luminance max value
        float* lumRange;         // luminance range: max - min
    };

private:
    void ComputeCDF  ();
    void rgb2xyY     (const HostData& hdr);
    void LoadHdrImage(const string& imagePath, HostData& hdr);
    void PostProcess ();
    void SaveImage   (const string& imagePath);

    // CPU verification
    void VerifyGpuComputation();

private:
    HostData   m_host;   // helper structure for items used on the CPU
    DeviceData m_device; // helper structure to allocate memory on the GPU

    static const unsigned int NumBins = 1024; // number of bins for the luminance cdf histogram.
                                              // NOTE that we hard-code 1024 in many places in the kernels.
                                              // Do not just change this number without refactoring the kernel code.
};
```
Here are the steps required to apply tone mapping to an HDR image:
1. Obviously, load the image from disk and split it into red, green, and blue channels;
2. Transform the RGB color space to xyY;
3. Compute the luminance cumulative distribution (CDF):
a) Find the luminance (capital Y in xyY) min and max values and the range between them;
b) Compute luminance histogram using the following formula:
```cpp
vector<unsigned int> histogram(NumBins, 0);

for (int i = 0; i < m_host.numPixels; ++i)
{
    unsigned int bin = (luminance[i] - lumMin) / lumRange * NumBins;
    bin = std::min(bin, NumBins - 1);
    histogram[bin]++;
}
```
c) Compute the luminance cumulative distribution (CDF):
```cpp
vector<unsigned int> cdf(NumBins, 0);

for (int i = 1; i < NumBins; ++i)
{
    cdf[i] = cdf[i - 1] + histogram[i - 1];
}
```
4. Normalize the luminance CDF:
```cpp
float k     = (float)cdf[NumBins - 1];
float value = (float)cdf[idx] / k;
norm[idx]   = value;
```
5. Apply tone map, in other words, map each luminance value to its new value and then transform back to RGB.
6. Save image back to disk.
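As a worked example of step 2 above, here is a CPU-side sketch of an RGB to xyY conversion. The 3x3 matrix is the standard linear-sRGB-to-CIE-XYZ transform; the constants in the actual rgb2xyY kernel may differ, so treat this as an illustration rather than the project's exact math:

```cpp
struct XyY
{
    float x; // chromaticity x
    float y; // chromaticity y
    float Y; // luminance
};

// Convert one linear RGB pixel to xyY: first RGB -> XYZ via the
// standard linear-sRGB matrix, then project XYZ onto the xy plane.
XyY rgbToXyY(float r, float g, float b)
{
    float X = 0.4124f * r + 0.3576f * g + 0.1805f * b;
    float Y = 0.2126f * r + 0.7152f * g + 0.0722f * b;
    float Z = 0.0193f * r + 0.1192f * g + 0.9505f * b;

    float sum = X + Y + Z;
    if (sum <= 0.0f)
        return XyY{0.0f, 0.0f, 0.0f}; // pure black has no chromaticity

    return XyY{X / sum, Y / sum, Y};
}
```

The reason for working in xyY at all is that the tone-mapping step changes only Y; the chromaticity pair (x, y) is carried through untouched so colors do not shift.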
Here’s the ApplyFilter function (with the noise removed):
```cpp
// Applies tone mapping to an HDR image
void HdrTone::ApplyFilter(const string& imagePath, const string& outputPath)
{
    LoadHdrImage(imagePath, m_host);
    rgb2xyY(m_host);
    ComputeCDF();
    PostProcess();
    SaveImage(outputPath);

#if 0 // Change to 1 to enable
    // Validate the GPU conversion against the CPU result.
    VerifyGpuComputation();
#endif
}
```
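Steps 4 and 5 of the list above boil down to replacing each pixel's luminance with its normalized CDF value, using the same binning formula as the histogram. A sequential sketch (the helper name `toneMapLuminance` is mine, not from the project, and it does not interpolate between bins):

```cpp
#include <algorithm>
#include <vector>

// Map one luminance value through the normalized CDF.
// normCdf has one entry per histogram bin, each in [0, 1].
float toneMapLuminance(float lum, float lumMin, float lumRange,
                       const std::vector<float>& normCdf)
{
    unsigned int bins = static_cast<unsigned int>(normCdf.size());

    // Same binning formula used when building the histogram.
    unsigned int bin = static_cast<unsigned int>((lum - lumMin) / lumRange * bins);
    bin = std::min(bin, bins - 1);

    // The new luminance is the fraction of pixels at or below the old one,
    // which is what flattens the histogram.
    return normCdf[bin];
}
```

Because the CDF is monotonically increasing, this remap preserves the ordering of luminance values while spreading them evenly over [0, 1].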
Summary
- In this article I do not explain the parallel algorithms themselves; you should read about them in the references listed above. I was experimenting with my kernels and left plenty of room for improvement. I hope the code helps those learning parallel programming. Enjoy!
Source Code Files
- program.cpp – the main function;
- HdrTone.h and HdrTone.cpp – responsible for loading and saving images, initializing and calling CUDA kernels, and managing memory;
- HdrTone.cu – CUDA kernels;
- utilities – checks runtime errors and compares pixels from the GPU conversion to a reference conversion on the CPU;
- gputimer – events to measure execution time on the GPU in milliseconds.