HDR Tone Mapping with CUDA 5

Note that the code for the article is available here.
In this example, for the sake of learning, we are going to butcher some great images. Let’s describe the problem first.
We are going to take some HDR images and modify their luminosity to lighten them. We could also darken the images, or apply changes only to parts of the images, but for now we will make entire image lighter.
HDR is short for High Dynamic Range. It is a post-processing method of taking either one image or a series of images, combining them, and adjusting the contrast ratios to do things that are virtually impossible with a single aperture and shutter speed.
If you are familiar with Photoshop, you can change image mode from normal 8-bits per channel to 32-bits. Experiment with it and see how you can make a color redder than red, for example:
See that 3.4 in the red channel below?! You would think that you could not make the color value greater than 1. By setting the value greater than 1 you can achieve stunning effects in Photoshop.
Here are couple examples, first a simple experiment. Notice the glow:
And a much more complex design where the words ‘Carmina Burana’ are created using these brighter than visible limit colors:
All right, we came to a new image format, EXR. I’ll just quote what it is from its website:

OpenEXR is a high dynamic-range (HDR) image file format developed by Industrial Light & Magic for use in computer imaging applications.

OpenEXR is used by ILM on all motion pictures currently in production. The first movies to employ OpenEXR were Harry Potter and the Sorcerers Stone, Men in Black II, Gangs of New York, and Signs. Since then, OpenEXR has become ILM’s main image file format.

Some amazing stuff!

Now about butchering the images for the sake of learning… After we write the code we should be able to take an existing exr image and change its luminosity. The image at the top is an original taken from OpenEXR SDK, and the one at the bottom is after we ran our code on the image:
Note, that I had to convert 32-bit images to 8-bits per channel in order to show them here. Still you get the difference. What we have done is we took histogram of luminance distributions in the original image and pretty much removed all extremes from it as well as flattened it by distributing colors evenly (or butchered it!):
That’s OK for learning, we can now improve our code dramatically to only modify regions of the image or tones (bright, dark, or middle) we want to!
Here’s an example where our inputs into the alforithm produced a much better result transforming a very dark image into an image with much more detail, although even there we probably over-did it:
References:
also find as many tutorials as possible describing the following topics:
Parallel sum (prefix)
GPU thread communication
Parallel scan algorithms, for example Hillis-Steele or Blelloch
Parallel reduction algorithms

Important

A very important note. GPU on my laptop supports 1024 threads per bloc and I can schedule kernels 1024 threads wide:
Using device: 0
   Name:                    Quadro 5000M
   Compute version:         2.0
   Global memory:           2048 mb
   Const memory:            64 kb
   L2 cache size:           512 kb
   Clock rate:              810 mhz
   Timeout enabled:         true
   Multiprocessors:         10
   Max grid size:           65535 : 65535 : 65535
   Max threads per SM:      1536
   Max threads per block:   1024
   Registers per block:     32768
   Shared memory per block: 48 kb
   Memory bus width:        256 bits
   Memory clock rate:       1200 mhz
Your GPU may only support 512 threads per block. In that case you will have to modify kernel invocation code by replacing BLOCK_WIDTH variable below with separate variables for block width in x and y dimensions. Commonly you will select 32 threads in x dimension and 16 in y for the total of 512:
	// Compute luminance histogram
	static const int BLOCK_WIDTH = 32;						// threads per block; because we are setting 2-dimensional block, the total number of threads is 32^2, or 1024
															// 1024 is the maximum number of threads per block for modern GPUs.

	int width  = static_cast<int>(ceilf(static_cast<float>(cols) / BLOCK_WIDTH));
	int height = static_cast<int>(ceilf(static_cast<float>(rows) / BLOCK_WIDTH));

	const dim3 grid (width, height, 1);						// number of blocks
	const dim3 block(BLOCK_WIDTH, BLOCK_WIDTH, 1);			// block width: number of threads per block

	// Compute histogram
	hdr_compute_histogram<<<grid, block>>>(
			histogram,
			luminance,
			lumMin,
			lumRange,
			static_cast<int>(rows),
			static_cast<int>(cols),
			bins
		);

	hr = cudaDeviceSynchronize();

Examples of calling the program and time it took to execute

C:\Projects\CUDA\My\HDR Tone Mapping\bin>hdrToneMap "Media\desk.exr" "Media\desk-tm.exr"

Loaded HDR image in 156003us (156ms)
rgb to xyY in 0us (0ms); gpu time: 5.12362ms
GPU luminance CDF in 15614us (15ms); gpu time: 5.53718ms
Post-Processed in 0us (0ms)
SaveImage in 389987us (389ms)
Done!

C:\Projects\CUDA\My\HDR Tone Mapping\bin>hdrToneMap "Media\GoldenGate.exr" "Media\GoldenGate-tm.exr"

Loaded HDR image in 140427us (140ms)
rgb to xyY in 0us (0ms); gpu time: 7.07667ms
GPU luminance CDF in 15569us (15ms); gpu time: 10.0554ms
Post-Processed in 0us (0ms)
SaveImage in 748803us (748ms)
Done!

C:\Projects\CUDA\My\HDR Tone Mapping\bin>hdrToneMap "Media\memorial.exr" "Media\memorial-tm.exr"

Loaded HDR image in 31227us (31ms)
rgb to xyY in 15570us (15ms); gpu time: 1.59734ms
GPU luminance CDF in 0us (0ms); gpu time: 1.67718ms
Post-Processed in 0us (0ms)
SaveImage in 62400us (62ms)
Done!

C:\Projects\CUDA\My\HDR Tone Mapping\bin>hdrToneMap "Media\memorial_large.exr" "Media\memorial_lar

Loaded HDR image in 46819us (46ms)
rgb to xyY in 0us (0ms); gpu time: 4.38285ms
GPU luminance CDF in 15583us (15ms); gpu time: 3.77363ms
Post-Processed in 0us (0ms)
SaveImage in 265219us (265ms)
Done!

C:\Projects\CUDA\My\HDR Tone Mapping\bin>hdrToneMap "Media\MtTamWest.exr" "Media\MtTamWest-tm.exr"

Loaded HDR image in 78029us (78ms)
rgb to xyY in 0us (0ms); gpu time: 6.59494ms
GPU luminance CDF in 0us (0ms); gpu time: 7.65834ms
Post-Processed in 15594us (15ms)
SaveImage in 639575us (639ms)
Done!

C:\Projects\CUDA\My\HDR Tone Mapping\bin>hdrToneMap "Media\Ocean.exr" "Media\Ocean-tm.exr"

Loaded HDR image in 93638us (93ms)
rgb to xyY in 15631us (15ms); gpu time: 7.19949ms
GPU luminance CDF in 0us (0ms); gpu time: 8.74659ms
Post-Processed in 15545us (15ms)
SaveImage in 780024us (780ms)
Done!

C:\Projects\CUDA\My\HDR Tone Mapping\bin>hdrToneMap "Media\Spirals.exr" "Media\Spirals-tm.exr"

Loaded HDR image in 109202us (109ms)
rgb to xyY in 15609us (15ms); gpu time: 7.96438ms
GPU luminance CDF in 0us (0ms); gpu time: 10.4417ms
Post-Processed in 0us (0ms)
SaveImage in 733220us (733ms)
Done!

C:\Projects\CUDA\My\HDR Tone Mapping\bin>hdrToneMap "Media\Tree.exr" "Media\Tree-tm.exr"

Loaded HDR image in 109194us (109ms)
rgb to xyY in 0us (0ms); gpu time: 6.80957ms
GPU luminance CDF in 0us (0ms); gpu time: 7.29443ms
Post-Processed in 15605us (15ms)
SaveImage in 577199us (577ms)
Done!

How does it work?

To simplify things, the project is a console application. In the main function we extract image path that we want to modify and another image path where we want to save processed copy, and then make a call to HdrTone ApplyFilter function:
int main(int argc, char** argv) 
{ 
. . . 
    string imagePath; 
    string outputPath; 

    if (argc > 2) 
    { 
        imagePath = string(argv[1]); 
        outputPath = string(argv[2]); 
    } 

. . . 

    HdrTone            hdr; 

    try 
    { 
        hdr.ApplyFilter            (imagePath, outputPath); 
    } 
    catch(exception& e) 
    { 
        cerr << endl << "ERROR: " << e.what() << endl; 
        exit(EXIT_FAILURE); 
    }
HdrTone class contains both handles to variables we need on the host and those on the device:
// class HdrTone 
class HdrTone 
{ 
public: 
. . . 
    void ApplyFilter(const string& imagePath, const string& outputPath); 

private: 

    // struct HostData 
    //        Used on the CPU host 
    //        RGB channels, number of rows and columns. 
    //        This is a temporary data we need to initialize kernel. 
    struct HostData 
    { 
        vector<float>    red; 
        vector<float>    green; 
        vector<float>    blue; 
        size_t            rows; 
        size_t            cols; 
        size_t            numPixels; 
    }; 

    // struct xyY 
    //        Data is used on the GPU device 
    //        Y = Luminescence (cd/m2)   xy = chromaticity co-ordinates (spectral locus)  
    //        http://en.wikipedia.org/wiki/CIE_1931_color_space#CIE_xy_chromaticity_diagram_and_the_CIE_xyY_color_space 
    struct DeviceData 
    { 
        float*            red;                        // input for the conversion from rgb to xyY 
        float*            green;                        // input for the conversion from rgb to xyY 
        float*            blue;                        // input for the conversion from rgb to xyY 
        float*            x;                            // component of xyY 
        float*            y;                            // component of xyY 
        float*            luminance;                    // Y component of xyY 
        unsigned int*    histogram;                    // histogram of luminance values 
        unsigned int*    cdf;                        // luminance cumulative distribution, calculated from histogram 
        float*            normalizedCdf;                // normalized cdf 
        float*            lumMin;                        // luminance min value 
        float*            lumMax;                        // luminance max value 
        float*            lumRange;                    // luminance range:max - min 
    }; 

private: 
    void ComputeCDF                (); 
    void rgb2xyY                (const HostData& hdr); 
    void LoadHdrImage            (const string& imagePath, HostData& hdr); 
    void PostProcess            (); 
    void SaveImage                (const string& imagePath); 

    // CPU verification 
    void VerifyGpuComputation    (); 

private: 
    HostData        m_host;                            // helper structure for items used on the CPU 
    DeviceData        m_device;                        // helper structure to allocate memory on GPU 

    static const unsigned int NumBins = 1024;        // number of bins for luminocity cdf histogram. 
                                                    // NOTE that we hard-code 1024 in many places in kernels. 
                                                    // Do not just change this number without re-factoring kernel code.
Here are the steps we have to do to apply toning to an HDR image:
  1. Obviously, load image from disk and split it into red, green, and blue channels;
  2. Transform RGB color space to xyY;
  3. Compute luminance cumulative distribution (CDF):

a) Find luminance (capital Y in xyY) min, max value and range between them;

b) Compute luminance histogram using the following formula:

vector<unsigned int> histogram(NumBins, 0); 

for (int i = 0; i < m_host.numPixels; ++i) 
{ 
    unsigned int bin = (luminance[i] - lumMin) / lumRange * NumBins; 
    bin = std::min(bin, NumBins - 1); 
    histogram[bin]++; 
}
 c) Compute luminance cumulative distribution (CDF)
vector<unsigned int> cdf(NumBins, 0); 

for (int i = 1; i < NumBins; ++i) 
{ 
    cdf[i] = cdf[i - 1] + histogram[i - 1]; 
}
4. Normalize luminance CDF
float k = (float)cdf[numBins - 1]; 

float value = (float)cdf[idx] / k; 

norm[idx] = value;

5. Apply tone map, in other words, map each luminance value to its new value and then transform back to RGB.

 6. Save image back to disk.
Here’s the ApplyFilter function (with the noise removed):
// Applies gaussian blur to an image 
void HdrTone::ApplyFilter(const string& imagePath, const string& outputPath) 
{ 
    LoadHdrImage(imagePath, m_host); 
    rgb2xyY(m_host); 
    ComputeCDF(); 
    PostProcess(); 
    SaveImage(outputPath); 

#if 0            // Change to 1 to enable 
    // Validate GPU convertion against CPU result. 
    VerifyGpuComputation(); 
#endif 
}

Summary

In this article I am not explaining parallel algorithms; you should read about them in the references that I listed above. I was experimenting with my kernels and left a lot of improvements that can be done. Hope the code will help those learning parallel programming. Enjoy!

Source Code Files

program.cpp – main function.
HdrTone.h and HdrTone.cpp – responsible for loading and saving images, initializing and calling CUDA kernels, managing memory;
HdrTone.cu – CUDA kernels;
utilities – checks runtime errors and compares pixels from GPU conversion to reference conversion on CPU;
gputimer – events to measure execution time on the GPU in milliseconds
Note that the code for the article is available here.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: