Matrix Transpose on GPU using CUDA

The following sample demonstrates matrix transpose on the GPU. It starts with sequential code on the CPU and progresses toward more advanced optimizations: first a parallel transpose on the CPU, then several variants on the GPU. In real life it is impractical to do just a single matrix operation on the GPU because of the cost of copying data from the host to the device and back; you would want to execute multiple operations on the data to compensate for the transfers. I have to make a disclaimer that my notebook is three years old and due for an upgrade, but the advice to keep data on the GPU for multiple operations still holds. To make the exercise meaningful I used huge matrices, up to ~800MB.

Note that the code executed six matrix transposes on the GPU (including moving the matrix to the GPU and back) faster than the sequential code executed one, and in almost the same time as the parallel code executed one on the CPU.

Here are the results of my test runs:

Matrix size:   256MB  (8192 x 8192)
Serial on CPU: 1036160us (1036ms)
Parallel on CPU: 715091us (715ms)
Copied matrix to the GPU: 278036us (278ms)
  transpose per row:                          182160us (182.16ms)
  transpose per element:                      49380.8us (49.3808ms)
  transpose per element tiled 32x32:          42527.4us (42.5274ms)
  transpose per element tiled 16x16:          15986.8us (15.9868ms)
  transpose per element tiled & padded 32x32: 42520.1us (42.5201ms)
  transpose per element tiled & padded 16x16: 16000.7us (16.0007ms)
Copied matrix to the CPU: 200055us (200ms)
GPU kernel: 864134us (864ms)
Done... Resetting device!

Matrix size:   576MB (12288 x 12288)
Serial on CPU: 2566357us (2566ms)
Parallel on CPU: 1646203us (1646ms)
Copied matrix to the GPU: 578070us (578ms)
  transpose per row:                          429319us (429.319ms)
  transpose per element:                      200222us (200.222ms)
  transpose per element tiled 32x32:          94772.3us (94.7723ms)
  transpose per element tiled 16x16:          36178.4us (36.1784ms)
  transpose per element tiled & padded 32x32: 94198.5us (94.1985ms)
  transpose per element tiled & padded 16x16: 35532.6us (35.5326ms)
Copied matrix to the CPU: 475066us (475ms)
GPU kernel: 1994252us (1994ms)
Done... Resetting device!

Matrix size:   576MB (12288 x 12288)
Serial on CPU: 2577357us (2577ms)
Parallel on CPU: 1643209us (1643ms)
Copied matrix to the GPU: 550103us (550ms)
  transpose per row:                          438458us (438.458ms)
  transpose per element:                      199382us (199.382ms)
  transpose per element tiled 32x32:          95158.2us (95.1582ms)
  transpose per element tiled 16x16:          35546.1us (35.5461ms)
  transpose per element tiled & padded 32x32: 93601.1us (93.6011ms)
  transpose per element tiled & padded 16x16: 35327.2us (35.3272ms)
Copied matrix to the CPU: 440073us (440ms)
GPU kernel: 1938261us (1938ms)
Done... Resetting device!

Matrix size:   784MB (14336 x 14336)
Serial on CPU: 3570456us (3570ms)
Parallel on CPU: 2291295us (2291ms)
Copied matrix to the GPU: 1279162us (1279ms)
  transpose per row:                          534128us (534.128ms)
  transpose per element:                      190986us (190.986ms)
  transpose per element tiled 32x32:          250136us (250.136ms)
  transpose per element tiled 16x16:          113303us (113.303ms)
  transpose per element tiled & padded 32x32: 247859us (247.859ms)
  transpose per element tiled & padded 16x16: 112266us (112.266ms)
Copied matrix to the CPU: 645092us (645ms)
GPU kernel: 3441454us (3441ms)
Done... Resetting device!

The solution consists of a kernel file mtx.cu, a helper class Mtx, and a main function in the program.cpp file.

Main function

Matrix helper class

Run tests function in the Mtx (matrix helper) class

Matrix transpose kernels

This is the entry function into the kernels; it contains six tests varying in optimization techniques.
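
As an illustration only, here is a minimal sketch of what such a dispatcher could look like, timing a single variant with CUDA events. The function name run_transpose_tests, the pointer names, and the use of CUDA events are assumptions for the sketch, not the exact code from the download.

#include <cstdio>
#include <cuda_runtime.h>

// Sketch only: times one of the six variants; the remaining kernels
// would be timed the same way with their own launch configurations.
void run_transpose_tests(float* d_t, const float* d_m, int matrixSize)
{
    dim3 block(32, 32);
    dim3 grid((matrixSize + block.x - 1) / block.x,
              (matrixSize + block.y - 1) / block.y);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    transpose_per_element_tiled<<<grid, block, block.x * block.y * sizeof(float)>>>(d_t, d_m, matrixSize);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("  transpose per element tiled 32x32: %fms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}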

Kernels

Transpose per row

The simplest of the kernels. Each thread loops over every column value in its row and writes it to the transposed location in global memory. This is the slowest way to run this code on the GPU.

Here’s the code
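
The listing below is a minimal sketch of such a per-row kernel, assuming one thread per row, row-major square matrices, and the same signature as the tiled kernel shown later; it is an illustration, not the exact code from the download.

__global__
void transpose_per_row(float* t, const float* m, int matrixSize)
{
    // One thread per row: each thread walks its entire row and scatters
    // the values into the destination column.
    int row = blockIdx.x * blockDim.x + threadIdx.x;

    if (row >= matrixSize)
        return;

    for (int col = 0; col < matrixSize; ++col)
        t[col * matrixSize + row] = m[row * matrixSize + col];
}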

Transpose per element

Each thread is responsible for copying a single value to its destination. This runs much faster than the previous example, but could be improved further by coalescing write access (ensuring that writes go to adjacent memory locations).
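
A minimal sketch of such a per-element kernel, under the same assumptions as the per-row sketch above (one value per thread, row-major square matrices):

__global__
void transpose_per_element(float* t, const float* m, int matrixSize)
{
    // One thread per element: coalesced read, scattered (non-coalesced) write.
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (col >= matrixSize || row >= matrixSize)
        return;

    t[col * matrixSize + row] = m[row * matrixSize + col];
}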

Coalesced write

The difference in execution time depends on the block size, that is, on the number of threads executing concurrently. The idea here is to first read from coalesced (adjacent) global memory locations and write the transposed values into temporary memory on the chip, then read from that temporary tile and write to new global memory locations which are also coalesced (adjacent), although different from the global locations that were read.

Matrix size:   784MB (14336 x 14336)
Serial on CPU: 3680235us (3680ms)
Parallel on CPU: 2294109us (2294ms)
Copied matrix to the GPU: 1714101us (1714ms)
  transpose per row:                          523030us (523.03ms)
  transpose per element:                      185965us (185.965ms)
  transpose per element tiled 32x32:          132566us (132.566ms)
  transpose per element tiled 16x16:          63028.6us (63.0286ms)
  transpose per element tiled & padded 32x32: 123514us (123.514ms)
  transpose per element tiled & padded 16x16: 61854.2us (61.8542ms)
Copied matrix to the CPU: 726024us (726ms)
GPU kernel: 3624254us (3624ms)
Done... Resetting device!
__global__
void transpose_per_element_tiled(float* t, const float* m, int matrixSize)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;        // global column
    int row = blockIdx.y * blockDim.y + threadIdx.y;        // global row

    // Early return assumes matrixSize is a multiple of the block dimensions;
    // otherwise some threads of a block would skip the __syncthreads() below.
    if (col >= matrixSize || row >= matrixSize)
        return;

    extern __shared__ float tile[];

    // Coalesced read from global memory - row-major write into the shared tile
    // (in-tile indices assume a square block, blockDim.x == blockDim.y)

    int from = row * matrixSize + col;
    int tx   = threadIdx.y + threadIdx.x * blockDim.x;    // transposed index within the tile
    int ty   = threadIdx.y * blockDim.y + threadIdx.x;    // row-major index within the tile

    tile[ty] = m[from];
    __syncthreads();

    // Transposed read from shared memory - coalesced write to global memory
    int to   = (blockIdx.y * blockDim.y + threadIdx.x) + (blockIdx.x * blockDim.x + threadIdx.y) * matrixSize;

    t[to] = tile[tx];
}
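
Because the tile is declared extern __shared__, its size must be supplied as the third launch parameter. A minimal launch sketch, assuming device pointers d_t and d_m (already allocated with cudaMalloc) and the 16x16 block that was fastest in the runs above:

// Inside a host function, after copying the source matrix to d_m.
dim3 block(16, 16);
dim3 grid((matrixSize + block.x - 1) / block.x,
          (matrixSize + block.y - 1) / block.y);
size_t tileBytes = block.x * block.y * sizeof(float);   // dynamic shared memory for the tile

transpose_per_element_tiled<<<grid, block, tileBytes>>>(d_t, d_m, matrixSize);
cudaDeviceSynchronize();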


Download code here.
