Here is Second sample, convolution.
Convolution is calculation like below,
for (n=0; n<end;n++) output[n] = Sum_of_m{
data[n - m]*filter[m] }
used in many mathematics / statistics / physics / computing fields.
FIR filter also uses convolution.
Convolution is O(n * m) algorithm, so when m goes large
calculation will be quite heavy.
Let's see result first. This code calculates
294,912 samples * 32,768 length filter = 9,663,676,416 load-load-multiply-add operation.
Using device 0: GeForce GTX 280
Calculated 293888 / 294912
CPU time = 18923 msec, out[20000]=13.800844, out[200000]=-50.197727
Calculated 294912 / 294912
GPU(1) time = 827 msec, out[20000]=13.800883, out[200000]=-50.197735
Calculated 294912 / 294912
GPU(2) time = 234 msec, out[20000]=13.800883, out[200000]=-50.197735
Calculated 294912 / 294912
GPU(2)UR4 time = 156 msec, out[20000]=13.800883, out[200000]=-50.197735
Calculated 294912 / 294912
GPU(2)UR8 time = 140 msec, out[20000]=13.800883, out[200000]=-50.197735
Calculated 294912 / 294912
GPU(2)UR16 time = 140 msec, out[20000]=13.800883, out[200000]=-50.197735
Calculated 294912 / 294912
GPU(3) time = 265 msec, out[20000]=13.800883, out[200000]=-50.197735
Calculated 294912 / 294912
GPU(3)UR time = 140 msec, out[20000]=13.800883, out[200000]=-50.197735
Copy done 294912 / 294912
COPYonly time = 16 msec.
Press ENTER to exit...
ConvSample.zip
UR means " #pragma unroll X", loop unrolling.
Fastest version is 18923/140 = 135 times faster than CPU.
You will mind "COPY overhead" of CUDA. I added "copy only" time at bottom. 16 msec is about 10% of total operation.

1. CPU
void
convolution_CPU(float * input, int inLen, float * filter, int
filterLen, float *output)
{
float fOut;
for (int i = 0; i < inLen -
filterLen; i++)
{
fOut =
0.0;
for (int
j = 0; j < filterLen; j++)
{
fOut += input[i + j]*filter[j];
}
output[i]
= fOut;
if (i %
1024 == 0) printf("Calculated %7d / %d\r", i, 294912);
}
}
Just a multiply - sum -
store.
2. GPU (1) straight transform
__global__ void
Kernel_Convolution1(const float * input, const int inLen, const float *
filter, const int filterLen, float * output)
{
// access Block Width
const unsigned int bw = gridDim.x;
// access Block ID
const unsigned int bix = blockIdx.x;
// access thread id
const unsigned int tid = threadIdx.x;
//There are no optimization.
//just a loop.
float fOut;
//there are THREADNUM * bw total threads.
//My thread number (from beginning) =
BlockIDX*threadsinblock + threadID.
for (int i = 0; i < inLen -
filterLen; i = i + THREAD_NUM * bw )
{
fOut =
0.0;
for (int
j = 0; j < filterLen; j++)
{
fOut += input[i + THREAD_NUM * bix + j +
tid ]*filter[j];
}
__syncthreads();
output[i
+ THREAD_NUM * bix + tid] = fOut;
}
}
outer [i] Loop was
changed from i++ to { i + THREAD_NUM * bw }.
This Jump means, the function is executed in parallel, by many threads.
So one thread does not have to calculate all of [i].
3. GPU(2) shared memory
the arguments *input *filter, are placed in Board Memory (Global Memory),
which you can see around GPU chip.
They are huge (512MB - 1GB) but very slow. It will take hundreds clocks
to access these data.
GPU scheduling will hide this latency, but this delay should be
resolved.
(2) uses shared memory as scratch pad, it is 16KB per shader processer, GPU chip's internal memory.
__global__
void Kernel_Convolution2(const float * input, const int inLen, const
float * filter, const int filterLen, float * output)
{
// access Block Width
const unsigned int bw = gridDim.x;
// access Block ID
const unsigned int bix = blockIdx.x;
// access thread id
const unsigned int tid = threadIdx.x;
//fiter, and input are used many times,
from global memory.
//let's copy to shared memory, then
calculate.
__shared__ float
s_indata[THREAD_NUM+THREAD_NUM];
__shared__ float s_filter[THREAD_NUM];
float fOut;
//there are THREADNUM * bw total threads.
//My thread number (from beginning) =
BlockIDX*threadsinblock + threadID.
for (int i = 0; i < inLen -
filterLen; i = i + THREAD_NUM * bw )
{
fOut =
0.0;
for (int
j = 0; j < filterLen ; j = j + THREAD_NUM)
{
//copy global to shared memory, for
threads
s_filter[tid
] = filter[j + tid];
s_indata[tid
] = input[i + THREAD_NUM * bix + j +
tid
];
s_indata[tid + THREAD_NUM] = input[i +
THREAD_NUM * bix + j + tid + THREAD_NUM];
__syncthreads();
//use shared as scratch memory.
//#pragma unroll X
for (int k = 0; k < THREAD_NUM;
k++)
{
fOut +=
s_indata[k + tid] * s_filter[k];
}
}
__syncthreads();
output[i
+ THREAD_NUM * bix + tid] = fOut;
}
}
inportant point is, one thread copy only 1 word from filter, 2 word from data, then runs loop for THREAD_NUM.
This is because copy formula has "tid", thread id in address as offset. Then these addresses are "coalesched" to global memory.
Destination is "Shared" memory. It means, One thread can use another word which was copied by another thread.
There is __syncthreads() routine after 3 words copy.
By this
function, each thread will wait until all source memory(on board) are
copied to destination(on chip). Now they can be "Shared".
also you can see __syncthreads() routine before store operation.
By this sync, access to board memory are coalesched, packed.
Most inner Loop should have "#pragma unroll" . By unrolling 8 loops, best performance achieved.
In other word, 8 unroll is enough to hide memory access latency.
GPU(3) is just for comparison. you will wonder "There are wasted copy
in (2). if I use large shared, less wasted, it will be faster." But
no improvement for performance, code is more difficult to
understand.
I will not recommend this optimization.