Here is Second sample, convolution.
Convolution is calculation like below,
for (n=0; n<end;n++) output[n] = Sum_of_m{ data[n - m]*filter[m] }
used in many mathematics / statistics / physics / computing fields.
FIR filter also uses convolution.

Convolution is O(n * m) algorithm, so when m goes large calculation will be quite heavy.

Let's see result first. This code calculates
294,912 samples * 32,768 length filter = 9,663,676,416 load-load-multiply-add operation.

Using device 0: GeForce GTX 280
Calculated  293888 / 294912
CPU        time =  18923 msec, out[20000]=13.800844, out[200000]=-50.197727
Calculated  294912 / 294912
GPU(1)     time =    827 msec, out[20000]=13.800883, out[200000]=-50.197735
Calculated  294912 / 294912
GPU(2)     time =    234 msec, out[20000]=13.800883, out[200000]=-50.197735
Calculated  294912 / 294912
GPU(2)UR4  time =    156 msec, out[20000]=13.800883, out[200000]=-50.197735
Calculated  294912 / 294912
GPU(2)UR8  time =    140 msec, out[20000]=13.800883, out[200000]=-50.197735
Calculated  294912 / 294912
GPU(2)UR16 time =    140 msec, out[20000]=13.800883, out[200000]=-50.197735
Calculated  294912 / 294912
GPU(3)     time =    265 msec, out[20000]=13.800883, out[200000]=-50.197735
Calculated  294912 / 294912
GPU(3)UR   time =    140 msec, out[20000]=13.800883, out[200000]=-50.197735
Copy  done  294912 / 294912
COPYonly time =     16 msec.
Press ENTER to exit...



ConvSample.zip

UR means " #pragma unroll X", loop unrolling.
Fastest version is 18923/140 = 135 times faster than CPU.
You will mind "COPY overhead" of CUDA. I added "copy only" time at bottom. 16 msec is about 10% of total operation.



1. CPU

void convolution_CPU(float * input, int inLen, float * filter, int filterLen, float *output)
{
    float fOut;
    for (int i = 0; i < inLen - filterLen; i++)
    {
        fOut = 0.0;
        for (int j = 0; j < filterLen; j++)
        {
            fOut += input[i + j]*filter[j];
        }
        output[i] = fOut;

        if (i % 1024 == 0) printf("Calculated %7d / %d\r", i, 294912);
    }
}


Just a multiply - sum - store.

2. GPU (1) straight transform

__global__ void Kernel_Convolution1(const float * input, const int inLen, const float * filter, const int filterLen, float * output)
{
    // access Block Width
    const unsigned int bw = gridDim.x;
    // access Block ID
    const unsigned int bix = blockIdx.x;
    // access thread id
    const unsigned int tid = threadIdx.x;

    //There are no optimization.
    //just a loop.

    float fOut;
    //there are THREADNUM * bw total threads.
    //My thread number (from beginning) = BlockIDX*threadsinblock + threadID.
    for (int i = 0; i < inLen - filterLen; i = i + THREAD_NUM * bw )
    {
        fOut = 0.0;
        for (int j = 0; j < filterLen; j++)
        {
            fOut += input[i + THREAD_NUM * bix + j + tid ]*filter[j];
        }
        __syncthreads();
        output[i + THREAD_NUM * bix + tid] = fOut;
    }
}


outer [i] Loop was changed from i++ to { i + THREAD_NUM * bw }.
This Jump means, the function is executed in parallel, by many threads.
So one thread does not have to calculate all of [i].

3. GPU(2) shared memory
the arguments *input *filter, are placed in Board Memory (Global Memory), which you can see around GPU chip.
They are huge (512MB - 1GB) but very slow. It will take hundreds clocks to access these data.
GPU scheduling will hide this latency, but this delay should be resolved.
(2) uses shared memory as scratch pad, it is 16KB per shader processer, GPU chip's internal memory.

__global__ void Kernel_Convolution2(const float * input, const int inLen, const float * filter, const int filterLen, float * output)
{
    // access Block Width
    const unsigned int bw = gridDim.x;
    // access Block ID
    const unsigned int bix = blockIdx.x;
    // access thread id
    const unsigned int tid = threadIdx.x;

    //fiter, and input are used many times, from global memory.
    //let's copy to shared memory, then calculate.

    __shared__ float s_indata[THREAD_NUM+THREAD_NUM];
    __shared__ float s_filter[THREAD_NUM];

    float fOut;
    //there are THREADNUM * bw total threads.
    //My thread number (from beginning) = BlockIDX*threadsinblock + threadID.
    for (int i = 0; i < inLen - filterLen; i = i + THREAD_NUM * bw )
    {
        fOut = 0.0;

        for (int j = 0; j < filterLen ; j = j + THREAD_NUM)
        {
            //copy global to shared memory, for threads
            s_filter[tid             ] = filter[j + tid];
            s_indata[tid             ] = input[i + THREAD_NUM * bix + j + tid             ];
            s_indata[tid + THREAD_NUM] = input[i + THREAD_NUM * bix + j + tid + THREAD_NUM];
            __syncthreads();

            //use shared as scratch memory.
//#pragma unroll X
            for (int k = 0; k < THREAD_NUM; k++)
            {
                fOut += s_indata[k + tid] * s_filter[k];
            }
        }
        __syncthreads();
        output[i + THREAD_NUM * bix + tid] = fOut;
    }
}



inportant point is, one thread copy only 1 word from filter, 2 word from data, then runs loop for THREAD_NUM.
This is because copy formula has "tid", thread id in address as offset. Then these addresses are "coalesched" to global memory.

Destination is "Shared" memory. It means, One thread can use another word which was copied by another thread.
There is __syncthreads() routine after 3 words copy.
By this function, each thread will wait until all source memory(on board) are copied to destination(on chip). Now they can be "Shared".

also you can see __syncthreads() routine before store operation.
By this sync, access to board memory are coalesched, packed.

Most inner Loop should have "#pragma unroll" . By unrolling 8 loops,  best performance achieved.
In other word, 8 unroll is enough to hide memory access latency.

GPU(3) is just for comparison. you will wonder "There are wasted copy in (2). if I use large shared, less wasted, it will be faster." But  no improvement for performance,  code is more difficult to understand.
I will not recommend this optimization.