Here is code sample "Before" cuda and "After" cuda.
Sample is just an Outer / Inner Loop. I think if you are programmer, you saw and wrote many.
// Calculate Inverse DFT for desired Frequency Responces.
// Result is Filter Coeffs (not windowed
yet)
int i,k;
float omega;
float otemp;
clock_t time3 = clock();
for (i = 0; i < Taps; i++)
{
Tempcoeff_r[i] = 0.0; //Initialize Coeffs
to zero
otemp =
(float)(2.0 * PI * i / Taps);
for (k =
0; k < Taps; k++)
{
//Calculate Coeffs by sum up 0 to Tap
Count
//omega = (float)(2.0 * PI * k * i /
Taps);
omega = (float)(otemp * k);
Tempcoeff_r[i] = Tempcoeff_r[i] +
FreqResR[k] * cos(omega);
}
if
(((i+1)%1024)==0) printf("calculating coeff, count = %10d\r", i);
}
clock_t time4 = clock();
printf("\nCPU processing time
= %d sec\n", (time4-time3)/1000);
This is a piece of FIR
parameter calculation. There are Loop Count "Taps" exists both Outer, Inner Loop, O(N^2). when I want long TAPs FIR
parameter, execution time will go pretty longer.
This Outer / Inner Loop is CUDAble like below.
#define
THREAD_NUM 512
#define CUDABLOCKS 16
__global__ void Kernel_CalcCoeffs(float *g_Tempcoeff_r, const float
*g_FreqResR, const int Taps, const int currentblock)
{
// access Block ID
const unsigned int bix = blockIdx.x;
// access thread id
const unsigned int tid = threadIdx.x;
float tempcoeff;
float omega;
//Only Inner Loop exists here. index of
Outer Loop, i is calculated as..
// (Current Outer Loop Index) + (Current
Block, this thread exists) + Thread ID.
int i = THREAD_NUM*CUDABLOCKS*currentblock +
THREAD_NUM*bix + tid;
tempcoeff = 0.0;
for (int k = 0; k < Taps; k++)
{
omega =
2.0 * PI * k * i / Taps;
tempcoeff
= tempcoeff + g_FreqResR[k] * cos(omega);
}
__syncthreads();
g_Tempcoeff_r[i] = tempcoeff;
}
///
/// Other codes...
///
//CUDA Section
//I want to order 16 blocks * 512
Threads execution to CUDA.
dim3 grid( CUDABLOCKS, 1, 1);
dim3 threads( THREAD_NUM, 1,
1);
//allocate GPU device memory, in Graphic
Board.
float * g_Tempcoeff_r;
CUDA_SAFE_CALL(cudaMalloc((void**)
&g_Tempcoeff_r, sizeof(float)*Taps) );
float * g_FreqResR;
CUDA_SAFE_CALL(cudaMalloc((void**)
&g_FreqResR, sizeof(float)*Taps) );
clock_t time1 = clock();
//Copy Host Array to Device memory.
CUDA_SAFE_CALL( cudaMemcpy(g_FreqResR,
FreqResR, sizeof(float)*Taps, cudaMemcpyHostToDevice));
CUDA_SAFE_CALL(
cudaMemcpy(g_Tempcoeff_r, Tempcoeff_r, sizeof(float)*Taps,
cudaMemcpyHostToDevice));
//Outer Loop is reduced, to
1/(Threads*Blocks).
for(int j = 0; j < Taps /
(THREAD_NUM * CUDABLOCKS); j++)
{
//Inner
Loop is executed in GPU, parallel.
Kernel_CalcCoeffs<<<grid,threads>>>(g_Tempcoeff_r,
g_FreqResR, Taps, j);
CUDA_SAFE_CALL( cudaThreadSynchronize() );
//transfer
result to Host Memory.
CUDA_SAFE_CALL( cudaMemcpy( (void *)(Tempcoeff_r + j* (THREAD_NUM *
CUDABLOCKS)), g_Tempcoeff_r+j* (THREAD_NUM * CUDABLOCKS),
sizeof(float)* (THREAD_NUM * CUDABLOCKS), cudaMemcpyDeviceToHost) );
printf("calculating coeff, count = %10d / %d \r", (j+1)*(THREAD_NUM *
CUDABLOCKS), Taps);
}
clock_t time2 = clock();
cudaFree(g_Tempcoeff_r);
cudaFree(g_FreqResR);
printf("\nCUDA processing time = %d
sec\n", (time2-time1)/1000);
Inner Loop is executed in GPU, runs
parallel. I specified 512 Threads * 16 Blocks, so total 8192 threads
are issued at one time. Outer Loop Count "Jumps" 8192, or reduced
1/8192.
NOTE: Too Large thread*block cause display driver timeout. 128 Threads * 16 Blocks is more safe.
Here is the result.
Creating Frequency Response.
Calculating Coeffs.
calculating coeff, count = 131071
CPU processing time = 1172 sec
calculating coeff, count = 131072 / 131072
CUDA processing time = 21 sec
CUDA achieves x 56 times faster than CPU. It can be more faster by some tuning.
for example, if you use __cosf(x) function, execution time is 14 sec.
NOTE: __cosf(x) is tuned version, little bit less accuracy.
if you "#pragma unroll 16", execution time is 5 sec.
(CPU: AMD Phenom9600, GPU: GTX280)
Additional 40 lines of code reduced 1,150 seconds (almost 20 minutes) of execution time.
I should keep CPU side logic, to process smaller "taps".