Here is code sample "Before" cuda and "After" cuda.
Sample is just an Outer / Inner Loop. I think if you are programmer, you saw and wrote many.

    // Calculate Inverse DFT for desired Frequency Responces.
    // Result is Filter Coeffs (not windowed yet)
    int i,k;
    float omega;
    float otemp;
    clock_t time3 = clock();
    for (i = 0; i < Taps; i++)
    {
        Tempcoeff_r[i] = 0.0;    //Initialize Coeffs to zero
        otemp = (float)(2.0 * PI * i / Taps);

        for (k = 0; k < Taps; k++)
        {
            //Calculate Coeffs by sum up 0 to Tap Count
            //omega = (float)(2.0 * PI * k * i / Taps);
            omega = (float)(otemp * k);
            Tempcoeff_r[i] = Tempcoeff_r[i] + FreqResR[k] * cos(omega);
        }
        if (((i+1)%1024)==0) printf("calculating coeff, count = %10d\r", i);
    }
    clock_t time4 = clock();
    printf("\nCPU  processing time = %d sec\n", (time4-time3)/1000);



This is a piece of FIR parameter calculation. There are Loop Count "Taps" exists both Outer, Inner Loop, O(N^2).   when I want long TAPs FIR parameter, execution time will go pretty longer.

This Outer / Inner Loop is CUDAble like below.

#define THREAD_NUM  512
#define CUDABLOCKS 16

__global__ void Kernel_CalcCoeffs(float *g_Tempcoeff_r, const float *g_FreqResR, const int Taps, const int currentblock)
{
    // access Block ID
    const unsigned int bix = blockIdx.x;
    // access thread id
    const unsigned int tid = threadIdx.x;

    float tempcoeff;
    float omega;

    //Only Inner Loop exists here. index of Outer Loop, i is calculated as..
    // (Current Outer Loop Index) + (Current Block, this thread exists) + Thread ID.
    int i = THREAD_NUM*
CUDABLOCKS*currentblock + THREAD_NUM*bix + tid;

    tempcoeff = 0.0;
    for (int k = 0; k < Taps; k++)
    {
        omega = 2.0 * PI * k * i / Taps;
        tempcoeff = tempcoeff + g_FreqResR[k] * cos(omega);
    }
    __syncthreads();

    g_Tempcoeff_r[i] = tempcoeff;

}

///
/// Other codes...
///
    //CUDA Section
    //I want to order 16 blocks * 512 Threads execution to CUDA.
    dim3  grid( CUDABLOCKS, 1, 1);
    dim3  threads( THREAD_NUM, 1, 1);

    //allocate GPU device memory, in Graphic Board.
    float * g_Tempcoeff_r;
    CUDA_SAFE_CALL(cudaMalloc((void**) &g_Tempcoeff_r, sizeof(float)*Taps) );
    float * g_FreqResR;
    CUDA_SAFE_CALL(cudaMalloc((void**) &g_FreqResR, sizeof(float)*Taps) );

    clock_t time1 = clock();

    //Copy Host Array to Device memory.
    CUDA_SAFE_CALL( cudaMemcpy(g_FreqResR, FreqResR, sizeof(float)*Taps, cudaMemcpyHostToDevice));
    CUDA_SAFE_CALL( cudaMemcpy(g_Tempcoeff_r, Tempcoeff_r, sizeof(float)*Taps, cudaMemcpyHostToDevice));

    //Outer Loop is reduced, to 1/(Threads*Blocks).
    for(int j = 0; j < Taps / (THREAD_NUM * CUDABLOCKS); j++)
    {
        //Inner Loop is executed in GPU, parallel.
        Kernel_CalcCoeffs<<<grid,threads>>>(g_Tempcoeff_r, g_FreqResR, Taps, j);
        CUDA_SAFE_CALL( cudaThreadSynchronize() );
        //transfer result to Host Memory.
        CUDA_SAFE_CALL( cudaMemcpy( (void *)(Tempcoeff_r + j* (THREAD_NUM * CUDABLOCKS)), g_Tempcoeff_r+j* (THREAD_NUM * CUDABLOCKS), sizeof(float)* (THREAD_NUM * CUDABLOCKS), cudaMemcpyDeviceToHost) );
        printf("calculating coeff, count = %10d / %d \r", (j+1)*(THREAD_NUM * CUDABLOCKS), Taps);
    }
    clock_t time2 = clock();
    cudaFree(g_Tempcoeff_r);
    cudaFree(g_FreqResR);
    printf("\nCUDA processing time = %d sec\n", (time2-time1)/1000);


Inner Loop is executed in GPU,  runs parallel. I specified 512 Threads * 16 Blocks, so total 8192 threads are issued at one time. Outer Loop Count "Jumps" 8192, or reduced 1/8192.
NOTE: Too Large thread*block cause display driver timeout. 128 Threads * 16 Blocks is more safe.

Here is the result.

Creating Frequency Response.
Calculating Coeffs.
calculating coeff, count =     131071
CPU  processing time = 1172 sec
calculating coeff, count =     131072 / 131072
CUDA processing time = 21 sec


CUDA achieves  x 56 times faster than CPU. It can be more faster by some tuning.
for example, if you use __cosf(x) function, execution time is 14 sec.
NOTE: __cosf(x) is tuned version, little bit less accuracy.
if you "#pragma unroll 16", execution time is 5 sec.

(CPU: AMD Phenom9600, GPU: GTX280)
Additional 40 lines of code reduced 1,150 seconds (almost 20 minutes) of execution time.

I should keep CPU side logic, to process smaller "taps".