Before
starting with the code directly, we need to understand the basic terms and
theoretical aspects of CUDA programming. The first thing we need is a device,
popularly called a GPU, and for CUDA programming it must be an Nvidia GPU, not
one from any other vendor such as ATI.
Then we need a basic understanding of two terms:
1) Host
2) Device
These two terms are used frequently in the documentation and in the resources found on
the internet.
1) Host refers to the CPU, where sequential execution of programs occurs.
2) Device refers to the Nvidia GPU card.
Basic concepts of GPU programming
GPU programming is parallel programming: we create a large number of
threads, and each thread executes the program concurrently at the same time,
which gives us parallelism. The block of code that is executed by each thread is
called a kernel function.
Now let us learn how CUDA code is written.
1) First, variables for the host and the device are declared.
2) Memory is allocated on the host side.
3) Values are assigned in the memory on the host side.
4) Memory is allocated on the device.
5) Values on the host side are copied to device memory, since the GPU can only access GPU memory.
6) The parallel operations are then performed on the GPU using the kernel function.
7) The final data is copied back to host memory, since in order to display the data we need to access it through CPU (host) memory.
8) Finally, the host as well as the device memory is freed.
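The eight steps above can be sketched in plain C, with a second malloc'd buffer and memcpy standing in for device memory and cudaMemcpy. This is only an illustration of the data flow, not real CUDA code; the function name double_on_fake_device is made up for this sketch.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Plain-C sketch of the eight steps.  A second malloc'd buffer and
 * memcpy stand in for device memory and cudaMemcpy; this shows only
 * the data flow, not real CUDA code. */
void double_on_fake_device(float *host_a, int N)
{
    size_t size = N * sizeof(float);
    float *dev_a = (float *)malloc(size);   /* 4) "device" allocation      */
    memcpy(dev_a, host_a, size);            /* 5) host -> device copy      */
    for (int i = 0; i < N; i++)             /* 6) the "kernel": one loop   */
        dev_a[i] = dev_a[i] + dev_a[i];     /*    iteration per thread     */
    memcpy(host_a, dev_a, size);            /* 7) device -> host copy      */
    free(dev_a);                            /* 8) free "device" memory     */
}

int main(void)
{
    const int N = 10;
    float *host_a = (float *)malloc(N * sizeof(float)); /* 1-2) host alloc */
    for (int i = 0; i < N; i++)
        host_a[i] = (float)i;               /* 3) assign values on host    */

    double_on_fake_device(host_a, N);

    for (int i = 0; i < N; i++)             /* results are in host memory  */
        printf("%d %f\n", i, host_a[i]);
    free(host_a);                           /* 8) free host memory         */
    return 0;
}

In the real CUDA version that follows, the malloc/memcpy pair for the fake device is replaced by cudaMalloc and cudaMemcpy, and the loop in step 6 becomes a kernel launch.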
Basic CUDA program
Firstprogram.cu
#include <stdio.h>
#include <cuda.h>

// Kernel that executes on the CUDA device
__global__ void sum_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] + a[idx];
}

// main routine that executes on the host
int main(void)
{
    float *array_h, *array_d;              // Pointers to host & device arrays
    const int N = 10;                      // Number of elements in the arrays
    size_t size = N * sizeof(float);

    array_h = (float *)malloc(size);       // Allocate array on host
    cudaMalloc((void **) &array_d, size);  // Allocate array on device

    // Initialize host array and copy it to the CUDA device
    for (int i = 0; i < N; i++)
        array_h[i] = (float)i;
    cudaMemcpy(array_d, array_h, size, cudaMemcpyHostToDevice);

    sum_array <<< 3, 4 >>> (array_d, N);   // Kernel call

    // Copy the result from device memory back into the host array
    cudaMemcpy(array_h, array_d, sizeof(float)*N, cudaMemcpyDeviceToHost);

    // Print results
    for (int i = 0; i < N; i++)
        printf("%d %f\n", i, array_h[i]);

    // Cleanup
    free(array_h);
    cudaFree(array_d);
}
Here __global__ void sum_array(float *a, int N) is a kernel function. This
kernel function runs simultaneously on all the threads, which gives us
parallelism. When calling a kernel function, the <<< >>> syntax indicates the
launch configuration: <<< number of blocks, number of threads per block >>>.
So <<< 3, 4 >>> launches 3 blocks of 4 threads each, i.e. 12 threads in total,
which is why the kernel checks idx < N before touching the array of 10 elements.
The program is compiled with Nvidia's nvcc compiler, e.g. nvcc Firstprogram.cu -o firstprogram.
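The index computation idx = blockIdx.x * blockDim.x + threadIdx.x can be checked without a GPU by looping over every (block, thread) pair on the host. The sketch below is a plain-C simulation of the <<< 3, 4 >>> launch, not CUDA code; the function name simulate_sum_array is made up for this illustration.

#include <stdio.h>

/* Host-side simulation of the kernel for a <<< 3, 4 >>> launch:
 * 3 blocks of 4 threads each, so 12 threads for N = 10 elements. */
void simulate_sum_array(float *a, int N, int gridDim_x, int blockDim_x)
{
    for (int blockIdx_x = 0; blockIdx_x < gridDim_x; blockIdx_x++)
        for (int threadIdx_x = 0; threadIdx_x < blockDim_x; threadIdx_x++) {
            /* Same global index the kernel computes */
            int idx = blockIdx_x * blockDim_x + threadIdx_x;
            if (idx < N)          /* threads 10 and 11 fall outside the array */
                a[idx] = a[idx] + a[idx];
        }
}

int main(void)
{
    const int N = 10;
    float a[10];
    for (int i = 0; i < N; i++)
        a[i] = (float)i;

    simulate_sum_array(a, N, 3, 4);   /* same configuration as <<< 3, 4 >>> */

    for (int i = 0; i < N; i++)
        printf("%d %f\n", i, a[i]);   /* each element is doubled: a[i] = 2*i */
    return 0;
}

The idx < N guard matters because 12 threads are launched for only 10 elements; without it, threads 10 and 11 would write past the end of the array.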