Saturday, June 7, 2014

first CUDA program for beginners

Before starting the code directly, we need to understand the basic terms and theoretical aspects of CUDA programming. The first thing we need is a device, popularly called a GPU, and for CUDA it must be an Nvidia GPU; cards from other vendors such as ATI are not supported.
Then we need a basic understanding of two terms:
1)      Host
2)      Device
These two terms are used frequently in the documentation and in resources found on the internet.
1)      Host refers to the CPU, where sequential execution of the program occurs.
2)      Device refers to the Nvidia GPU card.



Basic concepts of GPU programming
GPU programming is parallel programming: we create a large number of threads, and each thread executes the same code concurrently, which is how parallelism occurs. The block of code that each thread executes is called a kernel function.
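As a minimal illustration of this idea (the kernel name write_index and the 2-blocks-of-4-threads launch here are just placeholders for the sake of the example), a kernel in which every thread writes its own global index into an array might look like this:

```cuda
#include <stdio.h>
#include <cuda.h>

// Each thread computes its own global index and stores it,
// so element i of the output is written by thread number i.
__global__ void write_index(int *out, int n)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) out[idx] = idx;
}

int main(void)
{
  const int n = 8;
  int host[8], *dev;
  cudaMalloc((void **)&dev, n * sizeof(int));
  write_index<<<2, 4>>>(dev, n);   // 2 blocks of 4 threads = 8 threads in total
  cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
  for (int i = 0; i < n; i++) printf("%d ", host[i]);  // prints 0 1 2 3 4 5 6 7
  printf("\n");
  cudaFree(dev);
  return 0;
}
```

All 8 threads run this kernel at the same time, each one touching a different element of the array.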

Now let us learn how a CUDA program is written.
1)      First, variables for the host and the device are declared.
2)      Memory is allocated on the host side.
3)      Values are assigned in the host memory.
4)      Memory is allocated on the device.
5)      The values on the host side are copied to device memory, because the GPU can only access GPU memory.
6)      The parallel operations are then performed on the GPU using a kernel function.
7)      The final data are copied back to host memory, because in order to display the data we need to access it through CPU (host) memory.
8)      Finally, the host as well as the device memory is freed.
Basic CUDA program
Firstprogram.cu
#include <stdio.h>
#include <cuda.h>

// Kernel that executes on the CUDA device
__global__ void sum_array(float *a, int N)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx<N) a[idx] = a[idx] + a[idx];
}

// main routine that executes on the host
int main(void)
{
  float *array_h, *array_d;  // Pointers to the host & device arrays
  const int N = 10;  // Number of elements in arrays
  size_t size = N * sizeof(float);
  array_h = (float *)malloc(size);        // Allocate array on host
  cudaMalloc((void **) &array_d, size);   // Allocate array on device
  // Initialize host array and copy it to CUDA device
  for (int i=0; i<N; i++) array_h[i] = (float)i;
  cudaMemcpy(array_d, array_h, size, cudaMemcpyHostToDevice);
  sum_array<<<3, 4>>>(array_d, N);  // Kernel call: 3 blocks of 4 threads
  // Copy the result back from device memory into the host array
  cudaMemcpy(array_h, array_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
  // Print results
  for (int i=0; i<N; i++) printf("%d %f\n", i, array_h[i]);
  // Cleanup
  free(array_h); cudaFree(array_d);
  return 0;
}


Here __global__ void sum_array(float *a, int N) is the kernel function. This kernel function runs simultaneously on all the threads, which is what produces the parallelism. A kernel is invoked with the triple angle bracket syntax <<< >>>, which marks a call to a kernel function, and the arguments inside it are <<< number of blocks, number of threads per block >>>. In this example, <<<3, 4>>> launches 3 blocks of 4 threads each, 12 threads in total, which is enough to cover the 10 array elements; the if (idx<N) check in the kernel stops the 2 extra threads from writing past the end of the array.

