Thursday, September 11, 2014

How CUDA code works, or how CUDA code is scheduled on the GPU


CUDA is a parallel programming platform. We generally turn to it when we need higher performance in terms of speed. To get that speedup, we need to know some terminology that affects speed, so that we can tune our code in the best way.

First, let us talk about how CUDA works.
 CUDA code is organized into kernels, and we need to define the thread size (threads per block) and the block size (number of blocks). CUDA devices have an important hardware property: the streaming multiprocessor (SM). Different GPU devices have different numbers of SMs. "Streaming multiprocessor", "multiprocessor", and "SM" all refer to the same thing; different webpages may use any of these terms, which can be confusing, but they are identical. Each CUDA device has some number of SMs, and each SM can hold at most a fixed number of resident blocks.
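For instance, here is a minimal sketch of defining the thread size and block size and launching a kernel. The kernel add_one and all the sizes are made-up values for illustration, not code from this post:

#include <stdio.h>
#include <cuda_runtime.h>

// hypothetical kernel for illustration: adds 1 to every element
__global__ void add_one(int *a, int n)
{
  // each thread derives its global index from its block and thread ids
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n)        // guard threads that fall past the end of the array
    a[idx] += 1;
}

int main(void)
{
  const int n = 4096;
  int *d_a;
  cudaMalloc((void **)&d_a, n * sizeof(int));
  cudaMemset(d_a, 0, n * sizeof(int));

  int threadsPerBlock = 256;                                // thread size
  int blocks = (n + threadsPerBlock - 1) / threadsPerBlock; // block size (grid)
  add_one<<<blocks, threadsPerBlock>>>(d_a, n);
  cudaDeviceSynchronize();

  int first;
  cudaMemcpy(&first, d_a, sizeof(int), cudaMemcpyDeviceToHost);
  printf("first element after the kernel: %d\n", first);    // prints 1
  cudaFree(d_a);
  return 0;
}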

 First, we need to have knowledge of these two terms so that we can define the thread size and block size properly to increase occupancy, and thus speed. When the program runs, the blocks are assigned to the SMs as per each SM's capacity (the maximum number of resident blocks per SM). The SM then "unpacks" each thread block into warps, and schedules warp instructions on the SM's internal resources (e.g. the cores, or SPs, and the special function units) as those resources become available. With a higher number of SMs, a larger number of blocks are scheduled in parallel, which increases speed.

 The important thing to know is that there is also a limit on the maximum number of resident threads per SM. So even though the limit on resident blocks per SM may be, say, 16, that does not mean all 16 will be scheduled on an SM. If we have defined 1024 threads per block, then since on cc 3.5 the maximum available threads per SM is 2048, only 2048/1024 = 2 blocks are scheduled on each SM. On a device with 16 SMs, a total of 16*2 = 32 blocks are scheduled at a time across the 16 SMs, and if we have more than 32 blocks, the rest will be scheduled for later execution. We need to choose the thread size of each block such that the total number of warps per SM is properly utilized.
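As a rough sketch of this arithmetic, we can query the device properties at runtime and compute how many of our blocks fit on one SM by the thread limit alone (this ignores the other limits, such as registers, shared memory, and the per-SM block cap, which can also reduce occupancy):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);   // properties of device 0

  int threadsPerBlock = 1024;          // the thread size we chose
  // blocks per SM as limited by the resident-thread cap alone
  int blocksPerSM = prop.maxThreadsPerMultiProcessor / threadsPerBlock;
  // blocks resident on the whole device at one time
  int residentBlocks = blocksPerSM * prop.multiProcessorCount;

  printf("SMs: %d, max threads per SM: %d\n",
         prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor);
  printf("blocks of %d threads per SM: %d (device-wide: %d)\n",
         threadsPerBlock, blocksPerSM, residentBlocks);
  return 0;
}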

 SP in the CUDA architecture stands for streaming processor, also known as a CUDA core. Each thread in CUDA is scheduled onto a CUDA core.

 Significance of CUDA cores
The number of cores per SM translates roughly to how many warp instructions can be processed in any given clock cycle. A single warp instruction requires 32 cores, and may take multiple clock cycles to complete, depending on the instruction. A cc 2.0 Fermi SM with 32 "cores" can retire at most 1 instruction per clock on average (it is actually 2 instructions every 2 clocks). A Kepler SMX with 192 cores can retire 4 or more instructions per clock.

For more information: http://people.math.umass.edu/~johnston/M697S12/CUDA_threads_and_block_scheduling.pdf

Thursday, September 4, 2014

Sorting local variables, or using Thrust inside kernels in CUDA

Thrust is a library for CUDA that lets us perform several operations in an easy way, i.e. we just call the function without having to take much care of the details. Sorting is one of the several operations Thrust performs in CUDA.
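For example, a host-side sort with Thrust takes only a couple of lines; here is a minimal sketch:

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <iostream>

int main(void)
{
  int h[5] = {4, 2, 5, 1, 3};
  thrust::device_vector<int> d(h, h + 5);  // copy the data to the device
  thrust::sort(d.begin(), d.end());        // parallel sort on the GPU
  for (int i = 0; i < 5; i++)
    std::cout << d[i] << " ";              // prints 1 2 3 4 5
  std::cout << std::endl;
  return 0;
}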

Generally, Thrust code is host code. We call Thrust functions from the host, and whenever we try to use them inside a kernel, the compiler complains that host code cannot be called from a device function. However, we may come across situations where we need to sort numbers within a kernel; for example, we may have a local array within the kernel that needs to be sorted before we operate on it.

If we need such an operation, it is now possible in CUDA with Thrust, but we need an appropriate Thrust version. Thrust v1.7 does not support this feature; we need Thrust v1.8. Furthermore, if we are programming on the Windows platform with Visual Studio, beware that VS 2005 does not support it. Visual Studio 2010 works well in my case; I have not tested newer versions.
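If we are unsure which Thrust version ships with our toolkit, the version macros from thrust/version.h can tell us; a quick sketch:

#include <thrust/version.h>
#include <stdio.h>

int main(void)
{
  // THRUST_MAJOR_VERSION and THRUST_MINOR_VERSION come from thrust/version.h
  printf("Thrust v%d.%d\n", THRUST_MAJOR_VERSION, THRUST_MINOR_VERSION);
  return 0;  // we need v1.8 or later to sort inside a kernel
}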


thrust::sort can be combined with the thrust::seq execution policy to sort numbers sequentially within a single CUDA thread (or sequentially within a single CPU thread), and

#include <thrust/execution_policy.h>

needs to be added to the headers.

Below is the complete code to sort a local array with CUDA 5.0 and Thrust v1.8.


#include <stdio.h>
#include <iostream>
#include <cuda.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <thrust/sort.h>
#include <thrust/binary_search.h>
#include <thrust/device_ptr.h>
#include <thrust/execution_policy.h>

// kernel: copy the input into a thread-local array, sort it
// sequentially with thrust::seq, and copy the result back
__global__ void sort_array(int *a, int N)
{
  int td[10];                 // local per-thread array (N must not exceed 10)
  for (int i = 0; i < N; i++)
  {
    td[i] = a[i];
  }
  thrust::sort(thrust::seq, td, td + N);  // sequential sort within this thread
  for (int i = 0; i < N; i++)
  {
    a[i] = td[i];
  }
}

// main routine that executes on the host
int main(void)
{
  int *a_h, *a_d;    // Pointers to host & device arrays
  const int N = 10;  // Number of elements in arrays
  size_t size = N * sizeof(int);
  a_h = (int *)malloc(size);        // Allocate array on host
  cudaMalloc((void **) &a_d, size); // Allocate array on device
  std::cout << "enter the 10 numbers\n";
  // Initialize host array and copy it to CUDA device
  for (int i = 0; i < N; i++)
  {
    std::cin >> a_h[i];
  }
  for (int i = 0; i < N; i++) printf("%d %d\n", i, a_h[i]);
  cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
  // Do calculation on device: a single thread sorts the whole array
  sort_array <<< 1, 1 >>> (a_d, N);
  cudaMemcpy(a_h, a_d, sizeof(int) * N, cudaMemcpyDeviceToHost);
  /* host-side alternative:
     thrust::device_ptr<int> t_a(a_d);
     thrust::sort(t_a, t_a + N); */
  // Print results
  printf("sorted value\n");
  for (int i = 0; i < N; i++) printf("%d %d\n", i, a_h[i]);
  // Cleanup
  free(a_h); cudaFree(a_d);
  return 0;
}

Output: [screenshot of the program's printed output]

Wednesday, September 3, 2014

How to start CUDA programming in Visual Studio on Windows

CUDA is a parallel programming platform that supports C/C++ and even Python. If we are trying to write a CUDA program in C/C++ on the Windows platform, then Visual Studio is the best IDE to start CUDA programming with. Furthermore, Visual Studio 2005 is the version that supports CUDA programming best.

Now, first of all, let's talk about the things we need to install to run a CUDA program:
  • Nvidia display driver
  • CUDA toolkit (different versions are available)
  • CUDA VS Wizard (integrates the CUDA options into Visual Studio for project creation)



After we install Visual Studio, we need to install the CUDA toolkit 5 (or another version).

  • Download the CUDA toolkit from here.
  • Install the CUDA toolkit; after installation, the CUDA sample browser is installed with it.
  • Then install the CUDA VS Wizard, which enables the CUDA option during project creation.



Steps for creating a CUDA project in Visual Studio
  1. Open Visual Studio.
  2. Click File->New->Project. The following form with the CUDA option appears: [screenshot]
  3. Click CUDA and name the project.
  4. After creation of the project, the following screen is seen, with Headers and Source folders created at the left-hand side of the project: [screenshot]
  5. Right click on Source->Add->New Item. The following options appear: [screenshot]
  6. Click on CUDA and name the file cuda.cu.
  7. Then write the CUDA program in it.
  8. Go to the main project->right click->Custom Build Rules: [screenshot]
  9. Check the CUDA Runtime API Build Rule at the end of the list: [screenshot]
  10. Go to the main project->right click->Properties->Linker->Input, and in Additional Dependencies add cudart.lib: [screenshot]
  11. Go to Tools->Options->Projects and Solutions->VC++ Directories and give the path to the include folder: [screenshot]
  12. In the dropdown select Library files and set the path to the win32 lib folder: [screenshot]
  13. Then finally copy and paste the simple hello world code from here, build the program, and run. Happy Coding!
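For reference, a minimal hello-world kernel along the following lines should build and run in such a project. This is a sketch, assuming a compute capability 2.0 or newer card for device-side printf, and is not necessarily the exact code behind the link:

#include <stdio.h>
#include <cuda_runtime.h>

// trivial kernel: every thread prints its block and thread ids
__global__ void hello(void)
{
  printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void)
{
  hello<<<2, 4>>>();        // 2 blocks of 4 threads each
  cudaDeviceSynchronize();  // wait so the device printf output is flushed
  return 0;
}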