Thursday, September 4, 2014

Sorting local variables, or using Thrust inside kernels, in CUDA

Thrust is a library for CUDA that makes many common operations easy to perform: we simply call the function and let the library handle the details. Sorting is one of the operations Thrust provides.

Thrust code is generally host code. We call Thrust functions from the host, and when we try to use them inside a kernel the compiler complains that host code cannot be called from a device function. However, we may come across situations where we need to sort numbers within a kernel. For example, a kernel may hold a local array that needs to be sorted before further operations are performed on it.

Such an operation is now possible in CUDA with Thrust, provided we have an appropriate Thrust version. Thrust v1.7 does not support this feature; we need Thrust v1.8. Further, if we are programming on Windows with Visual Studio, beware that VS 2005 does not support it. Visual Studio 2010 works well in my case. I have not tested other, newer versions.
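As a rough sketch of how the version requirement could be checked at compile time, the snippet below uses the macros from `<thrust/version.h>`. The helper name `print_thrust_version` is made up for illustration, and the threshold `100800` assumes the usual encoding of `THRUST_VERSION` (major * 100000 + minor * 100 + subminor), so that v1.8.0 corresponds to 100800.

```cuda
#include <cstdio>
#include <thrust/version.h>

// Stop the build early if the installed Thrust predates v1.8
// (assumption: 100800 encodes v1.8.0, see the note above).
#if THRUST_VERSION < 100800
#error "Thrust v1.8 or newer is needed to call thrust::sort inside a kernel"
#endif

// Hypothetical helper: prints the Thrust version this file was compiled against.
inline void print_thrust_version()
{
    printf("Thrust v%d.%d.%d\n",
           (int)THRUST_MAJOR_VERSION,
           (int)THRUST_MINOR_VERSION,
           (int)THRUST_SUBMINOR_VERSION);
}
```

Calling `print_thrust_version()` at the start of `main` is a quick way to confirm which Thrust headers the compiler actually picked up.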


thrust::sort can be combined with the thrust::seq execution policy to sort numbers sequentially within a single CUDA thread (or sequentially within a single CPU thread). For this,
#include <thrust/execution_policy.h> needs to be added to the headers.
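To illustrate the pattern with more than one thread, here is a minimal sketch in which every thread copies one short row into a thread-local array and sorts it with thrust::seq. The names `ROW_LEN`, `sort_rows` and `sort_rows_on_device` are hypothetical, the row length is fixed at compile time, and the helper assumes the input size is a multiple of `ROW_LEN`. Error checking is left out for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>

#define ROW_LEN 8  // hypothetical fixed number of ints per row

// Each thread sorts one row of ROW_LEN ints inside its own local array.
__global__ void sort_rows(int *data, int num_rows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= num_rows) return;

    int local[ROW_LEN];
    for (int i = 0; i < ROW_LEN; ++i)
        local[i] = data[row * ROW_LEN + i];

    // thrust::seq makes the sort run sequentially in the calling thread.
    thrust::sort(thrust::seq, local, local + ROW_LEN);

    for (int i = 0; i < ROW_LEN; ++i)
        data[row * ROW_LEN + i] = local[i];
}

// Hypothetical host helper: copies the rows to the device, sorts each row
// with one thread, and returns the result.
// Assumes h.size() is a multiple of ROW_LEN.
std::vector<int> sort_rows_on_device(std::vector<int> h)
{
    int num_rows = (int)(h.size() / ROW_LEN);
    if (num_rows == 0) return h;

    size_t bytes = h.size() * sizeof(int);
    int *d = 0;
    cudaMalloc((void **)&d, bytes);
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (num_rows + threads - 1) / threads;
    sort_rows<<<blocks, threads>>>(d, num_rows);

    // cudaMemcpy waits for the kernel to finish before copying back.
    cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h;
}
```

Each row is sorted independently of the others, so this only suits small per-thread arrays; sorting one large array is still better done with a host-side thrust::sort call.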

Below is the complete code to sort a local array with CUDA 5.0 and Thrust v1.8.


#include <stdio.h>
#include <stdlib.h>
#include <iostream>

#include <cuda.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <thrust/sort.h>
#include <thrust/device_ptr.h>
#include <thrust/execution_policy.h>

// Kernel: a single thread copies the input into a local array,
// sorts it sequentially with thrust::seq, and writes it back.
__global__ void sort_array(int *a, int N)
{
  int td[10];  // local (per-thread) array, so N must not exceed 10
  for (int i = 0; i < N; i++)
  {
    td[i] = a[i];
  }
  // td is a local array, so plain pointers are passed to thrust::sort
  thrust::sort(thrust::seq, td, td + N);
  for (int i = 0; i < N; i++)
  {
    a[i] = td[i];
  }
}

// main routine that executes on the host
int main(void)
{
  int *a_h, *a_d;  // Pointer to host & device arrays
  const int N = 10;  // Number of elements in arrays
  size_t size = N * sizeof(int);
  a_h = (int *)malloc(size);        // Allocate array on host
  cudaMalloc((void **) &a_d, size); // Allocate array on device
  std::cout << "enter the 10 numbers";
  // Initialize host array and copy it to CUDA device
  for (int i = 0; i < N; i++)
  {
    std::cin >> a_h[i];
  }
  for (int i = 0; i < N; i++) printf("%d %d\n", i, a_h[i]);
  cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
  // Do calculation on device: one block, one thread
  sort_array <<< 1, 1 >>> (a_d, N);
  /* Host-side alternative that sorts the device array directly:
  thrust::device_ptr<int> t_a(a_d);
  thrust::sort(t_a, t_a + N); */
  // Copy the sorted values back to the host
  cudaMemcpy(a_h, a_d, sizeof(int) * N, cudaMemcpyDeviceToHost);
  // Print results
  printf("sorted value\n");
  for (int i = 0; i < N; i++) printf("%d %d\n", i, a_h[i]);
  // Cleanup
  free(a_h); cudaFree(a_d);
  return 0;
}


output


