Before
starting with the code directly, we need to understand the basic terms and
theoretical aspects of CUDA programming. The first thing we need is a device,
popularly called a GPU, and for CUDA programming it must be an Nvidia GPU, not
one from any other vendor such as ATI.
Then we need a basic understanding of two terms:
1) Host
2) Device
These two terms are used frequently in the documentation and in the resources found on
the internet.
1) Host refers to the CPU, where sequential execution of programs occurs.
2) Device refers to the Nvidia GPU card.
Basic concepts of GPU programming
GPU programming is parallel programming: we create a large number of
threads, and each thread executes the program concurrently at the same time,
which gives us parallelism. The block of code that is executed by each thread is
called a kernel function.
Now let us learn how CUDA code is written.
1) First, variables for the host and the device are declared.
2) Memory is allocated on the host side.
3) Values are assigned in the memory on the host side.
4) Memory is allocated on the device.
5) Values on the host side are copied to device memory, since the GPU can only access GPU memory.
6) The parallel operations are then performed on the GPU using the kernel function.
7) The final data is copied back to host memory, since in order to display the data we need to access it through CPU (host) memory.
8) Finally, the host as well as the device memory is freed.
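The eight steps above can be sketched in plain C, with a second malloc'd buffer and memcpy standing in for device memory and cudaMemcpy. This is only an illustration of the data flow, not real CUDA code; the function name double_on_fake_device is made up for this sketch.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Plain-C sketch of the eight steps.  A second malloc'd buffer and
 * memcpy stand in for device memory and cudaMemcpy; this shows only
 * the data flow, not real CUDA code. */
void double_on_fake_device(float *host_a, int N)
{
    size_t size = N * sizeof(float);
    float *dev_a = (float *)malloc(size);   /* 4) "device" allocation      */
    memcpy(dev_a, host_a, size);            /* 5) host -> device copy      */
    for (int i = 0; i < N; i++)             /* 6) the "kernel": one loop   */
        dev_a[i] = dev_a[i] + dev_a[i];     /*    iteration per thread     */
    memcpy(host_a, dev_a, size);            /* 7) device -> host copy      */
    free(dev_a);                            /* 8) free "device" memory     */
}

int main(void)
{
    const int N = 10;
    float *host_a = (float *)malloc(N * sizeof(float)); /* 1-2) host alloc */
    for (int i = 0; i < N; i++)
        host_a[i] = (float)i;               /* 3) assign values on host    */

    double_on_fake_device(host_a, N);

    for (int i = 0; i < N; i++)             /* results are in host memory  */
        printf("%d %f\n", i, host_a[i]);
    free(host_a);                           /* 8) free host memory         */
    return 0;
}

In the real CUDA version that follows, the malloc/memcpy pair for the fake device is replaced by cudaMalloc and cudaMemcpy, and the loop in step 6 becomes a kernel launch.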
Basic CUDA program
Firstprogram.cu
#include <stdio.h>
#include <cuda.h>

// Kernel that executes on the CUDA device
__global__ void sum_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] + a[idx];
}

// main routine that executes on the host
int main(void)
{
    float *array_h, *array_d;              // Pointers to host & device arrays
    const int N = 10;                      // Number of elements in the arrays
    size_t size = N * sizeof(float);

    array_h = (float *)malloc(size);       // Allocate array on host
    cudaMalloc((void **) &array_d, size);  // Allocate array on device

    // Initialize host array and copy it to the CUDA device
    for (int i = 0; i < N; i++)
        array_h[i] = (float)i;
    cudaMemcpy(array_d, array_h, size, cudaMemcpyHostToDevice);

    sum_array <<< 3, 4 >>> (array_d, N);   // Kernel call

    // Copy the result from device memory back into the host array
    cudaMemcpy(array_h, array_d, sizeof(float)*N, cudaMemcpyDeviceToHost);

    // Print results
    for (int i = 0; i < N; i++)
        printf("%d %f\n", i, array_h[i]);

    // Cleanup
    free(array_h);
    cudaFree(array_d);
}
Here __global__ void sum_array(float *a, int N) is a kernel function. This
kernel function runs simultaneously on all the threads, which gives us
parallelism. When calling a kernel function, the <<< >>> syntax indicates the
launch configuration: <<< number of blocks, number of threads per block >>>.
So <<< 3, 4 >>> launches 3 blocks of 4 threads each, i.e. 12 threads in total,
which is why the kernel checks idx < N before touching the array of 10 elements.
The program is compiled with Nvidia's nvcc compiler, e.g. nvcc Firstprogram.cu -o firstprogram.
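The index computation idx = blockIdx.x * blockDim.x + threadIdx.x can be checked without a GPU by looping over every (block, thread) pair on the host. The sketch below is a plain-C simulation of the <<< 3, 4 >>> launch, not CUDA code; the function name simulate_sum_array is made up for this illustration.

#include <stdio.h>

/* Host-side simulation of the kernel for a <<< 3, 4 >>> launch:
 * 3 blocks of 4 threads each, so 12 threads for N = 10 elements. */
void simulate_sum_array(float *a, int N, int gridDim_x, int blockDim_x)
{
    for (int blockIdx_x = 0; blockIdx_x < gridDim_x; blockIdx_x++)
        for (int threadIdx_x = 0; threadIdx_x < blockDim_x; threadIdx_x++) {
            /* Same global index the kernel computes */
            int idx = blockIdx_x * blockDim_x + threadIdx_x;
            if (idx < N)          /* threads 10 and 11 fall outside the array */
                a[idx] = a[idx] + a[idx];
        }
}

int main(void)
{
    const int N = 10;
    float a[10];
    for (int i = 0; i < N; i++)
        a[i] = (float)i;

    simulate_sum_array(a, N, 3, 4);   /* same configuration as <<< 3, 4 >>> */

    for (int i = 0; i < N; i++)
        printf("%d %f\n", i, a[i]);   /* each element is doubled: a[i] = 2*i */
    return 0;
}

The idx < N guard matters because 12 threads are launched for only 10 elements; without it, threads 10 and 11 would write past the end of the array.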