Monday, October 13, 2014

how to solve "Microsoft Visual Studio 2005 has stopped working" in Windows 64 bit

Problem: Microsoft Visual Studio 2005 has stopped working


Everything was fine earlier, but later Visual Studio started giving this error whenever I opened any project. Even launching the Visual Studio application itself gave me the same error, which indicated an APPCRASH of devenv.exe.
I went through internet resources, which gave me several solutions such as running devenv /safemode or devenv /resetsettings, but nothing was as fruitful as indicated.

Solution

Finally I came across internet resources which explained that the above problem is a kind of crash that occurs on 64-bit operating systems, as VS2005 was not originally designed to run on such an OS. Thus I was recommended to download Service Pack 1 (SP1) from here and re-run the application. I downloaded SP1 from here and installed it.

Voila! Problem solved. Now I am able to run the application without errors.

Thursday, September 11, 2014

how cuda code works, or how cuda code is scheduled on the GPU


CUDA is a parallel programming platform. It is generally used when we are in need of higher performance in terms of speed. When we need a speed increase, we first need to know some terminology that affects speed, so that we can tune things in the best possible way.

First, let us talk about how CUDA works.
CUDA code is organized into kernels. We need to define the thread size (threads per block) and the block size (number of blocks). A CUDA device has an important property called the streaming multiprocessor (SM); different GPU devices have different numbers of SMs. When we search the web, different pages may call it a streaming multiprocessor, a multiprocessor, or simply an SM; this may confuse us, but they are the same thing. Each CUDA device has a number of SMs, and each SM can hold at most a certain number of resident blocks.

First we need knowledge of these two terms so that we can define the thread size and block size properly, increasing occupancy and thus speed. When the program runs, blocks are assigned to SMs according to each SM's capacity (the maximum number of resident blocks per SM). The SM then "unpacks" each thread block into warps, and schedules warp instructions on the SM's internal resources (e.g., cores/SPs and special function units) as those resources become available. With a higher number of SMs, a larger number of blocks are scheduled in parallel, which increases speed. The important thing to know is that there is also a limit on the maximum number of threads per SM, so even if the block limit is, say, 16, that does not mean all 16 blocks can be scheduled on one SM. If we have defined 1024 threads per block, then (on cc 3.5, where the maximum is 2048 threads per SM) 2048/1024 = 2, so only 2 blocks are scheduled per SM. Out of 16, only 2 blocks go to each SM; in total, 16 SMs * 2 blocks = 32 blocks are scheduled at a time, and if we have more than 32 blocks, the rest are scheduled for later execution. We need to choose the thread size of each block such that the total number of warps per SM is properly utilized.

An SP in the CUDA architecture is called a streaming processor, also known as a CUDA core. Each thread in CUDA is scheduled onto a CUDA core.

Significance of CUDA cores
The number of cores per SM translates roughly to how many warp instructions can be processed in any given clock cycle. A single warp instruction can be issued in a given clock cycle but requires 32 cores to complete (and may require multiple clock cycles to complete, depending on the instruction). A cc 2.0 Fermi SM with 32 "cores" can retire at most 1 instruction per clock on average (it's actually 2 instructions every 2 clocks). A Kepler SMX, having 192 cores, can retire 4 or more instructions per clock.

For more information: http://people.math.umass.edu/~johnston/M697S12/CUDA_threads_and_block_scheduling.pdf

Thursday, September 4, 2014

Sorting local variables, or using thrust inside kernels in cuda

Thrust is a library for CUDA that allows us to perform several operations in an easy way, i.e., we just call the function without taking much care of the details. Sorting is one of the several operations Thrust performs in CUDA.

Generally, Thrust code is host code. We need to call the Thrust functions from the host, and whenever we try to use them inside a kernel, the compiler complains that host code cannot be called in a device function. However, we may come across situations where we need to sort numbers within a kernel. For example, we may have a local array within a kernel that needs to be sorted before performing an operation on it.

If we are in need of such an operation, it is now possible in CUDA with Thrust, but we need an appropriate Thrust version. In Thrust v1.7 this feature is not supported; we need Thrust v1.8. Further, if we are programming on the Windows platform with Visual Studio, beware that VS 2005 does not support it. Visual Studio 2010 works well in my case; I have not tested other, newer versions.


thrust::sort can be combined with the thrust::seq execution policy to sort numbers sequentially within a single CUDA thread (or sequentially within a single CPU thread), and
#include <thrust/execution_policy.h> needs to be added in the headers.

Here below is the complete code to sort a local array, with CUDA 5.0 and Thrust v1.8:


#include <stdio.h>
#include <iostream>

#include <cuda.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <thrust/sort.h>
#include <thrust/binary_search.h>
#include <thrust/device_ptr.h>
#include <thrust/execution_policy.h>

// Kernel: copies the input into a local array, sorts it sequentially
// within this single thread using thrust::seq, and copies it back.
__global__ void sort_array(int *a, int N)
{
  int td[10];
  for (int i = 0; i < N; i++)
  {
    td[i] = a[i];
  }
  thrust::device_ptr<int> t_a(td);
  thrust::sort(thrust::seq, t_a, t_a + N);
  for (int i = 0; i < N; i++)
  {
    a[i] = td[i];
  }
}

// main routine that executes on the host
int main(void)
{
  int *a_h, *a_d;    // Pointer to host & device arrays
  const int N = 10;  // Number of elements in arrays
  size_t size = N * sizeof(int);
  a_h = (int *)malloc(size);        // Allocate array on host
  cudaMalloc((void **) &a_d, size); // Allocate array on device
  std::cout << "enter the 10 numbers";
  // Initialize host array and copy it to CUDA device
  for (int i = 0; i < N; i++)
  {
      std::cin >> a_h[i];
  }
  for (int i = 0; i < N; i++) printf("%d %d\n", i, a_h[i]);
  cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
  sort_array <<< 1,1 >>> (a_d, N); // single thread sorts sequentially
  // Copy the sorted result back to the host
  cudaMemcpy(a_h, a_d, sizeof(int)*N, cudaMemcpyDeviceToHost);
  // Print results
  printf("sorted value\n");
  for (int i = 0; i < N; i++) printf("%d %d\n", i, a_h[i]);
  // Cleanup
  free(a_h); cudaFree(a_d);
  return 0;
}

Output: the program echoes the ten numbers as entered and then prints them in sorted order.



Wednesday, September 3, 2014

how to start cuda programming in windows with visual studio

CUDA is parallel programming which supports C/C++ or even Python. If we are trying to write a CUDA program in C/C++ on the Windows platform, then Visual Studio is the best IDE to start CUDA programming. Further, Visual Studio 2005 is the best version that supports CUDA programming.

Now, first of all, let's talk about the installation stuff we need to run a CUDA program:
  •      Nvidia display driver
  •      CUDA toolkit (different versions are available)
  •      CUDA VS Wizard (integrates the CUDA options into Visual Studio for project creation)



After we install Visual Studio, we need to install CUDA Toolkit 5 or another version.

  •      Download the CUDA toolkit from here.
  •      Install the CUDA toolkit; after installation, the CUDA sample browser is installed with it.
  •      Then install CUDA VS Wizard, which enables the CUDA option during creation of a project.



Steps of creating a CUDA project in Visual Studio
  1.  Open Visual Studio.
  2.  Click File->New->Project. The following form with the CUDA option appears.
  3.  Click CUDA and name the project.
  4.  After creation of the project, the following screen is seen, with Headers and Source folders created at the left-hand side of the project.
  5.  Right click on Source->Add->New Item. The following options will appear.
  6.  Click on CUDA and name the file cuda.cu.
  7.  Then write the CUDA program in it.
  8.  Go to the main project->right click->Custom Build Rules.
  9.  Check the CUDA Runtime API Build Rule at the last.
  10. Go to the main project->right click->Properties->Linker->Input, and in Additional Dependencies add cudart.lib.
  11. Go to Tools->Options->Projects and Solutions->VC++ Directories and give the path to Include.
  12. Then go to the dropdown, select Library Files, and set the path to win32.
  13. Then finally copy and paste the simple hello world code from here, build the program, and run. Happy Coding!




Saturday, June 7, 2014

Extracting the content of a webpage using BeautifulSoup and Mechanize in python

There arise several situations where we need to extract the content of a page and display it in our application. In such cases we can use the BeautifulSoup and Mechanize Python packages.

The Python code to extract the price of gold and silver from the website http://www.fenegosida.org/ is shown here:


"""Read gold price from http://www.fenegosida.org/

<h1> tags contain the prices. These <h1> elements reside in divs with ids of the form IDS + "-content"

Sample data
{'tejabi-1tola': u'53450', 'hallmark-1tola': u'53700', 'hallmark-10gms': u'46040', 'silver-1tola': u'860', 'tejabi-10gms': u'45825', 'silver-10gms': u'737.50'}
"""

URL = "http://www.fenegosida.org/"
IDS=["hallmark","tejabi","silver"]

import sys
from BeautifulSoup import BeautifulSoup
from mechanize import Browser

if len(sys.argv) > 1 and sys.argv[1] == "-sample":
    print "{'tejabi-1tola_new': u'53450', 'hallmark-1tola': u'53700', 'hallmark-10gms': u'46040', 'silver-1tola': u'860', 'tejabi-10gms': u'45825', 'silver-10gms': u'737.50'}"
    sys.exit(0)

br = Browser()
br.addheaders = [
    ('user-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.3) Gecko/20100423 Ubuntu/10.04 (lucid) Firefox/3.6.3',),
    ('accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',),
    ]
page = br.open(URL)
soup = BeautifulSoup(page.read())
price = {}

for id in IDS:
    hallmark = soup.findAll('div',{'id':"{0}-content".format(id)})
    
    a = hallmark[0].h1.text
    b = hallmark[1].h1.text

    if float(a) > float(b):
        a, b = b, a

    price['{0}-10gms'.format(id)] = a
    price['{0}-1tola'.format(id)] = b

print price

sys.exit(0)

first cuda program for beginners

Before starting the code directly, we need to understand the basic terms and theoretical aspects of CUDA programming. The first thing is that we need a device, popularly called a GPU, for CUDA programming, and the GPU must be an Nvidia GPU, not one from any other vendor such as ATI.
Then we need to have a basic concept of two terms:
1)      Host
2)      Device
These two terms are frequently used in the documentation and in resources found on the internet.
1)      Host refers to the CPU, where sequential execution of programs occurs.
2)      Device refers to the Nvidia GPU card.



Basic concepts of GPU programming
GPU programming is parallel programming where we create a large number of threads, and each thread executes the program concurrently at the same time; thus parallelism occurs. The block of code that is executed by each thread is called a kernel function.

Now let us learn how the CUDA code is written.
1)      First, variables for the host and device are initialized.
2)      Memory is allocated on the host side.
3)      Values are assigned in the memory on the host side.
4)      Memory is allocated on the device.
5)      Values on the host side are copied to device memory, as the GPU can only access GPU memory.
6)      Then the parallel operations are done on the GPU using a kernel function.
7)      Then the final data are copied back to host memory, since in order to display the data we need it in CPU (host) memory.
8)      Finally, free host as well as device memory.
Basic cuda program
Firstprogram.cu
#include <stdio.h>
#include <cuda.h>

// Kernel that executes on the CUDA device
__global__ void sum_array(float *a, int N)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx<N) a[idx] = a[idx] + a[idx];
}

// main routine that executes on the host
int main(void)
{
  float *array_h, *array_d;  // Pointer to host & device arrays
  const int N = 10;  // Number of elements in arrays
  size_t size = N * sizeof(float);
  array_h = (float *)malloc(size);        // Allocate array on host
  cudaMalloc((void **) &array_d, size);   // Allocate array on device
  // Initialize host array and copy it to CUDA device
  for (int i=0; i<N; i++) array_h[i] = (float)i;
  cudaMemcpy(array_d, array_h, size, cudaMemcpyHostToDevice);
  sum_array <<< 3,4 >>> (array_d, N); // kernel call: 3 blocks of 4 threads
  // save the result from device in the device memory and copy it in host array
  cudaMemcpy(array_h, array_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
  // Print results
  for (int i=0; i<N; i++) printf("%d %f\n", i, array_h[i]);
  // Cleanup
  free(array_h); cudaFree(array_d);
}


Here __global__ void sum_array(float *a, int N) is a kernel function. The kernel runs simultaneously for all threads, which causes parallelism. In the kernel call, the <<< >>> syntax indicates a call to a kernel function, and its arguments are <<< number of blocks, number of threads per block >>>.


Monday, June 2, 2014

Source code of Newton's forward and backward interpolation using C

newton's forward and backward interpolation

#include<stdio.h>
#include<math.h>
int main()

{

    float x[10],y[15][15];
    int n,i,j;
//no. of items
    printf("Enter n : ");
    scanf("%d",&n);
    printf("X\tY\n");
    printf("enter the value of x then y\n");
    for(i = 0;i<n;i++){
            scanf("%f %f",&x[i],&y[i][0]);

    }


    printf("the entered value are\n");

     printf("x\t\ty\n");
     for(i = 0;i<n;i++){
            printf("\n");
            printf("%f\t%f",x[i],y[i][0]);

    }


    //forward difference table

    for(j=1;j<n;j++)
        for(i=0;i<(n-j);i++)
            y[i][j] = y[i+1][j-1] - y[i][j-1];
    printf("\n***********Forward Difference Table ***********\n");
//here is the  Forward Difference Table
    for(i=0;i<n;i++)
    {
        printf("\t%.2f",x[i]);
        for(j=0;j<(n-i);j++)
            printf("\t%.2f",y[i][j]);
        printf("\n");
    }
    // here is the backward difference table
    for(j=1;j<n;j++)
//for j = 0 initially input is taken so we start from j=1
        for(i=n-1;i>(j-1);i--)
            y[i][j] = y[i][j-1] - y[i-1][j-1];
    printf("\n***********Backward Difference Table ***********\n");
//here is the Backward Difference Table
    for(i=0;i<n;i++)
    {
        printf("\t%.2f",x[i]);
        for(j=0;j<=i;j++)
            printf("\t%.2f",y[i][j]);
        printf("\n");
    }
return 0;

}