CUDA is a parallel programming platform for NVIDIA GPUs. We generally turn to it when we need higher performance in terms of speed. To get that speedup, we need to know some terminology that affects speed, so that we can set up our program in the best way.
First, let us talk about how CUDA works.
CUDA code is organized around kernels. When we launch a kernel, we need to define the number of threads per block and the number of blocks (the thread size and block size). A CUDA device has important properties, such as its streaming multiprocessors (SMs). Different GPU devices have different numbers of SMs. "Streaming multiprocessor", "multiprocessor" and "SM" are the same thing. When we search the web, different pages may use any of these terms. This may confuse us, but they all mean the same thing. (A "streaming processor", or SP, is something different: it is a single CUDA core inside an SM, as described further below.) Each CUDA device has a certain number of SMs, and each SM has a maximum number of blocks it can hold at once.
First we need to understand these two terms (the number of SMs and the maximum number of blocks per SM) so that we can define the thread size and block size properly to increase occupancy and thus speed. When the program runs, the blocks are assigned to the SMs according to their capacity (the maximum number of blocks per SM). The SM then "unpacks" each thread block into warps and schedules warp instructions on the SM's internal resources (e.g. "cores" or SPs, and special function units) as those resources become available. The more SMs a device has, the more blocks are scheduled in parallel, and thus the higher the speed.

The important thing to know is that there is also a limit on the maximum number of threads per SM. So even if the maximum number of blocks per SM is, let's say, 16, that does not mean all 16 can be scheduled on one SM. If we have defined 1024 threads per block, then (in cc 3.5 the maximum number of threads per SM is 2048) 2048/1024 = 2, so only 2 blocks are scheduled on each SM, out of the 16 allowed. On a device with 16 SMs, a total of 16*2 = 32 blocks are resident at a time across the 16 SMs, and if we have more than 32 blocks, the rest are scheduled for later execution. We need to choose the thread size of each block so that the warps available on each SM are properly utilized.
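The scheduling arithmetic above can be sketched as a few host-side C++ helpers. This is only a rough illustration: the struct SmLimits and all function names are hypothetical (they are not part of the CUDA API), the default limits are the cc 3.5 figures quoted in the text, and register and shared-memory limits, which can reduce the result further, are ignored. It assumes threadsPerBlock is at least 1.

```cpp
#include <algorithm>

// Hypothetical container for per-SM limits; the defaults are the cc 3.5
// figures used in the text (2048 threads, 16 blocks) plus the warp size.
struct SmLimits {
    int maxThreadsPerSm = 2048;
    int maxBlocksPerSm  = 16;
    int warpSize        = 32;
};

// Warps needed by one block (a partly filled warp still takes a full slot).
int warpsPerBlock(const SmLimits& lim, int threadsPerBlock) {
    return (threadsPerBlock + lim.warpSize - 1) / lim.warpSize;
}

// Blocks that fit on one SM at the same time.
// Assumption: only the thread and block limits matter here.
int residentBlocksPerSm(const SmLimits& lim, int threadsPerBlock) {
    int maxWarpsPerSm = lim.maxThreadsPerSm / lim.warpSize;
    int byThreads = maxWarpsPerSm / warpsPerBlock(lim, threadsPerBlock);
    return std::min(byThreads, lim.maxBlocksPerSm);
}

// Blocks resident on the whole device at once, for a given SM count.
int residentBlocksOnDevice(const SmLimits& lim, int threadsPerBlock, int smCount) {
    return residentBlocksPerSm(lim, threadsPerBlock) * smCount;
}

// Fraction of the SM's warp slots in use (1.0 means fully occupied).
double occupancy(const SmLimits& lim, int threadsPerBlock) {
    int maxWarpsPerSm = lim.maxThreadsPerSm / lim.warpSize;
    int usedWarps = residentBlocksPerSm(lim, threadsPerBlock)
                  * warpsPerBlock(lim, threadsPerBlock);
    return static_cast<double>(usedWarps) / maxWarpsPerSm;
}
```

With 1024 threads per block, residentBlocksPerSm gives 2, and residentBlocksOnDevice gives 32 for a device with 16 SMs, matching the numbers in the text. With very small blocks the other limit bites: at 64 threads per block the cap of 16 blocks per SM allows only 16*64 = 1024 threads, so occupancy comes out at 0.5.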
SP in the CUDA architecture stands for streaming processor, which is also known as a CUDA core. The instructions of each CUDA thread are executed on CUDA cores.
Significance of CUDA cores in CUDA
The number of cores per SM translates roughly to how many warp instructions can be processed in any given clock cycle. A single warp instruction can be processed in a given clock cycle, but it requires 32 cores to complete (and may require multiple clock cycles to complete, depending on the instruction). A cc 2.0 Fermi SM with 32 "cores" can retire at most 1 instruction per clock on average (it is actually 2 instructions every 2 clocks). A Kepler SMX with 192 cores can retire 4 or more instructions per clock.
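As a rough sketch of the "32 cores per warp instruction" rule of thumb, the ceiling implied by the core count alone can be computed as below. The function name is hypothetical, and the estimate deliberately ignores warp schedulers, dispatch limits, and the instruction mix, all of which affect the real rate.

```cpp
// Hypothetical helper: upper bound on warp instructions that one SM can
// complete per clock, judged by core count alone (one warp = 32 threads).
// Assumption: scheduler and dispatch limits are ignored, so real hardware
// may achieve less than this ceiling.
int maxWarpInstructionsPerClock(int coresPerSm, int warpSize = 32) {
    return coresPerSm / warpSize;
}
```

For the 32 cores of a Fermi SM this gives 1, and for the 192 cores of a Kepler SMX it gives 6, a ceiling that is consistent with the "4 or more" figure above.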
For more information: http://people.math.umass.edu/~johnston/M697S12/CUDA_threads_and_block_scheduling.pdf

 
