Thursday, 6 June 2013

CUDA 5.0 grid size problem

Interesting thing. I was convinced that the maximum grid size of my machine (GeForce GTX 660M) can be [65535, 65535, 65535]. I checked the device parameters using the cudaGetDeviceProperties function and I couldn't believe the results. My GPU's limitation is actually _way_ higher than I expected and according to cudaGetDeviceProperties it's: [2147483647, 65535, 65535]. I looked my card up in the CUDA wiki and it turned out, that my device qery was not lying.

However, I had a serious problem testing the huge grid size I had unexpectedly discovered. My vector addition compiled successfully but returned errors and could not be profiled when the vector size was greater than 65535 (blocks) * 1024 (threads). I started two threads on NVIDIA dev forum and on stackoverflow and finally, this brilliant guy helped me.

What was the problem then? I didn't change the default Visual Studio setting which defines the arch and code parameters of nvcc. To change it to proper values for your card, please take a look at the CUDA wiki to locate your GPU and modify the following setting:

Now I can go beyond that old limit of 65535 and continue my research.

***I have updated my CUDA Hello World tutorial so you can update that setting as you create the project.***

Tuesday, 4 June 2013

CUDA 5.0 - optimising vector addition

This time I plan to concentrate just on optimising my vector addition a little bit. In my previous example I had a statically defined number of blocks and threads used by my kernel 'AddVectorsKernel'. This time I'd like to make it a bit more adjustable. I also got a slightly better syntax highlighter because the old one was pretty much unreadable.

To begin with, this is my vanilla code without the CPU vector addition, just plain CUDA. I will be introducing and explaining some changes to it as I go along.

#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <stdio.h>
#include <time.h>

#pragma comment(lib, "cudart") 

typedef struct 
{
    float *content;
    const unsigned int size;
} pjVector_t;

__global__ void AddVectorsKernel(float *firstVector, float *secondVector, float *resultVector)
{
    unsigned int index = threadIdx.x + blockIdx.x * blockDim.x;
    resultVector[index] = firstVector[index] + secondVector[index];
}

int main(void)
{
    const unsigned int vectorLength = 1000000;
    const unsigned int blocks = 1000;
    const unsigned int threads = 1000;
    const unsigned int vectorSize = sizeof(float) * vectorLength;

    pjVector_t firstVector = { (float *)calloc(vectorLength, sizeof(float)), vectorLength };
    pjVector_t secondVector = { (float *)calloc(vectorLength, sizeof(float)), vectorLength };
    pjVector_t resultVector = { (float *)calloc(vectorLength, sizeof(float)), vectorLength };

    float *d_firstVector;
    float *d_secondVector;
    float *d_resultVector;

    cudaMalloc((void **)&d_firstVector, vectorSize);
    cudaMalloc((void **)&d_secondVector, vectorSize);
    cudaMalloc((void **)&d_resultVector, vectorSize);

    for (unsigned int i = 0; i < vectorLength; i++)
    {
        firstVector.content[i] = 1.0f;
        secondVector.content[i] = 2.0f;
    }

    cudaMemcpy(d_firstVector, firstVector.content, vectorSize, cudaMemcpyHostToDevice);
    cudaMemcpy(d_secondVector, secondVector.content, vectorSize, cudaMemcpyHostToDevice);
    AddVectorsKernel<<<blocks, threads>>>(d_firstVector, d_secondVector, d_resultVector);
    cudaMemcpy(resultVector.content, d_resultVector, vectorSize, cudaMemcpyDeviceToHost);

    free(firstVector.content);
    free(secondVector.content);
    free(resultVector.content);

    cudaFree(d_firstVector);
    cudaFree(d_secondVector);
    cudaFree(d_resultVector);
    cudaDeviceReset();

    return 0;
}

First of all, I'm going to modify the number of threads per block to be as high as possible on my machine to decrease the number of used blocks. To do this, I'm calling cudaGetDeviceProperties function and reading the maxThreadsPerBlock field from the populated structure cudaDeviceProp. I also got rid of const keyword in for blocks and threads since they are populated dynamically.

#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <stdio.h>
#include <math.h>

#define VECTOR_LENGTH 1000000

#pragma comment(lib, "cudart") 

typedef struct 
{
    float *content;
    const unsigned int size;
} pjVector_t;

__global__ void AddVectorsKernel(float *firstVector, float *secondVector, float *resultVector)
{
    unsigned int index = threadIdx.x + blockIdx.x * blockDim.x;

    if(index < VECTOR_LENGTH)
    {
        resultVector[index] = firstVector[index] + secondVector[index];
    }
}

int main(void)
{
    const unsigned int vectorSize = sizeof(float) * VECTOR_LENGTH;
    int threads = 0;
    unsigned int blocks = 0;
    cudaDeviceProp deviceProperties;

    cudaGetDeviceProperties(&deviceProperties, 0);

    threads = deviceProperties.maxThreadsPerBlock;
    blocks = (unsigned int)ceil(VECTOR_LENGTH / (double)threads);

    pjVector_t firstVector = { (float *)calloc(VECTOR_LENGTH, sizeof(float)), VECTOR_LENGTH };
    pjVector_t secondVector = { (float *)calloc(VECTOR_LENGTH, sizeof(float)), VECTOR_LENGTH };
    pjVector_t resultVector = { (float *)calloc(VECTOR_LENGTH, sizeof(float)), VECTOR_LENGTH };

    float *d_firstVector;
    float *d_secondVector;
    float *d_resultVector;

    cudaMalloc((void **)&d_firstVector, vectorSize);
    cudaMalloc((void **)&d_secondVector, vectorSize);
    cudaMalloc((void **)&d_resultVector, vectorSize);

    for (unsigned int i = 0; i < VECTOR_LENGTH; i++)
    {
        firstVector.content[i] = 1.0f;
        secondVector.content[i] = 2.0f;
    }

    cudaMemcpy(d_firstVector, firstVector.content, vectorSize, cudaMemcpyHostToDevice);
    cudaMemcpy(d_secondVector, secondVector.content, vectorSize, cudaMemcpyHostToDevice);
    AddVectorsKernel<<<blocks, threads>>>(d_firstVector, d_secondVector, d_resultVector);
    cudaMemcpy(resultVector.content, d_resultVector, vectorSize, cudaMemcpyDeviceToHost);

    free(firstVector.content);
    free(secondVector.content);
    free(resultVector.content);

    cudaFree(d_firstVector);
    cudaFree(d_secondVector);
    cudaFree(d_resultVector);
    cudaDeviceReset();

    return 0;
}

After running this code, performance dropped to 833[µs] (~5%). 'Did I do something wrong?', I was asking myself. And actually I did - that 'if' statement in the kernel code was that 'something wrong'. Or maybe not entirely wrong, but not very well placed. I thought initially: how come now, since I'm utilising all possible threads in every block (well, maybe except the last one - it will have 448 unused threads, 977 * 1024 - 1000000 = 448), my kernel uses more time to do its job than previously, where the number of unused threads was a lot (~53 times!) higher? The answer was that 'if' - I quickly realised that every thread in every block had to evaluate it 1 million (+448) times. Why do that since I can precisely adjust the length of my vector (+ that 448 elements) so that 'if' is not needed anymore and I will never go outside my vector in the kernel?

I'm obviously just playing with the toolkit right now so it's not a problem to add something here and tweak something there to achieve my goal, but even in the real life scenario (which I highly doubt this code will ever be used for!) you could do the same thing - add some extra elements to you vector and fill them with zeros not to check if you're within your vector in the kernel code.

Let's do some (integer!) math:

  • Unused threads in my previous approach: (1024 - 1000) * 1000 = 24000 (!)
    Redundant threads: 0 (all working threads are important)
    index checked: 0 times
  • Unused threads in my current approach: ((1000000 / 1024) * 1024) - 1000000 + 1024 = -576 + 1024 = 448 (they are doing nothing)
    Redundant threads: 0 (all working threads are important)
    index checked: 1000448 times (!)
  • Unused threads in my new approach: 0 (all are used)
    Redundant threads: 1000448 - 1000000 = 448 (they are just adding zeros)
    index checked: 0 times
Now let's see some code (I'm not filling these last 448 vector items with zeros, but you get the idea):
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <stdio.h>
#include <math.h>

#define VECTOR_LENGTH 1000448

#pragma comment(lib, "cudart") 

typedef struct 
{
    float *content;
    const unsigned int size;
} pjVector_t;

__global__ void AddVectorsKernel(float *firstVector, float *secondVector, float *resultVector)
{
    unsigned int index = threadIdx.x + blockIdx.x * blockDim.x;
    resultVector[index] = firstVector[index] + secondVector[index];
}

int main(void)
{
    const unsigned int vectorSize = sizeof(float) * VECTOR_LENGTH;
    int threads = 0;
    unsigned int blocks = 0;
    cudaDeviceProp deviceProperties;

    cudaGetDeviceProperties(&deviceProperties, 0);

    threads = deviceProperties.maxThreadsPerBlock;
    blocks = (unsigned int)ceil(VECTOR_LENGTH / (double)threads);

    pjVector_t firstVector = { (float *)calloc(VECTOR_LENGTH, sizeof(float)), VECTOR_LENGTH };
    pjVector_t secondVector = { (float *)calloc(VECTOR_LENGTH, sizeof(float)), VECTOR_LENGTH };
    pjVector_t resultVector = { (float *)calloc(VECTOR_LENGTH, sizeof(float)), VECTOR_LENGTH };

    float *d_firstVector;
    float *d_secondVector;
    float *d_resultVector;

    cudaMalloc((void **)&d_firstVector, vectorSize);
    cudaMalloc((void **)&d_secondVector, vectorSize);
    cudaMalloc((void **)&d_resultVector, vectorSize);

    for (unsigned int i = 0; i < VECTOR_LENGTH; i++)
    {
        firstVector.content[i] = 1.0f;
        secondVector.content[i] = 2.0f;
    }

    cudaMemcpy(d_firstVector, firstVector.content, vectorSize, cudaMemcpyHostToDevice);
    cudaMemcpy(d_secondVector, secondVector.content, vectorSize, cudaMemcpyHostToDevice);
    AddVectorsKernel<<<blocks, threads>>>(d_firstVector, d_secondVector, d_resultVector);
    cudaMemcpy(resultVector.content, d_resultVector, vectorSize, cudaMemcpyDeviceToHost);

    free(firstVector.content);
    free(secondVector.content);
    free(resultVector.content);

    cudaFree(d_firstVector);
    cudaFree(d_secondVector);
    cudaFree(d_resultVector);
    cudaDeviceReset();

    return 0;
}
And there it is: 756.479[µs] - time needed to process more data was actually ~5% shorter!



That's it for now, thanks for reading, I hope it was useful or at least interesting.

Monday, 3 June 2013

CUDA 5.0 first calculations

OK - now, since I have my 'Hello World' program done, I think it would make sense to write something that actually uses CUDA processing power. Inspired by the CUDA introductory video, I want to write a simple vector addition kernel and run a quick test on how it compares to the CPU processing. I haven't mentioned it earlier, but I'm writing and running this code on my GeForce GTX 660M laptop GPU. Let's get started!

My first piece of code runs on the CPU and its only purpose is to add two vectors (of size one million, 1,000,000) together.

#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <stdio.h>
#include <time.h>

#pragma comment(lib, "cudart") 

typedef struct 
{
 float * const content;
 const unsigned int size;
} pjVector_t;

void AddVectors(const pjVector_t * const firstVector, const pjVector_t * const secondVector, pjVector_t * const resultVector);

int main(void)
{
 unsigned int vectorSize = 1000000;
 double cpuTime;
 clock_t begin, end;
 pjVector_t firstVector = { (float *)calloc(vectorSize, sizeof(float)), vectorSize };
 pjVector_t secondVector = { (float *)calloc(vectorSize, sizeof(float)), vectorSize };
 pjVector_t resultVector = { (float *)calloc(vectorSize, sizeof(float)), vectorSize };

 for (unsigned int i = 0; i < vectorSize; i++)
 {
  firstVector.content[i] = 1.0f;
  secondVector.content[i] = 2.0f;
 }

 begin = clock();

 AddVectors(&firstVector, &secondVector, &resultVector);

 end = clock();
 cpuTime = (double)(end - begin) / CLOCKS_PER_SEC;

 printf("Result vector calculated in: %f[sec]\n", cpuTime);
 getchar();


 free(firstVector.content);
 free(secondVector.content);
 free(resultVector.content);
 
 return 0;
}

void AddVectors(const pjVector_t * const firstVector, const pjVector_t * const secondVector, pjVector_t * const resultVector)
{
 for (unsigned int i = 0; i < firstVector -> size; i++)
 {
  resultVector -> content[i] = firstVector -> content[i] + secondVector -> content[i];
 }
}


Even though it's single-threaded it's pretty fast and takes from 5 to 6 milliseconds to calculate the result vector.

My second piece of code combines GPU and CPU code to compare the performance. A couple of things before I show the code though:

  • To simplify passing parameters to my kernel, I ditched the structures and chose plain float arrays.
  • Although I'm sure it can be done a lot more efficient, I'm running my CUDA code using 1000 blocks and 1000 threads each (1,000 * 1,000 = 1,000,000) just to show how the problem is divided into sections.
  • Performance is measured by an external tool, NVIDIA Visual Profiler which comes with the CUDA Toolkit.
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <stdio.h>
#include <time.h>

#pragma comment(lib, "cudart") 

typedef struct 
{
 float *content;
 const unsigned int size;
} pjVector_t;

__global__ void AddVectorsKernel(float *firstVector, float *secondVector, float *resultVector)
{
 unsigned int index = threadIdx.x + blockIdx.x * blockDim.x;
 resultVector[index] = firstVector[index] + secondVector[index];
}

void AddVectors(const pjVector_t * const firstVector, const pjVector_t * const secondVector, pjVector_t * const resultVector);

int main(void)
{
 const unsigned int vectorLength = 1000000;
 const unsigned int blocks = 1000;
 const unsigned int threads = 1000;
 const unsigned int vectorSize = sizeof(float) * vectorLength;
 double cpuTime;
 clock_t begin, end;

 pjVector_t firstVector = { (float *)calloc(vectorLength, sizeof(float)), vectorLength };
 pjVector_t secondVector = { (float *)calloc(vectorLength, sizeof(float)), vectorLength };
 pjVector_t resultVector = { (float *)calloc(vectorLength, sizeof(float)), vectorLength };

 float *d_firstVector;
 float *d_secondVector;
 float *d_resultVector;

 cudaMalloc((void **)&d_firstVector, vectorSize);
 cudaMalloc((void **)&d_secondVector, vectorSize);
 cudaMalloc((void **)&d_resultVector, vectorSize);

 for (unsigned int i = 0; i < vectorLength; i++)
 {
  firstVector.content[i] = 1.0f;
  secondVector.content[i] = 2.0f;
 }

 // CPU calculatons
 begin = clock();

 AddVectors(&firstVector, &secondVector, &resultVector);

 end = clock();
 cpuTime = (double)(end - begin) / CLOCKS_PER_SEC;
 // - CPU calculatons

 // GPU calculatons
 cudaMemcpy(d_firstVector, firstVector.content, vectorSize, cudaMemcpyHostToDevice);
 cudaMemcpy(d_secondVector, secondVector.content, vectorSize, cudaMemcpyHostToDevice);

 AddVectorsKernel<<<blocks, threads>>>(d_firstVector, d_secondVector, d_resultVector);

 cudaMemcpy(resultVector.content, d_resultVector, vectorSize, cudaMemcpyDeviceToHost);
 // - GPU calculatons

 free(firstVector.content);
 free(secondVector.content);
 free(resultVector.content);

 cudaFree(d_firstVector);
 cudaFree(d_secondVector);
 cudaFree(d_resultVector);
 cudaDeviceReset();

 printf("CPU result vector calculated in: %f[ms]\n", cpuTime * 1000.0);

 getchar();
 
 return 0;
}

void AddVectors(const pjVector_t * const firstVector, const pjVector_t * const secondVector, pjVector_t * const resultVector)
{
 for (unsigned int i = 0; i < firstVector -> size; i++)
 {
  resultVector -> content[i] = firstVector -> content[i] + secondVector -> content[i];
 }
}



This time the magic really happened: vector addition takes only 792.166[µs], which is roughly 15% of the previous value. That's simply amazing for the first try. Let me show some screenshots from the profiler:



But what really made me smile was the warning messages from the profiler:



It actually complains about the time needed to copy the data being longer than the time of the calculations! And see how much time the cudaMalloc needed? As it turns out, adding two vectors of size 1,000,000 is the smallest problem here! I am really pleased with my first real CUDA test and it certainly inspires me even more.

CUDA 5.0 Hello World

Time for my first CUDA 'Hello World' program. Well, there won't be any GPU processing in it, but at least I'll prove that my environment is fully operational. Let's get to work!

I'll start by creating new Win32 Console Application (selecting 'Empty Project' option). Although it is a Visual C++ project, I will try to stick to C as I simply feel more confident using this language and don't want the OOP getting in my way.

First of all, since I use 64-bit windows, I changed my program to target the x64 platform.

Next, I selected CUDA 5.0 targets in my project's build customization options.

And I modify my project's Include and Library directories adding $(CUDA_INC_PATH) and $(CUDA_LIB_PATH) respectively (these settings can be located in project's properties menu -> Configuration Properties -> VC++ Directories).

Next thing is to change the nvcc parameters (by default they are set to compute_10,sm_10 which you probably don't want). To utilise full potential of your GPU, please adjust these parameters according to CUDA wiki. In my case it is compute_30,sm_30.

Last thing to do before I started writing code was to change CUDA/C++ target machine platform, which was still 32-bit.

And finally, my project nicely compiles (Intellisense complains about the angle brackets though, but I have yet to find an answer for this one).

#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <stdio.h>

#pragma comment(lib, "cudart") 

__global__ void mykernel(void)
{
}

int main(void)
{
 mykernel<<<1,1>>>();
 printf("Hello World!\n");
 getchar();
 
 return 0;
}

CUDA 5.0 and Visual Studio 2012 installation

I've been experimenting with OpenGL for a while and now I would like to take the next step and check out CUDA. People say amazing things about it and I figured it's about time for me to try it as well. However, the installation process is not as easy as I initially anticipated. After downloading CUDA Toolkit v5.0 and Visual Studio 2012 Express Edition, I found out that these two simply don't like each other and some extra effort had to be invested to get CUDA and VS up and running on my machine. Here's what I had to do:

  • I copied the content of this folder:
    C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\extras\visual_studio_integration\MSBuildExtensions
    to a safe location (let's say a folder on your desktop). There should be 4 files in total:
    • CUDA 5.0.props
    • CUDA 5.0.targets
    • CUDA 5.0.xml
    • Nvda.Build.CudaTasks.v5.0.dll
    Only first two will be modified but I copied them all to have everything in one place.
  • In my safe folder, I edited the CUDA 5.0.props file and added an extra CudaClVersion node. Was:
    <CudaClVersion Condition="'$(PlatformToolset)' == 'v90'">2008</CudaClVersion> <CudaClVersion Condition="'$(PlatformToolset)' == 'v100'">2010</CudaClVersion>
    Now:
    <CudaClVersion Condition="'$(PlatformToolset)' == 'v90'">2008</CudaClVersion> <CudaClVersion Condition="'$(PlatformToolset)' == 'v100'">2010</CudaClVersion> <CudaClVersion Condition="'$(PlatformToolset)' == 'v110'">2010</CudaClVersion>
  • Next, in the same folder, I modified the CUDA 5.0.targets file.
    • <CudaCleanDependsOn> node:
      Was:
      <CudaCleanDependsOn>
       AddCudaCompileMetadata;
       ValidateCudaBuild;
      </CudaCleanDependsOn>
      Now:
      <CudaCleanDependsOn>
       $(CudaCompileDependsOn);
       _SelectedFiles;
       CudaFilterSelectedFiles;
       AddCudaCompileMetadata;
       AddCudaLinkMetadata;
       AddCudaCompileDeps;
       AddCudaCompilePropsDeps;
       ValidateCudaBuild;
       ValidateCudaCodeGeneration;
       ComputeCudaCompileOutput;
       PrepareForCudaBuild
      </CudaCleanDependsOn>
    • <CudaCompile> node, GenerateRelocatableDeviceCode attribute:
      Was:
      GenerateRelocatableDeviceCode=""
      Now:
      GenerateRelocatableDeviceCode="%(CudaCompile.GenerateRelocatableDeviceCode)"
    • <CudaCompile> node, CodeGeneration attribute:
      Was:
      CodeGeneration=""
      Now:
      CodeGeneration="%(CudaCompile.CodeGenerationValues)"
    • <CudaCompile> node, CommandLineTemplate attribute:
      Was:
      CommandLineTemplate="&quot;$(CudaToolkitNvccPath)&quot; %(CudaCompile.ApiCommandLineTemplate) %(CudaCompile.CleanCommandLineTemplate)"
      Now:
      CommandLineTemplate="&quot;$(CudaToolkitNvccPath)&quot; %(CudaCompile.BuildCommandLineTemplate) %(CudaCompile.ApiCommandLineTemplate) %(CudaCompile.CleanCommandLineTemplate)"
  • Next, I copied all 4 files from my safe folder to the MSBuild folder, which I found here:
    C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\V110\BuildCustomizations
  • Intellisense: to enable CUDA and GLSL code hightlighting I modified following Text Editor settings (can be found under Visual Studio's Tools -> Options menu):

  • I added .cu to the list of included extensions (Tools -> Options -> Projects and Solutions -> VC++ Project Settings)

  • Last step of my preparations was to locate and edit the host_config.h file. It can be found here:
    C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\include
    Line 90 has to be changed from:
    #if _MSC_VER < 1400 || _MSC_VER > 1600
    to:
    #if _MSC_VER < 1400 || _MSC_VER > 1700
And that's it. CUDA 5.0 is integrated with Visual Studio 2012. I would like to thank Alan Tatourian for explaining how to do this, please find his blog here if you wish.