Using GPUs on the Scholar cluster is relatively straightforward. When you log in, you land on a front-end (fe) node such as scholar-fe00; jobs and interactive sessions scheduled normally run from there on standard worker nodes.
If you'd like to use GPU-enabled features, or hardware for advanced HPC, you need to request a session with the specs or hardware you want to use. You can do that with a command like this:
sinteractive -A gpu --gres=gpu:1 -t 4:00:00
This will attempt to grant you an interactive session with the resources you've specified (here, the gpu account via -A, a single GPU via --gres=gpu:1, and a four-hour limit). You should see an interactive session start. Note that you're not guaranteed any particular node, as you can see in the below example:
waltba04@scholar-fe00:[introCuda] $ sinteractive -A gpu --gres=gpu:1 -t 4:00:00
salloc: Granted job allocation 443481
salloc: Waiting for resource configuration
salloc: Nodes scholar-g003 are ready for job
bash-5.1$
My session landed on a node in the G sub-cluster (scholar-g003). That's fine: my actual constraint was to receive a single GPU, and the scheduler provided the first fitting node available. If you were to need very specific hardware, you would want more detailed constraints.
Once you have a GPU-enabled node, you can load the necessary modules. You should load CUDA and a compatible version of GCC, which can be done with the following command:
bash-5.1$ module load gcc/11.4.1 cuda
Due to MODULEPATH changes, the following have been reloaded:
1) openmpi/5.0.5
The following have been reloaded with a version change:
1) gcc/14.1.0 => gcc/11.4.1
This gives you access to nvcc, the NVIDIA CUDA compiler. It accepts *.cu files and follows the same general usage patterns as gcc:
bash-5.1$ nvcc helloWorld.cu -o helloWorld
You can then execute the GPU code. Try it with the RCAC's examples here.
If you need to see the type of GPU and its performance, you can use the following command:
bash-5.1$ nvidia-smi
Fri Apr 10 10:32:25 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08 Driver Version: 575.57.08 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla V100-PCIE-16GB On | 00000000:3B:00.0 Off | 0 |
| N/A 33C P0 24W / 250W | 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Realistically, batch jobs need to be distributed/allocated to sub-clusters, and if resources aren't available, the job has to wait. The sinteractive command is a wrapper that lets you schedule interactive workloads more conveniently than salloc or sbatch alone. However, if you need to schedule a workload normally, you can do so using sbatch or salloc (assuming some knowledge of SLURM).
sbatch --nodes=1 --ntasks=20 --constraint=A myjobsubmissionfile.sub
This is what should be done for compute intensive tasks, or tasks needing a very specific hardware profile.
I provide this as an addendum because I (1) felt like making it, and (2) don’t think the RCAC docs give a very strong GPU intro.
Imagine we're trying to solve a single complex problem, that has an unpredictable workload, and might include computationally difficult operations. We don't know for sure what operations we'll be doing ahead of time (lots of branching). Let's consider a model for this, what type of “worker” we want.
Because we don't know ahead of time what data we'll need, we need to be prepared to access data randomly. Deciding what data to access, and what operations to do on it, will require some complex control logic. We'll need working space (cache, registers) for all of this. We'll also need a strong toolset to support all of this (ALU)
This gives us our typical CPU model.
- Random data access
- Complex control logic (flags, Instruction Pointer)
- Working space (registers, cache)
- Strong ALU and supporting features, e.g.
- Branch prediction (where might we access next)
- Complex data ops
- Easy pointer access (for branching)
- Atomic operations (lock cache, memory lines)
If we wanted to parallelize this task, we'd have to think about how to, given our constraints. We don't necessarily know which data will be needed ahead of time, or what we'll do to it.
Because of this, we couldn't just parallelize a fixed amount of work. We also can't parallelize the task between inputs/instances because they might diverge. This means we need separate control units, and separate working spaces.
To parallelize, we need to parallelize operationally — we want to complete more tasks (task parallelism). We need to either break our task into multiple separate tasks that can be completed in parallel and assign them to different workers, or give the instructions we execute higher throughput (SIMD).
Letting subtasks be completed in parallel requires either more workers (CPU cores) or more task switching (threads). Threads on a CPU need access to the CPU’s working space and control logic, and so threads tend to take turns controlling their respective cores (thread 1 might branch where thread 2 doesn't, or thread 3 might have 20% more elements to work on).
This model works for our problem, but what if we had a different problem? What if we:
- Had simple fundamental operations to perform
- Knew what operations we planned to perform ahead of time, and their order
- Knew what memory belonged to what problems or sub-problems ahead of time
- Had known problem independence (completing problem 1 isn't necessary to complete problem 2)
In this case, rather than wanting high task parallelism, we want higher data parallelism. We want to do the same task, just with more data items at once.
For this model, adding more CPU cores or threads still works. However, the majority of those units would be wasted. The task at hand lacks branching complexity, it doesn't use most of the ALU features, we can predict data access patterns, and since we plan to repeat the same operations everywhere we don't need separate control units.
This makes CPUs an expensive and large (physically) option. Fitting a lot of cores on a CPU die is difficult due to their complexity, and they are harder to reach due to their size (this makes engineering harder, and distance increases latency). But we aren't gaining anything from the size and complexity we're sacrificing for.
Instead, we need to be able to quickly and directly map problems to a bunch of simpler workers.
Let's look at where we can make space savings:
- Due to lacking branch complexity, we can reduce control logic size
- With simpler operations and memory access patterns, we can reduce working space (cache and registers)
- If we simplify the control logic, operations, and memory, we can simplify the ALU
- No large register operations
- No complex instructions
- Fewer updates to control logic (ie op flags)
We now have a streamlined processor made for handling simple linear tasks. We are now approaching a GPU core.
We can also take a look at our threading model. With the old CPU model, we weren't sure what each thread would need, so they had to occupy the entire core to have full access to features they need.
With our new simplified model we know all threads will actually be doing the same thing — they'll execute the same instruction, they'll follow the same branches, they'll execute in the same timespan. The only thing changing between threads is the data they're operating on, which has already been loaded into their local registers.
I say that we "know" these things, but really we're designing around the assumption. We know the use case we're designing for, and if a user deviates from that use case or work model, they're accepting the possibility of performance hits or errors. We're constraining our computation model to get design benefits.
With this in mind, we can make further improvements to parallelism. We can move things like the instruction pointer and branching controls up a level. We can simultaneously send a command to multiple threads at once, simultaneously read data, and simultaneously execute on that data. With the GPU all threads can be hardware threads, and can be collected into one unit. We call this model SIMT.
Under the SIMT model, we can assign a bunch of data units to a massive amount of smaller processors dedicated to handling simpler parallel tasks.
We now need to develop a system for programmers. We have threads, groups/units of threads, and the GPU’s shared memory itself. A programmer wants to be able to use these to apply a function to a lot of data. This is where CUDA comes in.
Our goal is to execute a function in parallel on a massive amount of data. We'll call this function a kernel in the context of the GPU.
Formally, a kernel is a procedure that is called/launched once and returns once, but is executed multiple times simultaneously in parallel.
What we want to provide is a model where a programmer can write a kernel for a thread block, and write kernels for individual threads within a block. Once a programmer has these kernels, they should be able to easily scale up by assigning more data and more threads to work it.
Remember that GPUs are optimized to batch instructions and perform them simultaneously. GPUs also have a massive number of hardware threads, with each thread executed by a CUDA core. Threads in a unit operate in parallel and share the same instruction pointer. This is the Single Instruction, Multiple Threads (SIMT) model, which Nvidia uses to parallelize work.
The separation between SIMD and SIMT is the structure of data. SIMD passes vectors (sequential lists) of data to an operation, which is applied to the vector as a whole. The vector can then be split into individual elements. SIMT executes an instruction in lockstep across multiple threads, but the data itself is separate between threads. This means SIMT can diverge on control flow and memory access (although, at a performance cost).
When we send instructions to the GPU, we'll send them as a block of threads for the GPU to execute (we'll get to this later). Each individual thread is a task we want the GPU to execute. The blocks are simply for us to easily batch multiple threads together, knowing they'll execute in parallel.
The fundamental unit of execution for the GPU (under the cuda framework) is a warp. A warp is a unit of 32-threads that executes in lockstep. Under the SIMT model, an instruction is sent to an entire warp at one time. Newer devices have more intra-warp freedom, but this will be ignored (and is generally detrimental to performance).
Because of this, when we send the GPU a block of threads, it will be a multiple of 32. A block is essentially a collection of warps. Neither blocks nor warps are hardware units, they're both software abstractions. A warp is just a set of 32 threads scheduled together by the SM. A block is just a collection of warps that can execute independently of other blocks.
The warp SIMT model is the reason conditional branching can be expensive on a GPU. Imagine, for a moment, a warp with three threads. If the warp's three threads had three different conditions to execute, the SIMT model would require the warp to execute condition A's code on all threads and mask out the result for two of them. It would then repeat, executing condition B's code on all threads and masking out the result for two. This process continues until all threads are satisfied — wasting massive amounts of parallel throughput.
Once you've decided a block size (which we'll address later), you need to determine how many blocks to send the GPU. For convenience, the GPU provides the option to organize blocks as a 3D array. This allows matrices and arrays to easily be mapped directly over. But in general, we need to decide how many blocks we're sending and how to organize them (how many will you have in the X, Y, and Z directions). For example if we have 16 blocks, we could create a 4x4x1 grid.
Once we have a grid, we need to assign its blocks to workers. We do this by sending them to a Streaming Multiprocessor (SM) — which is essentially a multiprocessor core of the GPU. Blocks are assigned to SMs based on availability. For instance, if your GPU had 4 SMs, each could handle 4 of our 16 blocks; if it had 8 SMs, each could handle 2. Because blocks are independent, it doesn't matter whether they execute in parallel or sequentially.
Warp blocks are managed by the Streaming Multiprocessor (SM). The SM has a pool of CUDA cores, for instance 1024 (32 warps), which it will assign to process threads from a block. When doing this assignment, it will break the block size into a whole multiple of warps and execute them.
Because threads are isolated in their work, this scheduling can happen either in parallel or sequentially. The CUDA runtime is the only party that needs to know this.
If multiple threadblocks & their resource requirements can fit on an SM, they can be scheduled together on the same SM.
If a thread block requires more resources than an SM can provide, the block will be rejected; blocks will NOT be split among SMs (this is because the SM handles the synchronization of a block's results). This means block size needs to be tuned to the SM in order to maximize resource usage on a given GPU. If only part of a block would fit, the GPU will not schedule it partially; it will instead leave the excess space unused.
Additionally, threadblocks MUST be able to execute independently of one another. This allows them to be scheduled sequentially or in parallel across cores/SMs, and it's this model that allows CUDA to automatically scale workloads up or down on different hardware.
Threads within the same block, however, can synchronize and share memory (this is why a block will not be split). A warp has built-in thread-to-thread register syncing primitives. Specifically, threads within a warp can copy each other's registers, see which threads within the warp are active, and can request memory fencing (which allows ordering of operations on a shared variable within a warp). Between warps, threads must use the Streaming Multiprocessor's shared memory or shared cache, and can use certain CUDA primitives to synchronize.
The final memory level is the GPU's RAM (global memory). This RAM is shared between multiprocessors, and can be used for inter-block operations or reductions.
The general memory hierarchy is thread-local registers > shared cache > shared memory > global memory. One of the critical considerations for performance-critical code is how quickly we can map data to the most local level.
This is a light overview, but is enough that we can begin looking at some of CUDA’s primitives and workflows.
The first relevant compiler directives for CUDA are the device specifiers. The CPU and its memory are physically separate from the GPU and its memory. This means they'll run their own instructions, and have their own data!
Because of this, we need to specify which functions we want the CPU and/or GPU to have access to and execute.
Our first option, and the default, is __host__. This qualifier just means the function runs on the CPU. Since it's the default behavior, adding it explicitly is mostly useful for disambiguation.
The next option is __device__, which specifies a function that runs on the GPU. These functions can only be called from code compiled with the __device__ or __global__ qualifiers.
Finally, there's the __global__ qualifier. __global__ functions (kernels) also run on the GPU, but they have calling code on the CPU side that allows the CPU to launch their execution. Your entry point to a CUDA program will be through a __global__ kernel.
When we're creating a program, we need to include the cuda_runtime.h header for CUDA code. Here's a simple GPU program template:
#include <cuda_runtime.h>
#include <iostream>
#include <cstdint>
__host__ int main(){
// Basic Cuda template
return 0;
}
When we access the GPU through a __global__ kernel, we'll give it a grid size and a block size. The grid size will be in terms of blocks, and the block sizes will be in terms of threads. Your block sizes should be a multiple of 32 (the size of a warp), or you'll end up with partially filled warps, which means inefficiency. As mentioned earlier, the GPU will attempt to schedule blocks either sequentially or in parallel.
To look at why this is, we need to consider how parallelism on the GPU varies from parallelism on the CPU.
On the CPU, we were primarily worried about the issue of task separation. That is, does the completion of one task require another task to be completed, or does the completion of one task alter data in another task. This is called data dependency.
A good example is counting vs summation.
When we’re counting, we have an initial variable (set equal to 0) which stores the number of elements we’ve seen. We want to increase this count whenever we see a new element, however, to increase this count we need to know what it previously was. If it was 0, it becomes 1. If it was 1, it becomes 2, and so on. This is a simple successor operation.
So, what's the problem with scaling this? The fact that we have one shared variable, which needs to be both read and written for each operation. Consider what happens if 3 CPUs (assume they're single-core for the moment) attempt to read the value of count:
They will all read a value of 0. So what will they all write? They will all write the value 1, and attempt to read again. If one of the CPUs is under less load or just lucky, it might even be able to write its second value before another CPU can complete its read. This means the CPUs are racing to read and modify this value, and the count will be wrong (for the first operation, it should be 3 but will likely result in a count of 1).
We can see visually that the data dependency here is a single chain; there's nowhere we can break up this operation as it's posed, because presently we're simply counting up. What if we were instead trying to count a set of elements in parallel?
To count in parallel, we need to modify the data dependency, such that each CPU can count by itself, and then they can add their results. We can do this by creating a local copy for each task.
Now, the counting can be done in parallel. However, finalizing the true count requires the result from all CPUs, and so the process won’t end until every CPU finishes (one CPU with a heavier load could block the entire process).
If we were to consider the memory access pattern for the CPU, it would likely be strided 1 element at a time, at memory locations divided among CPUs as such:
It’s clear that each CPU has a relatively easy time accessing its memory, and if they needed to suddenly jump addresses (say a special count function, or a pointer to a list of elements) we know they could do so easily. Consider a case where each CPU’s task is to get a count for a different type of element:
This is less than ideal, but we know that the CPU could realistically fit everything into its larger cache, and easily jump around. The CPU can also easily retrieve a single memory unit, and could relatively cheaply search the entire array.
But recall that the GPU is structured wholly differently. It's a massively parallel processor, and we designed it to retrieve information in bulk for simultaneous processing. A GPU may have thousands of cores, so a 1-by-1 element access pattern like this would be inefficient, and would also cause a lot of contention. Instead of considering units of 32, we'll simplify to units of 4. This section is more about the concept, so don't get too attached to the details for now.
When we parallelize for the GPU, we could try a pattern like the one we tried on the CPU, where the GPU scans sequentially for elements of a specific type. That will be our first implementation, before we work toward a better and more optimized one. We'll have a one-warp block of the GPU request memory sequentially, sum it among threads, store the result somewhere, and move to the next bit of memory. We'll have a separate kernel behavior (but the same kernel) for each type of data we're interested in, and each will only count its data. The blocks will choose their behavior based on their position (addressed later; for now assume each thread knows where it is in a block, and each block knows where it is in a grid).
When the GPU requests data and does an operation, each individual thread within the warp has the same instruction, and gets data at an offset from the same region. A GPU against this dataset would access memory like this:
Notice that the GPU requests significantly more memory than it will ever need. Since the elements aren’t ordered, and have no structure, the majority of the GPU’s reads will be thrown away and the three warps will fight over resources. Let’s consider the case for ordered data: before we give the GPU data, we order it for each warp to process.
We can now see that instead of each warp needing to scan the entire array, each only needs to scan two pieces. This is something we could structure as a 3x2x1 grid of block operations:
However, this still isn't necessarily optimal. Notice that we're occasionally reading elements from other warps' sections, meaning we're still masking out data (a conditional operation, so our warp has to branch and execute twice). We also access shared regions, so there might be contention between neighboring warps.
Consider if we instead created a zero element, and padded our dataset. This would make sure our data is aligned to the size of the warp, doesn’t need to mask out any elements (the zero elements just count as zero items), and still finishes in just two operations.
There are also optimizations on the grid shape and block size, though this is an extension of good memory access patterns. Consider the following grids:
While the same kernel is running, they’re doing slightly variant things and accessing slightly different locations. Keeping things as homogenous as possible is ideal.
Stepping back for a moment, recall the example we discussed earlier, just normal counting. Not counting data elements, but counting sequentially.
That process is a linear dependency chain. Because of this, there’s nowhere we can parallelize it. A CPU can handle this task well by itself, efficiently.
Other processes will have highly dependent, but not fully dependent, data. These are tasks with many serial (step-to-step) segments or computations, but possibly multiple discrete paths for operations to go down. A good example is calculating Fibonacci numbers, or searching a graph, where a set of decisions/recursions might need to be made at every branching point. A CPU can relatively easily spawn a new task to continue down a path of the graph, whereas a GPU would waste a lot of time, as there are few individual localized things to compute.
Consider the following data dependency graph:
Here we represent data we have initially in green, data produced by a function in blue, and the functions themselves in orange. Notice that there’s very little consistency between run to run of the function. Some functions could be parallelized easily, others couldn’t. There’s not a consistent operational depth or target location. This task would run well on a CPU, but not a GPU.
Consider, instead, the following graph:
There is a clear structure to the data and operations. The data we initially have can all be mapped to a function call with a consistent structure, the length of the operation overall can be calculated easily, and the operations are repeated in a geometric fashion. Something like the summation of an array could easily produce a graph like this.
If you wanted to handle this serially, you'd need to perform N-1 operations sequentially, or divide the data up into sections and perform the sub-operations sequentially. Consider our examples of adding arrays using OpenMPI or OpenMP. Data needs to be sent out, calculated, collected, and reduced to a single value. We have a limited number of CPU cores or nodes, so we're likely only going to be able to divide the time by some constant factor C: a linear speedup.
The GPU can't remove this dependency either: there is still not perfect data independence, so after the first round of additions we need another. But now we can map every 2 numbers to a core (or map every number to a core and perform additions within a warp/block).
The GPU’s massive parallelism means we have an immense amount of cores. It also means 2 elements can be added by every core (or pair of cores) every instruction cycle. So we can divide the problem among N cores, and half the operation size every round (assuming we can fit all elements on the GPU).
The CPU's performance would be roughly T_CPU = N / C cycles (N elements divided among C cores), while the GPU's would be T_GPU = ceil(log2(N)) cycles (halving the problem each round).
Here’s a comparison graph (CPU in blue, GPU in purple)
While the CPU would take 781 instruction cycles with 64 cores to add 50,000 elements, a GPU (that could fit them all) would take 16 instruction cycles.
The performance difference is so massive that if the GPU had to divide that workload across 64 cores (operating with perfect parallelism, ignoring memory bottlenecks) and compute the sum 781 times in parallel across its cores (N/C = how many workloads the GPU needs to process), it would still only take 26 instruction cycles (log(N * N/C)).
It's worth noting that the CPU implementation does have a point where it dominates the GPU function. This is because there's a minimum number of elements necessary to take advantage of the efficiency of the GPU.
One can also consider purely parallel tasks, where there is no data dependency and a flat relationship between data and operations.
This is the sort of task GPUs were designed for. In this scenario, data can be perfectly mapped to a kernel thread, and computed with massive parallelism.
The general idea we're trying to get at here is mapping data directly to threads, logically and spatially. You generally want your data elements mapped directly to some thread or block of threads, as we did when we mapped the elements we wanted to count to threads.
For instance, if you have an image of size 1024x1024 pixels, and a warp size of 32 (as under CUDA) you can try to directly map sections of 32x32 pixels in the 1024 image to blocks in a CUDA grid.
So the first 32px by 32px block will go to the first block, the second 32x32px block to the second block, and so on. There is a direct 1-to-1 mapping between image pixels and threads, because there is absolutely no data dependency between the operations.
To perform this, each thread within the kernel needs to know which pixel it's processing — an idea we got at earlier. In the above example, since we mapped each block to a tile of the image and each thread to a pixel, each should have its own corresponding index.
When a kernel is created, part of the work the __global__ or __device__ qualifiers do is expose to the kernel the dimensions of the block and grid, and where in the block/grid a thread actually is, alongside a thread’s ID within a block (all 0-indexed). I’ll include with this repo a presentation from LAU that I believe explores this in depth (CUDA Thread Scheduling, by Haidar M. Harmanani) but essentially imagine your blocks as being spatially positioned within a grid and getting the following info:
threadIdx.x
threadIdx.y
threadIdx.z
blockIdx.x
blockIdx.y
blockIdx.z
blockDim.x
blockDim.y
blockDim.z
The GPU runtime system will scale the number of blocks you have to the number of Streaming Multiprocessors, so a good starting point on CUDA is a 1-dimensional grid (just a line) with enough blocks to solve your problem, and a block size between 128 and 256 that's a multiple of 32. If you need more performance, or want to try optimizing your program, experimenting with different block and grid sizes is an easy way to start; just note you'll need to update your index calculation.
If, for example, your grid is a 16x1x1 row, and each block within that grid is a 4x1 row, then you can calculate the index of each thread pretty easily. We just need to know the number of threads in a block, and the number of blocks in the grid.
If a block is 4x1, and each warp within it is 32 threads, then a block has 128 threads.
Let's say we wanted to find out what pixel a particular thread of the block in the 8th column was assigned to. If we wanted to calculate that linearly (not caring about x and y, just giving each thread a unique pixel), we would first figure out how many blocks are behind us, i.e., which block we are. We're at index 7 (column 8), so there are 7 blocks behind us:
We can then do 7*128 = 896. There are 896 threads behind us.
Thread IDs are created based on the dimensionality of the block. In the case of a 1D block like ours, all threads have a sequential threadIdx.x; suppose ours is 110.
We could then take 896 + 110 = 1006 to be our index into a linear buffer of pixels.
Of course, with our real code (and the example we went through) each pixel was mapped directly to a thread in a matching structure. We had a grid of data, so we created a grid of threads. Here's an example kernel that does a more realistic calculation (and again, consider looking at the attached PDFs for help).
An example kernel:
__global__ void grayscaleKernel(unsigned char* input, unsigned char* output, int width, int height) {
// Calculate global thread coordinates (x, y)
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
// Check boundary to ensure thread is within image range
if (x < width && y < height) {
int index = y * width + x; // Map 2D coordinate to 1D array index
// Example: Individual thread logic for one pixel
unsigned char pixelValue = input[index];
output[index] = pixelValue / 2; // Simple darkening effect
}
}
You'll notice a few things. First, the x and y position of the thread are calculated separately.
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
Then this x and y are used to calculate an index into the image data.
int index = y * width + x; // Map 2D coordinate to 1D array index
This has the same result of creating a 1-to-1 mapping of pixels to threads, but now blocks will gather data near their neighbors, which avoids the strided access we mentioned earlier.
Next, there’s not just an id calculation, there’s also a cutoff.
if (x < width && y < height) {Because we’re trying to do assignments one to one in parallel, there’s no parent checking if the kernel should execute on that data, or if the data exists. That’s the job of the kernel thread itself. They have to validate that there’s data for them, and do nothing if not.
Then our operation is calculated in parallel.
Now that we understand generally that a program is structured and formulated differently for a GPU, we need to actually address writing a complete program. For that, we introduce a new concept.
With the above kernel, we wrote to an arbitrary output buffer, but in the real world GPU and CPU memory are separate. If we want memory to exist on either, we need to declare it for that target. If we want to move values from one to the other, we need to copy the memory over.
First, let’s look at cudaMalloc.
cudaMalloc’s documentation shows that it expects a pointer to a pointer it can write the allocated address into, and the requested size of memory in bytes.
cudaError_t cudaMalloc (void ** devPtr,
size_t size
);

This is different from the C stdlib’s malloc, which takes a size, returns a pointer, and returns NULL on error. The difference exists because NVIDIA chose to have cudaMalloc return a status/error code, so the function must instead accept a pointer to the pointer you’ll be using.
Paired with cudaMalloc is cudaFree, which accepts a pointer allocated using cudaMalloc. An important note with GPU pointers and variables is on naming convention. It is good practice to prefix host/CPU variable names with h_ and device/GPU variable names with d_. This makes it easy to tell where a variable lives at a glance, which can save you debugging time.
Consider the following example usage of cudaMalloc.
…
float *d_data; // Pointer to be used on the device
size_t size = 1024 * sizeof(float);
// Allocate memory on the GPU
cudaError_t err = cudaMalloc((void**)&d_data, size);
if (err != cudaSuccess) {
// Handle error (e.g., cudaErrorMemoryAllocation)
}
// Free memory when finished
cudaFree(d_data);
…

The next relevant memory function is cudaMemcpy. Since GPU and CPU memory are separate, we need a way to copy between devices. The cudaMemcpy documentation provides the following signature:
cudaError_t cudaMemcpy (void * dst,
const void * src,
size_t count,
enum cudaMemcpyKind kind
);

The function accepts dst and src pointers, alongside a count of bytes to copy and the kind of memory copy. The kind is passed as one of the following enum values:
CUDA memory copy types:
- cudaMemcpyHostToHost: Host -> Host
- cudaMemcpyHostToDevice: Host -> Device
- cudaMemcpyDeviceToHost: Device -> Host
- cudaMemcpyDeviceToDevice: Device -> Device
- cudaMemcpyDefault: Direction inferred from the unified virtual address space
You can see a more complete program example (excluding the actual kernel call) below:
#include <cuda_runtime.h>
#include <iostream>
int main() {
const int N = 1024;
size_t size = N * sizeof(float);
// 1. Allocate Host Memory
float *h_A = (float*)malloc(size);
for (int i = 0; i < N; i++) h_A[i] = 1.0f;
// 2. Allocate Device Memory
float *d_A;
cudaMalloc(&d_A, size);
// 3. Copy from Host to Device
// Direction: cudaMemcpyHostToDevice
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
/* --- Launch Kernel Here, do some stuff --- */
// 4. Copy results back from Device to Host
// Direction: cudaMemcpyDeviceToHost
cudaMemcpy(h_A, d_A, size, cudaMemcpyDeviceToHost);
// 5. Cleanup
cudaFree(d_A);
free(h_A);
return 0;
}

Notice that the variables are prefixed by their location (host/device), the pointers appear as dst then src to match the signature, and memory from both the CPU malloc and the GPU cudaMalloc is freed.
The next-to-last and most critical part of this entire operation is launching the kernels themselves. To launch a kernel on the GPU, you call it like a standard function, but you also need to pass in a grid size and block size, plus some optional qualifiers. All of these are passed in an NVIDIA/CUDA-specific syntax called the execution configuration (between the symbols <<< >>>). It has the following signature:
myKernel<<<numBlocks, threadsPerBlock, sharedMemBytes, stream>>>(arguments);

Where myKernel is defined as such:
__global__ void myKernel(float* data) {
// Code to execute on GPU
}

The execution configuration’s parameters are as follows:
- numBlocks: Number of thread blocks in the grid. This can be an integer for 1D grids, or a dim3 for multidimensional grids.
- threadsPerBlock: Number of threads per block (maximum of 1024, or less on older hardware). This can be an integer for 1D blocks, or a dim3 for multidimensional blocks.
- sharedMemBytes (Optional): Bytes of dynamic shared memory to allocate per block. Expects an integer. This is generally used when the amount of shared memory isn’t known at compile time, and ensures each block has that much shared memory available to its threads. Defaults to 0.
- stream (Optional): The CUDA stream (type cudaStream_t) to launch the kernel on, used for asynchronous execution. Defaults to the default stream (0).
In general, if you don’t specify a dimension, CUDA will fill in a 1 (i.e. `dim3(2, 2)` becomes `dim3(2, 2, 1)`).
Here’s an example of a kernel launch:
#include <iostream>
__global__ void add(int *a, int *b, int *c) {
*c = *a + *b;
}
int main() {
int h_c; // host result
int *d_a, *d_b, *d_c; // device pointers
// 1. Allocate GPU memory
cudaMalloc(&d_a, sizeof(int));
cudaMalloc(&d_b, sizeof(int));
cudaMalloc(&d_c, sizeof(int));
// 2. Initialize and copy to GPU
int a = 5, b = 7;
cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, &b, sizeof(int), cudaMemcpyHostToDevice);
// 3. Launch Kernel: 1 block, 1 thread
add<<<1, 1>>>(d_a, d_b, d_c);
// 4. Copy result back to CPU
cudaMemcpy(&h_c, d_c, sizeof(int), cudaMemcpyDeviceToHost);
std::cout << "5 + 7 = " << h_c << std::endl;
// 5. Cleanup
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}

Our final important detail is learning the basics of how to wait for the GPU to finish working. This isn’t strictly necessary; the GPU and CPU can communicate between or during tasks (consider a GPU processing your video game’s render pipeline in the background: it must constantly receive render data and supply rendered frames). However, for our class activity it’s likely the most reasonable approach.
When called, cudaDeviceSynchronize() simply blocks the calling CPU thread until all previously requested GPU operations have completed. You can read the details here.
Nvidia has a good blog on all of this from their end as well.
Revisiting the earlier example of a 1024x1024 pixel image, the following code computes the grayscale operation over it.
#include <stdio.h>
#include <cuda_runtime.h>
#define WIDTH 1024
#define HEIGHT 1024
__global__ void grayscaleKernel(unsigned char* input, unsigned char* output, int width, int height) {
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
if (x < width && y < height) {
int index = y * width + x;
unsigned char pixelValue = input[index];
output[index] = pixelValue / 2;
}
}
int main() {
int size = WIDTH * HEIGHT * sizeof(unsigned char);
// Host memory
unsigned char *h_input = (unsigned char*)malloc(size);
unsigned char *h_output = (unsigned char*)malloc(size);
// Initialize input (example: gradient)
for (int i = 0; i < WIDTH * HEIGHT; i++) {
h_input[i] = (unsigned char)(i % 256);
}
// Device memory
unsigned char *d_input, *d_output;
cudaMalloc((void**)&d_input, size);
cudaMalloc((void**)&d_output, size);
// Copy input to device
cudaMemcpy(d_input, h_input, size, cudaMemcpyHostToDevice);
// Define block and grid sizes
dim3 blockDim(16, 16); // 16x16 threads per block
dim3 gridDim(
(WIDTH + blockDim.x - 1) / blockDim.x,
(HEIGHT + blockDim.y - 1) / blockDim.y
);
// Launch kernel
grayscaleKernel<<<gridDim, blockDim>>>(d_input, d_output, WIDTH, HEIGHT);
// Wait for GPU to finish
cudaDeviceSynchronize();
// Copy result back to host
cudaMemcpy(h_output, d_output, size, cudaMemcpyDeviceToHost);
// (Optional) Print a few values
for (int i = 0; i < 10; i++) {
printf("%d -> %d\n", h_input[i], h_output[i]);
}
// Cleanup
cudaFree(d_input);
cudaFree(d_output);
free(h_input);
free(h_output);
return 0;
}
Try to analyze the flow of memory through the program. Where is data residing? And how are threads mapped to pixels?
Below are two complete programs comparing sequential vs. strided access patterns on a GPU. First, the sequential version:
#include <iostream>
#include <cuda_runtime.h>
#define N (1 << 24)
__global__ void good_kernel(float *a, float *out, int n) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid < n) {
atomicAdd(out, a[tid]);
}
}
int main() {
float *h = new float[N];
for (int i = 0; i < N; i++) h[i] = 1.0f;
float *d, *d_out, h_out = 0.0f;
cudaMalloc(&d, N * sizeof(float));
cudaMalloc(&d_out, sizeof(float));
cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemset(d_out, 0, sizeof(float));
int threads = 256;
int blocks = (N + threads - 1) / threads;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
good_kernel<<<blocks, threads>>>(d, d_out, N);
cudaEventRecord(stop);
cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
cudaEventSynchronize(stop);
float ms;
cudaEventElapsedTime(&ms, start, stop);
float expected = (float)N;
std::cout << "Good GPU: " << h_out << "\n";
std::cout << "Expected: " << expected << "\n";
std::cout << "Time (ms): " << ms << "\n";
cudaFree(d);
cudaFree(d_out);
delete[] h;
}

And the strided version:

#include <iostream>
#include <cuda_runtime.h>
#define N (1 << 24)
__global__ void bad_kernel(float *a, float *out, int n) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid < n) {
// Strided access: threads in the same warp read elements n/32 apart,
// so each warp touches 32 separate memory segments instead of one.
int idx = (tid / 32) + (tid % 32) * (n / 32);
atomicAdd(out, a[idx]);
}
}
int main() {
float *h = new float[N];
for (int i = 0; i < N; i++) h[i] = 1.0f;
float *d, *d_out, h_out = 0.0f;
cudaMalloc(&d, N * sizeof(float));
cudaMalloc(&d_out, sizeof(float));
cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemset(d_out, 0, sizeof(float));
int threads = 256;
int blocks = (N + threads - 1) / threads;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
bad_kernel<<<blocks, threads>>>(d, d_out, N);
cudaEventRecord(stop);
cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
cudaEventSynchronize(stop);
float ms;
cudaEventElapsedTime(&ms, start, stop);
float expected = (float)N;
std::cout << "Bad GPU: " << h_out << "\n";
std::cout << "Expected: " << expected << "\n";
std::cout << "Time (ms): " << ms << "\n";
cudaFree(d);
cudaFree(d_out);
delete[] h;
}

There’s more to cover. We didn’t address synchronization primitives or shared memory much, but this is a good general preface in my opinion. Feel free to extend this with your own additional details.
I know I didn’t cite superbly; I rushed to create this, and there’s a host of content that I got from a video, forum, or discussion and needed to verify. However, these are sources and citations I believe would be useful (I referenced them multiple times):
- https://cvw.cac.cornell.edu/gpu-architecture/gpu-characteristics/simt_warp - warps and SIMT
- https://www.digitalocean.com/community/tutorials/the-role-of-warps-in-parallel-processing - thread blocks & such
- https://modal.com/gpu-glossary/device-hardware/streaming-multiprocessor - Streaming Multiprocessors
- https://www.rcac.purdue.edu/knowledge/scholar/overview
- https://forums.developer.nvidia.com/t/how-to-choose-a-proper-grid-size/279119
- https://forums.developer.nvidia.com/t/the-choose-of-grid-size-and-block-size/292054
- https://slurm.schedmd.com/salloc.html
- https://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/html/index.html