CUDA: Why Modern GPUs and LLMs Depend on It

This article examines CUDA as the execution contract between software and GPU hardware.

CUDA’s Modern Relevance in GPU Architecture and LLM Design

CUDA is the Rosetta Stone between programmers and GPUs. But how is it used in modern GPU architecture, and why does it remain so critical in today’s AI-driven compute landscape?

Compute Unified Device Architecture, or CUDA, has been a cornerstone in GPU programming, acting as the soul of the hardware-software interface in modern GPU devices.

This article discusses how CUDA acts as more than a simple programming language, and how it contributes greatly to abstraction and parallelism within modern devices.


CUDA Fundamentals: Parallelism as a First-Class Primitive

To understand CUDA’s continued relevance, one must first examine how it encodes parallelism. GPUs are architected for throughput rather than latency. Unlike CPUs, which optimize for complex control flow and fast single-thread execution, GPUs are designed to execute massive numbers of simple operations concurrently. CUDA exists to make that execution model programmable.

CUDA exposes GPU parallelism through a hierarchical execution model composed of three primary units:

  • Threads
    Threads are the smallest execution unit in CUDA. Each thread executes a single instruction stream and typically corresponds to one element-wise operation, such as operating on a single tensor element.

  • Blocks
    Threads are grouped into blocks. Threads within a block can synchronize and share data through shared memory, a low-latency on-chip memory region. Blocks form the fundamental unit of cooperative parallelism.

  • Grids
    A grid is a collection of blocks launched to execute a kernel. Blocks within a grid execute independently and may be scheduled in any order, enabling large-scale parallelism.

This hierarchy is deliberately hardware agnostic. CUDA code written for a small GPU scales to a larger GPU without modification. The programmer specifies parallel structure, while the runtime and hardware map that structure onto available execution resources. This separation of intent and execution is one of CUDA’s most important architectural decisions.
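As a minimal sketch of how this hierarchy appears in practice (the kernel name and sizes here are illustrative assumptions, not drawn from any particular codebase), a launch specifies the grid and block dimensions explicitly:

```cuda
#include <cstdio>

// Hypothetical kernel: it does no work; it exists only to show how the
// thread/block/grid hierarchy is specified and indexed.
__global__ void exampleKernel(int n) {
    // Global thread index: block offset plus thread offset within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;  // guard threads that fall past the end of the data
}

int main() {
    int n = 1000;               // number of logical work items
    int threadsPerBlock = 256;  // block size: the cooperative unit
    // Ceiling division: enough blocks to cover all n items (here, 4).
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;

    exampleKernel<<<blocksPerGrid, threadsPerBlock>>>(n);
    cudaDeviceSynchronize();    // wait for the whole grid to finish

    printf("launched %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);
    return 0;
}
```

Note that nothing in this code names a particular GPU: the same launch runs unchanged whether the device has two SMs or two hundred, which is exactly the separation of intent and execution described above.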


Why CUDA Is Necessary: A Concrete Comparison with C++

Consider the trivial vector addition:

C = A + B

In standard C++, this is typically implemented as:

for (int i = 0; i < n; i++) { C[i] = A[i] + B[i]; }

This loop is fundamentally sequential. Even when compilers apply vectorization, the execution model assumes a small number of powerful cores executing instructions over time.

This clashes with GPU architecture, where thousands of lightweight cores are designed to execute the same instruction simultaneously across different data elements. The CUDA equivalent expresses the operation differently:

__global__ void addVector(float* A, float* B, float* C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        C[i] = A[i] + B[i];
    }
}

Here, parallelism is explicit. Each thread handles one element, and the CUDA runtime maps threads across Streaming Multiprocessors (SMs). The result is not simply faster execution; it is execution aligned with the physical execution model of the GPU.

CUDA does not accelerate code by abstraction alone; it enables programmers to describe work in a form that GPUs are architecturally designed to execute efficiently.
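For completeness, here is a hedged sketch of the host-side code that would drive the addVector kernel above, assuming it is compiled in the same translation unit (error checking is omitted for brevity; real code would check every CUDA call's return value):

```cuda
// Host-side driver for the addVector kernel shown above (a sketch).
void runAddVector(const float* hostA, const float* hostB, float* hostC, int n) {
    size_t bytes = n * sizeof(float);
    float *A, *B, *C;

    // Allocate device memory and copy inputs host -> device.
    cudaMalloc(&A, bytes);
    cudaMalloc(&B, bytes);
    cudaMalloc(&C, bytes);
    cudaMemcpy(A, hostA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(B, hostB, bytes, cudaMemcpyHostToDevice);

    // One thread per element; enough blocks to cover all n elements.
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    addVector<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, n);

    // Copy the result back and release device memory.
    cudaMemcpy(hostC, C, bytes, cudaMemcpyDeviceToHost);
    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
}
```

The explicit host-to-device and device-to-host copies are worth noticing here; they are precisely the staging steps that the unified addressing and interconnect features discussed later exist to reduce.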

Tensor Optimization: Throughput/Stability Balancing

Now that a fundamental understanding of CUDA’s operation has been established, let's investigate its use in modern architecture.

The primary operation in modern GPUs, and in LLM workloads in particular, is matrix multiplication. CUDA cores cannot perform matrix multiplication on their own; instead, CUDA enables the operation to be offloaded to tensor cores. This is achieved through a special API called WMMA, or Warp Matrix Multiply and Accumulate. Fundamentally, WMMA allows GPU programmers to treat matrices as tiles rather than as individual numbers. This gives rise to the notion of mixed precision, a concept that allows a balancing act between stability and throughput.

The reader may recall from previous articles that the fundamental matrix operation in tensors is: 

D = A ∗ B + C

Where A and B are input matrices and C acts as the accumulator matrix. For throughput maximization, A and B should ideally be 16-bit matrices, while the accumulator should be 32-bit. The problem is that this requires a mixed-precision calculation. Attempting to "force" this calculation without CUDA yields one of two results: either the programmer makes the bit count the same across the accumulator and input matrices and achieves stability at the cost of throughput, or the programmer uses different bit counts and achieves the converse. Neither case is ideal. CUDA, however, allows the programmer to achieve the best of both worlds in such tensor operations, performing mixed-precision operations with both stability and throughput.
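A minimal sketch of what this looks like through the WMMA API, assuming a single warp multiplying one 16×16 tile with half-precision (FP16) inputs and a single-precision (FP32) accumulator, exactly the mixed-precision split described above:

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16x16 tile of D = A * B + C,
// with FP16 inputs and an FP32 accumulator (mixed precision).
__global__ void wmmaTile(const half* A, const half* B,
                         const float* C, float* D) {
    // Fragments: the warp-wide "tile" view of each matrix operand.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> accFrag;

    // Load the tiles; the numeric argument is the leading dimension.
    wmma::load_matrix_sync(aFrag, A, 16);
    wmma::load_matrix_sync(bFrag, B, 16);
    wmma::load_matrix_sync(accFrag, C, 16, wmma::mem_row_major);

    // The tensor-core operation: acc = a * b + acc.
    wmma::mma_sync(accFrag, aFrag, bFrag, accFrag);

    wmma::store_matrix_sync(D, accFrag, 16, wmma::mem_row_major);
}
```

Notice that the code never addresses an individual multiply-add: the whole warp cooperates on the tile, which is what "matrices as tiles rather than individual numbers" means in practice.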

Fabric Scaling, Interconnectivity, and LLMs 

In addition to tensor optimization, CUDA is critical in fabric scaling. LLMs, especially commercial LLMs, cannot run on single standalone GPUs. Even the most powerful modern GPUs lack the processing power to support the multitude of simultaneous operations LLMs demand. When one GPU isn't enough, the solution is simple: add more GPUs. However, doing so isn't as simple as slapping two GPUs together and calling it a day. If one knows anything about GPUs, it's clear this won't work, because much of GPU optimization comes from synchronizing workloads across a GPU's cores, and for that, communication is essential.

Engineers at large graphics card companies have come up with some clever solutions to the problem of making multiple GPUs "talk", for instance NVIDIA's NVLink and NVSwitch, which act as a direct GPU-GPU interconnect and a literal switch connecting multiple such interconnects, respectively (Figure 2).

Figure 2: NVLink/NVSwitch: a simple graphic illustrating the basic difference between NVLink and NVSwitch. Here, NVLink connects the two GPUs, while NVSwitch acts as the enabler between GPUs. In a more complex system, NVSwitch would have multiple NVLink interconnects “wired” through it, acting as a master switch between them.

The key to either system is NVIDIA's UVA, or Unified Virtual Addressing. UVA allocates a single virtual address space spanning all GPU memories in a GPU network, allowing programmers to use identical syntax to access data on any GPU connected via NVLink. UVA is, of course, only possible through CUDA, as it is an addressing system that operates at the hardware-software interface. Furthermore, CUDA enables a communication mechanism known as IPC, or inter-process communication, which allows different operating system processes to use the same GPU memory pointers. This eliminates the need for data to be copied from the source GPU through host memory and then out to the target GPU, a tedious round trip that would result in idle time and a tremendous loss of efficiency.
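As a hedged sketch of what direct GPU-to-GPU communication looks like from the programmer's side (the device numbering and buffer size are illustrative), CUDA's peer access API lets one device copy from another's memory over NVLink or PCIe without staging through the host:

```cuda
#include <cstddef>

// Sketch: copy a buffer directly from GPU 0 to GPU 1.
// With peer access enabled, the transfer travels over NVLink (or PCIe)
// rather than bouncing through host memory.
void peerCopy(size_t bytes) {
    float *src, *dst;

    cudaSetDevice(0);
    cudaMalloc(&src, bytes);           // buffer on GPU 0

    cudaSetDevice(1);
    cudaMalloc(&dst, bytes);           // buffer on GPU 1
    cudaDeviceEnablePeerAccess(0, 0);  // let GPU 1 access GPU 0's memory

    // Direct device-to-device copy: dst lives on device 1, src on device 0.
    cudaMemcpyPeer(dst, 1, src, 0, bytes);

    cudaFree(dst);
    cudaSetDevice(0);
    cudaFree(src);
}
```

Because UVA gives every allocation a unique address across the whole network, the runtime can tell from the pointers alone which device owns which buffer; that is what makes the "identical syntax" described above possible.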

Whether it be the optimization of simple tasks within GPUs, the facilitation of tensor-based matrix operations, or the interconnectivity of the vast GPU networks behind LLMs, CUDA is the critical magic that lets modern GPUs and GPU systems work. CUDA isn’t just an optimization; it is an indispensable cornerstone of modern GPU programming. Without it, GPU programming would be dominated by tedium and idle time, making the modern miracles of AI and LLMs nothing more than science fiction.

I hope this article was informative and thank you as always for reading!