GPU Execution for ML
An overview of how GPUs run ML workloads in practice, with attention to occupancy, latency hiding, memory bottlenecks, and why throughput depends more on data movement than FLOPs.
There is no shortage of amazing material on GPUs. CUDA tutorials explain writing kernels, and architecture texts explain how GPUs work at different layers. What's less developed is material that treats ML systems as an interactive hardware-software problem, where model structure, data layout, and execution are considered alongside hardware constraints.
Because that connection is often implicit, optimization techniques are frequently applied without a clear understanding of when they are appropriate or why we use them in the first place. Their dependencies on factors like bandwidth, scheduling, and access patterns are easy to miss from the software layer alone.
This series intends to look at how GPUs execute ML workloads and how those execution details should inform software and system design.
Why CPUs Break at Scale for ML
GPUs were initially designed for workloads with large amounts of uniform arithmetic, such as graphics rendering, as the name implies. To understand the motivation for GPUs in ML systems today, it's helpful to briefly discuss CPUs.
CPUs are optimized around instruction-level parallelism (ILP). Simply put, ILP runs multiple independent instructions at once when they use different units, such as an arithmetic operation alongside a memory access.
Optimization therefore aims to keep the scheduler full and as many units busy as possible, through techniques like deep pipelines, out-of-order execution, and speculative execution. This works well for programs with tight dependency chains and complex control flow.
Most traditional workloads work well with ILP, but ML workloads do not.

ML Workloads are Data Parallel
Training and inference are dominated by matrix multiplications and operations that apply the same computation across large tensors.
Traditional CPU optimizations bring diminishing returns here: there are fewer independent instructions per thread to exploit, and branch prediction adds little value to simple, repetitive control flow.
Most modern models reduce to a small set of linear algebra primitives:
- linear layers = matrix multiplications
- attention = batched matrix multiplications
- convolutions = structured matrix ops
These primitives have high arithmetic intensity and regular memory access patterns. GPUs are built to exploit exactly this combination. They trade sophisticated control logic for wide execution units and the ability to run many ops concurrently.
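To make the reduction concrete, here is a minimal pure-Python sketch (no GPU libraries) showing that attention scores are just a matrix multiplication of queries against transposed keys. The tiny Q and K values are made up for illustration; real kernels run this on tensor cores at scale.

```python
# Attention scores reduce to a matrix multiplication: S = Q @ K^T.

def matmul(a, b):
    """Multiply an (m x k) matrix by a (k x n) matrix, both nested lists."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

def transpose(m):
    return [list(row) for row in zip(*m)]

# Toy Q and K for a sequence of 2 tokens with head dimension 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 2.0], [3.0, 4.0]]

scores = matmul(Q, transpose(K))  # scores[i][j] = dot(Q[i], K[j])
print(scores)  # [[1.0, 3.0], [2.0, 4.0]]
```

Linear layers and convolutions (after im2col-style lowering) reduce to the same primitive, which is why one hardware design serves all three.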
A modern CPU might sustain 100-500 GFLOP/s on dense linear algebra. A single GPU sustains 50-100 TFLOP/s. That's a 100x difference in raw throughput per chip.
While these numbers depend heavily on precision (FP32 vs FP16/BF16) and whether tensor cores are used, the order-of-magnitude gap remains. GPUs are not just faster at arithmetic—they are designed to sustain that arithmetic under extreme parallelism.
Hopefully you now understand why GPUs outperform CPUs so dramatically for these workloads.
The hardware matches the shape of the computation.
GPU Execution Model
GPU performance for ML can be introduced with three main concepts: streaming multiprocessors (SMs), warps, and SIMT execution. We'll briefly introduce these to aid in later sections.
SMs
A GPU is organized around SMs, which are often compared to CPU cores in the sense that they are able to execute computations and use designated local memory. Each SM contains its own arithmetic units, schedulers, and register files.
Warps
A warp is a group of threads (typically 32 threads on NVIDIA GPUs). All threads in a warp operate in lockstep, meaning they execute the same instruction at the same time on different data.
This maps extremely well to ML training and inference, which can be decomposed into billions of identical multiply-accumulate operations. Each thread in a warp handles one element of a matrix or activation, so a single instruction computes 32 independent pieces of the same matrix operation in parallel.
It is equally important to understand what happens when threads diverge. Imagine an if/else branch where different threads want to take different paths depending on their data. The GPU cannot execute both paths simultaneously for a single warp. Instead, it masks (temporarily disables) the threads that don't take the first path while executing it, then executes the second path the same way. Only after both paths finish does the warp converge. For a simple if/else branch, the worst case cuts your throughput in half. This is why branch-heavy workloads do poorly on GPUs.
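A toy cost model makes the divergence penalty concrete. This is an illustrative sketch, not a hardware simulation: the cycle counts are invented, and real GPUs add reconvergence machinery on top.

```python
# Warp divergence sketch: under SIMT, a warp serializes every branch
# path taken by at least one of its threads, so execution time is the
# SUM of path costs, not the max. Cycle counts below are illustrative.

def warp_branch_cycles(path_cycles, paths_taken):
    """Cycles for one warp: sum the cost of each distinct path taken."""
    return sum(path_cycles[p] for p in set(paths_taken))

path_cycles = {"if": 100, "else": 100}

uniform = ["if"] * 32                     # all 32 threads agree
divergent = ["if"] * 16 + ["else"] * 16   # warp splits down the middle

print(warp_branch_cycles(path_cycles, uniform))    # 100
print(warp_branch_cycles(path_cycles, divergent))  # 200 -> half throughput
```

Note the divergent warp pays the full cost of both paths even though only half its threads are active in each.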

SIMT Execution
SIMT refers to Single Instruction, Multiple Threads execution. The same instruction is issued to all threads in a warp. Individual threads can read and write different data, but control flow is synchronized (i.e. all threads execute the same step at the same time). Latency is tolerated by switching between warps when one stalls (e.g. waiting on memory).
SIMT makes GPU performance predictable for ML workloads in the sense that operations have minimal divergence and memory access patterns can be coalesced.
NVIDIA's Hopper-architecture GH100 die has 144 SMs (132 of them enabled on the H100 SXM5 part), with each SM containing 128 FP32 CUDA cores.
Scheduling, Occupancy, and Latency Hiding
Up to this point, we've described the static execution model: SMs execute warps, warps execute in lockstep, and SIMT makes uniform math fast while punishing divergence. But this alone doesn't explain why GPUs stay so close to peak throughput in practice, even when individual operations have long memory latencies or uneven resource usage. The missing piece is not just how instructions are executed, but how they are scheduled.

Unlike CPUs, which reduce latency for a single thread, GPUs assume latency is unavoidable and instead hide it by running many warps at once. When one warp stalls on memory, the SM immediately switches to another ready warp. If enough warps are resident, computation continues uninterrupted and the arithmetic units remain busy.

Performance therefore depends less on the math itself and more on resident warps per SM, how registers and shared memory limit that number, and how effectively the scheduler can swap between warps to cover stalls. Model shape, batch size, and data layout directly determine occupancy and scheduling efficiency.
Occupancy
Occupancy refers to the number of warps actively resident on an SM. Higher occupancy lets the GPU hide latency better, because there are more warps to switch between when one stalls.
While high occupancy sounds like it should correlate with high performance, this is an incomplete picture. Occupancy is limited by finite resources: registers, shared memory, and threads per block. Each warp consumes registers and shared memory, so if a kernel requires too much of either, fewer warps can be scheduled per SM, underutilizing the GPU. Occupancy introduces the first of many optimization tradeoffs.
To make this concrete, consider an NVIDIA H100 SM with 65,536 32-bit registers. If a kernel uses 128 registers per thread and launches blocks of 256 threads, each block consumes 32,768 registers. That limits the SM to 2 resident blocks, or 16 warps total. If the kernel instead used 64 registers per thread, the same SM could hold twice as many warps, improving its ability to hide memory latency.
This tradeoff appears constantly in kernel design: more registers reduce re-computation and spills, but too many reduce occupancy and stall the machine.
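This arithmetic can be sketched as a small occupancy calculator. The 65,536-registers-per-SM figure and the 64-warp cap match NVIDIA's published H100 specs, but the model deliberately ignores other limits (shared memory, block-count caps), so treat it as a simplification.

```python
# Register-limited occupancy sketch for a single H100 SM.

REGISTERS_PER_SM = 65536   # 32-bit registers per SM (published spec)
MAX_WARPS_PER_SM = 64      # hardware cap on resident warps
THREADS_PER_WARP = 32

def resident_warps(regs_per_thread, threads_per_block):
    """Warps that fit on one SM when registers are the binding limit."""
    regs_per_block = regs_per_thread * threads_per_block
    blocks = REGISTERS_PER_SM // regs_per_block
    warps = blocks * (threads_per_block // THREADS_PER_WARP)
    return min(warps, MAX_WARPS_PER_SM)

print(resident_warps(128, 256))  # 16 warps: register-limited
print(resident_warps(64, 256))   # 32 warps: twice the latency hiding
```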
Kernel design is also shaped by inference latency. Memory coalescing refers to ensuring that threads in a warp access adjacent memory at the same time, minimizing the number of memory transactions that need to take place. The way data is arranged in memory affects how efficiently warps can access it. Optimizing memory access patterns to align with the GPU's hierarchy (L1, L2, global) reduces latency and increases throughput. Large batch sizes help hide memory latency by amortizing overhead, but batch size must again be balanced against SM resources.
Understanding the Bottleneck
GPUs have a structured, multi-level memory hierarchy to balance bandwidth and latency. To understand bottlenecks, it's important to understand this arrangement:
- Registers: The fastest memory, private to each thread. Register space is limited; if a kernel needs more, values spill to slower levels of the hierarchy.
- Shared Memory: Shared between threads in the same block. Because shared memory is visible to all threads in a block, it becomes the natural point for synchronizing threads.
- L1/L2 Cache: L1 sits close to each SM, while L2 is larger and shared across SMs. Caching reduces fetches from global memory.
- High Bandwidth Memory (HBM): HBM is the slowest level in the GPU memory hierarchy, but provides bandwidth necessary to handle large datasets.

Bandwidth refers to how much data can be transferred per unit time (GB/s), while latency refers to the delay between issuing a memory request and receiving the data. HBM, for example, has high bandwidth but relatively poor latency.
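A rough model of the distinction: total transfer time is approximately a fixed latency plus bytes divided by bandwidth. The numbers below are illustrative, not measured from any specific part.

```python
# Toy latency-vs-bandwidth model: small requests pay mostly latency,
# large requests pay mostly bandwidth. Figures are illustrative.

def transfer_time_us(bytes_moved, latency_us, bandwidth_gb_s):
    # 1 GB/s = 1e3 bytes per microsecond.
    return latency_us + bytes_moved / (bandwidth_gb_s * 1e3)

# One 4-byte scalar: the fixed latency dominates.
small = transfer_time_us(4, latency_us=0.5, bandwidth_gb_s=3000)
# A 256 MiB tile: latency is amortized, bandwidth dominates.
large = transfer_time_us(256 * 1024 * 1024, latency_us=0.5, bandwidth_gb_s=3000)

print(round(small, 4))  # 0.5 -- essentially all latency
print(round(large, 1))  # 90.0 -- essentially all bandwidth
```

This is why GPUs batch and coalesce aggressively: big, contiguous transfers amortize latency; tiny scattered ones do not.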
Now, we can explain how GPUs are bottlenecked.
Simply increasing the FLOPs of a GPU doesn't always improve performance for ML tasks. If the memory hierarchy isn't optimized for the data being processed, the GPU waits for data, leaving the additional compute idle.
The real bottleneck lies in memory access patterns and the GPU's ability to fetch and use data efficiently.
Memory Coalescing and Why Tensor Cores Need it
Memory coalescing is the process of grouping memory accesses together so that multiple threads access memory in a single transaction. Non-coalesced access forces multiple memory transactions, reducing throughput.
Threads in a warp must access consecutive memory locations for the memory controller to combine their accesses into a single transaction. If threads access non-contiguous data, the controller must issue multiple transactions, slowing things down. Stride refers to the distance between consecutive memory accesses in a sequence. If the stride is large (i.e. threads access memory locations that are far apart), coalescing becomes difficult, so small strides and aligned accesses are crucial for high performance.
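A small sketch of this: count how many 128-byte transactions a warp of 32 threads needs when each thread loads one 4-byte element at a given stride. The 128-byte transaction size is a typical NVIDIA figure; real memory controllers have more subtlety than this model.

```python
# Count distinct 128-byte segments touched by a warp's loads.

def transactions(stride_elems, threads=32, elem_bytes=4, txn_bytes=128):
    """Memory transactions for one warp, each thread loading one element
    at the given stride (in elements)."""
    addresses = [t * stride_elems * elem_bytes for t in range(threads)]
    return len({addr // txn_bytes for addr in addresses})

print(transactions(1))   # 1  -> fully coalesced: 32 x 4B fills one transaction
print(transactions(2))   # 2  -> half the fetched bytes are wasted
print(transactions(32))  # 32 -> worst case: one transaction per thread
```

The worst case moves 32x the data for the same useful work, which is exactly the kind of hidden bandwidth cost that dominates ML kernels.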
While the computational intensity of a task matters, the layout of data in memory determines how efficiently warps can access it. This is where quadratic attention in modern ML systems comes in. Take self-attention in transformers, the core of many modern ML tasks. Self-attention is O(n²) in sequence length in both time and memory, so the resources required grow quickly as sequences get longer.

Consider a model with a context size of 32k tokens. The attention score matrix requires 32k x 32k ≈ 1 billion elements, which at FP16 is about 2 GB of memory just for storing attention scores.
The key issue is not the size of the matrix alone, but reuse. Attention scores are read multiple times across softmax, masking, and value projection. If those reads miss cache, the same data is streamed repeatedly from HBM, turning attention into a bandwidth-bound operation regardless of available FLOPs. This saturates the GPU's HBM and creates a significant memory bottleneck, not because the matrix multiplication is slow, but because data has to cross HBM several times. The GPU's compute is largely wasted waiting on memory.
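The footprint is easy to verify with back-of-envelope arithmetic:

```python
# Attention-score footprint at 32k context, FP16.

seq_len = 32 * 1024           # 32,768 tokens
elements = seq_len ** 2       # one score per (query, key) pair
bytes_fp16 = elements * 2     # 2 bytes per FP16 element

print(f"{elements:,} elements")         # 1,073,741,824
print(f"{bytes_fp16 / 2**30:.0f} GiB")  # 2 GiB -- per head, per layer
```

And that 2 GiB is per attention head per layer if scores are materialized naively, which is why fused kernels avoid writing the full matrix to HBM at all.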
In attention models, if the data for these operations is not stored in a contiguous, coalesced fashion, the cores cannot utilize the full bandwidth of the memory system. Optimizing data locality and stride patterns in attention matrices is central to reducing memory traffic and improving tensor core efficiency.
Tensor Cores for Modern ML Problems
Tensor cores are designed to accelerate matrix multiply-accumulate (MMA) operations. They perform many multiply-accumulates per clock cycle, executing the same operation across many pieces of data at once. Instead of FP32, tensor cores favor lower-precision formats like FP16, BF16, or INT8, allowing them to process more data in each clock cycle without sacrificing too much model accuracy.
Operating at lower precision presents both an opportunity and a tradeoff. While lower precision speeds up computation and reduces memory usage, it risks a loss of accuracy, especially if a model relies on high-precision arithmetic. FP16 is often used for deep learning training, especially in NLP, as it offers a good tradeoff between speed and accuracy. BF16 keeps FP32's 8-bit exponent but shrinks the mantissa to 7 bits, preserving FP32's large dynamic range and preventing numerical instability during gradient computation, at the cost of less mantissa precision than FP16. INT8 is often used for inference; its reduced size allows larger batch sizes, but the accuracy loss can be large.
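The range difference is easy to demonstrate in pure Python: struct's "e" format is IEEE half precision (FP16), and BF16 can be emulated by truncating a float32 to its top 16 bits. The value 70,000 is an arbitrary example chosen only because it exceeds FP16's maximum of 65,504.

```python
# FP16 (5-bit exponent) overflows just past 65,504, while BF16 keeps
# FP32's 8-bit exponent at the cost of mantissa bits.
import struct

def to_bf16(x):
    """Emulate BF16 by zeroing the low 16 bits of a float32."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

x = 70000.0

try:
    struct.pack("e", x)  # "e" is IEEE binary16 (FP16)
    fp16_ok = True
except OverflowError:
    fp16_ok = False

print(fp16_ok)     # False: 70,000 exceeds FP16's max of 65,504
print(to_bf16(x))  # 69632.0 -- representable, just coarser
```

The same tradeoff appears in training: un-scaled gradients can overflow FP16 (hence loss scaling), while BF16 absorbs them with only a precision cost.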
Quantization, in this context, is the process of converting higher-precision values into lower-precision ones, reducing overall model size and improving throughput by increasing the number of operations a tensor core can perform.
The main challenge with quantization is ensuring that precision loss doesn't compromise the model's performance. For example in NLP tasks like transformer inference, small errors introduced by low precision arithmetic can accumulate across layers and lead to degradation in accuracy. This is where quantization-aware training (QAT) comes in, where the model is trained while accounting for effects of quantization.
Crucially, tensor cores do not eliminate memory bottlenecks. They amplify them. As compute becomes cheaper per operation, the cost of moving data dominates even faster, making layout, tiling, and reuse more important than raw arithmetic throughput.
Hardware Approach to Quantization
In production, quantization is often the first step toward reducing computational cost. However, quantization isn't one-size-fits-all. The hardware itself determines the precision at which it performs best.
For example, consider tensor cores and INT8.

Tensor cores, as mentioned, are optimized for low-precision arithmetic. When operating in FP16 or BF16, they use fewer resources per operation than FP32, increasing throughput. INT8 goes further. FP16 and BF16 are still relatively expensive: they need exponent handling, normalization, rounding logic, and more transistors per multiply. INT8 multipliers are dramatically simpler, with no exponent, fixed scaling, much smaller circuits, and lower power per operation. Because INT8 operands are 8 bits wide, more of them fit in the same register file, more data fits in shared memory, and more multipliers can be packed into the same silicon area, so multiple INT8 MMAs can run in the space and time of one FP16 MMA. INT8 also cuts bandwidth requirements in half versus FP16, so tensor cores spend less time waiting for data and more time computing, directly addressing the bottleneck.

Accumulation is still often done in FP32 even when the multiplies are INT8, because summing many products requires more precision and range than 8 bits provide.
However, INT8 quantization is not always a straightforward win. Quantization error must be managed to meet accuracy requirements. Post-training quantization addresses this by mapping floating-point values into integer space while minimizing accuracy loss, often applying per-layer quantization so that each layer's precision is reduced independently rather than with one static quantization level.
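A minimal sketch of symmetric per-tensor post-training quantization, assuming a simple absolute-max scale; production schemes add per-channel scales, calibration data, and outlier handling.

```python
# Symmetric INT8 post-training quantization: map floats to [-127, 127]
# with one per-tensor scale, then dequantize and inspect the error.

def quantize_int8(values):
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.02, -0.51, 1.27, -1.0, 0.33]  # toy weight values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q)                  # [2, -51, 127, -100, 33]
print(max_err <= scale / 2 + 1e-12)  # True: rounding bounds the error
```

The worst-case round-trip error is half the scale, which is why a single large outlier (inflating the scale) degrades every other value's precision.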
Inference vs Training
Training typically operates with larger batch sizes and handles long-range dependencies over full sequences, while inference prioritizes latency over throughput due to real-time performance requirements.
In production, requests can arrive with different sequence lengths, or even as individual samples. GPUs excel at processing large batches but become inefficient with smaller, variable-sized batches due to underutilization of resources.

While training is often run on longer sequences of data, inference must deliver rapid responses. For tasks like NLP, where one token predicts the next, inference must be efficient even in single-token-batch scenarios. These tasks are sensitive to tail latency, which can be severely impacted by inefficient memory access or underutilized hardware during small-batch runs.

In training, kernels are typically larger and have more predictable execution patterns. Inference kernels, especially for large models like transformers, often launch many smaller operations, adding more overhead per request. The extra time required to launch and manage multiple kernels can add significant latency.
A central challenge for transformer-based models during inference is the key-value (KV) cache. As transformers process long sequences, they store key-value pairs in memory to preserve attention information across layers. This cache grows linearly with context length but quickly becomes large, and memory access slows as the model fetches and stores data across multiple levels of the memory hierarchy.
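A quick calculator makes the linear growth concrete. The model dimensions are illustrative (roughly 7B-class), not any specific model's published configuration:

```python
# KV-cache footprint: keys AND values cached at every layer, so the
# total scales linearly with context length.

def kv_cache_bytes(seq_len, layers, heads, head_dim,
                   bytes_per_elem=2, batch=1):
    # Factor of 2: one tensor for keys, one for values, per layer.
    return 2 * batch * layers * heads * head_dim * seq_len * bytes_per_elem

layers, heads, head_dim = 32, 32, 128  # illustrative 7B-class shape
for n in (2048, 8192, 32768):
    gib = kv_cache_bytes(n, layers, heads, head_dim) / 2**30
    print(f"{n:>6} tokens -> {gib:.1f} GiB")
```

At FP16 this toy configuration already needs 16 GiB of cache per request at 32k context, before weights or activations, which is why long-context serving is memory-capacity-bound long before it is compute-bound.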

The prefill phase (where the entire context is processed before token generation) requires the full context to be present in memory. Once this phase completes, the decode phase (where tokens are generated one-by-one) can experience underutilization of available resources, as the GPU often doesn't use all warps for a single token prediction.
At this point, a pattern should be clear: GPU performance for ML is not determined by FLOPs alone, but by how well model structure, data layout, and execution align with the GPU's execution and memory model. Training workloads tend to cooperate with this design: large batches, regular access patterns, and predictable kernels keep SMs busy and memory bandwidth saturated. Inference does not. Variable batch sizes, growing KV caches, small kernel launches, and token-by-token execution all stress the exact assumptions GPUs rely on to stay efficient. Hardware that excels at training begins to fracture under real-time inference demands. Understanding why inference breaks these assumptions is the focus of the next part of this series.
Thanks for reading!