CUDA: Why Modern GPUs and LLMs Depend on It
This article examines CUDA as the execution contract between software and GPU hardware.
gpu cuda hardware
Tensor cores deliver extreme compute density, but their throughput is bounded by scheduling, tiling, and memory movement long before peak FLOPs are reached. This article examines tensor cores as fixed hardware constraints.
An overview of how GPUs run ML workloads in practice, with attention to occupancy, latency hiding, memory bottlenecks, and why throughput depends more on data movement than FLOPs.