📺 Watch the video version: ThinkSmart.Life/youtube
Who this is for: AI engineers, local AI enthusiasts, and technically curious readers who want to understand how GPUs work at a deeper level — not just "GPUs are fast," but why they're fast, how they're organized, what the bottlenecks are, and why this matters for running LLMs.

Introduction: The GPU Is No Longer a Black Box

For most of the last decade, a GPU was a magic box you threw compute at. You installed CUDA, called model.to("cuda"), and watched your training loop run 40× faster than on CPU. The internals were someone else's problem.

That era is over. If you're running LLMs today — whether on a cloud H100, a local RTX 3090 rig, or a Mac Studio — the specific architecture of your GPU determines which models you can run, how fast tokens generate, whether you can serve multiple users simultaneously, and how much quantization degrades quality. Every optimization decision, from choosing between FP16 and INT4 to understanding why FlashAttention matters, connects back to hardware realities.

This article is a ground-up technical breakdown of how NVIDIA GPUs work, why memory bandwidth is the bottleneck for LLM inference, and how to reason about hardware when choosing or configuring a local AI system. We'll go from silicon organization to warp scheduling to tensor cores to practical GPU selection — no hand-waving allowed.

  • 16,896 — CUDA cores in the H100 SXM5
  • 4.8 TB/s — H200 memory bandwidth
  • 32 — threads per warp (always)
  • 128 — CUDA cores per SM (Blackwell)

The GPU Philosophy: Parallelism Over Latency

To understand GPUs, you first need to understand the design choice they're built around: throughput over latency. This is the exact opposite of what CPUs optimize for.

A modern CPU core has enormous complexity dedicated to running a single thread as fast as possible: deep out-of-order execution pipelines, large branch predictors, massive caches, speculative execution. A CPU core can execute one instruction in a fraction of a nanosecond and handle arbitrary branching logic. It's a virtuoso soloist.

A GPU core — a CUDA core — is none of those things. It's a simple arithmetic unit. It executes instructions in order. It has minimal cache. It can't speculate. Individually, it's slow. But a modern GPU has thousands of them, all running simultaneously. It's a massive ensemble playing in unison.

Why Parallelism Wins for Matrix Math

Deep learning is, at its core, matrix multiplication. An attention layer computes Q × Kᵀ. A feed-forward layer computes X × W₁. These operations are fundamentally parallel: every row of the output matrix is independent of every other row. There are no data dependencies between them. A matrix multiply of shape [1024 × 4096] × [4096 × 4096] involves 1024 × 4096 ≈ 4 million independent dot products.

Output[i,j] = Σ A[i,k] × B[k,j] for all (i,j) independently

A CPU with 16 cores must serialize these 4 million operations. A GPU with 10,000 CUDA cores can run 10,000 of them simultaneously. The GPU is ~625× faster not because each core is faster, but because it has so many more cores doing parallel work.
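The independence claim above can be sketched in a few lines of plain Python: each output element is its own dot product, so any of them could run on its own thread (a toy illustration, not GPU code).

```python
# Illustrative sketch: every output element of C = A x B is an independent
# dot product, so a GPU can assign one thread per (i, j) pair.
def matmul_element(A, B, i, j):
    """Compute C[i][j] with no dependence on any other output element."""
    return sum(A[i][k] * B[k][j] for k in range(len(B)))

def matmul(A, B):
    rows, cols = len(A), len(B[0])
    # Each (i, j) below could run on its own GPU thread; order is irrelevant.
    return [[matmul_element(A, B, i, j) for j in range(cols)] for i in range(rows)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
```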

The catch: GPU parallelism only works when your computation has minimal branching, large matrix operands, and data locality. Irregular workloads — like tree traversals, graph algorithms, or sequential token generation — are where CPUs fight back.

Inside the GPU: The Compute Hierarchy

NVIDIA GPUs aren't a flat pool of CUDA cores. They're organized in a strict hierarchy, and understanding it is essential for reasoning about performance.

GPU Die → GPC (Graphics Processing Cluster) → TPC (Texture Processing Cluster) → SM (Streaming Multiprocessor) → CUDA Cores + Tensor Cores

The Streaming Multiprocessor (SM): The True Unit of Compute

The SM is the fundamental building block of a GPU. Think of it as a mini-processor with its own scheduler, register file, shared memory, and execution units. A modern data-center GPU has dozens to hundreds of SMs. The H100 SXM5 has 132 SMs; the RTX 4090 has 128 SMs; the Blackwell B200 has 192 SMs.

Inside each SM on the Hopper H100, you'll find:

  • 128 CUDA cores (FP32 units) โ€” handle general arithmetic
  • 4 Tensor Cores (4th generation) โ€” handle matrix multiply-accumulate
  • 4 SFUs (Special Function Units) โ€” handle transcendentals like sin, cos, exp, reciprocal sqrt
  • 32 Load/Store Units โ€” handle memory access
  • Register File: 256KB โ€” the fastest storage on the GPU, private to each SM
  • Shared Memory / L1 cache: 228KB (configurable) โ€” shared across all threads in a thread block on this SM

On Blackwell, NVIDIA doubled compute density and restructured the SM internals: 128 CUDA cores per SM and 4 5th-generation Tensor Cores. (Consumer Blackwell — the RTX 50-series — also carries RT cores per SM, but those exist for ray tracing and are irrelevant for ML.)

Threads, Warps, Blocks, and Grids: The CUDA Execution Model

When you launch a CUDA kernel, you specify a grid of blocks, each containing threads. This maps directly onto the hardware hierarchy:

  • Thread → one CUDA core (one execution lane). The smallest unit of execution; has its own registers.
  • Warp (32 threads) → the SM's scheduling unit. All 32 threads execute the same instruction simultaneously.
  • Block → runs entirely on one SM. Shares the SM's shared memory; threads can sync via __syncthreads().
  • Grid → distributed across all SMs. Blocks are dispatched to available SMs; the GPU handles load balancing.
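This mapping can be mimicked in plain Python: a toy 1-D "launcher" that derives each thread's global index the way a real kernel computes blockIdx.x * blockDim.x + threadIdx.x (the launcher and the saxpy example are illustrative, not a real CUDA API).

```python
# Plain-Python sketch of the CUDA index math: every thread computes a unique
# global index from its block and thread coordinates (the 1-D version of
# blockIdx.x * blockDim.x + threadIdx.x in a real kernel).
def launch_kernel_1d(grid_dim, block_dim, kernel, *args):
    for block_idx in range(grid_dim):        # blocks land on available SMs
        for thread_idx in range(block_dim):  # threads within one block/SM
            gid = block_idx * block_dim + thread_idx
            kernel(gid, *args)

def saxpy(gid, a, x, y, out):
    if gid < len(x):                         # guard: more threads than elements
        out[gid] = a * x[gid] + y[gid]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0] * 5
out = [0.0] * 5
launch_kernel_1d(2, 4, saxpy, 2.0, x, y, out)  # 8 threads cover 5 elements
print(out)  # [12.0, 14.0, 16.0, 18.0, 20.0]
```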

SIMT: Not Quite SIMD

GPUs use a model called SIMT (Single Instruction, Multiple Threads). At any clock cycle, all 32 threads in a warp execute the same instruction — this looks like SIMD. But unlike SIMD, individual threads can have different register state and can take different branches. When threads diverge (e.g., some threads hit if (x > 0) and others don't), the GPU serializes: it runs the threads that took the branch first, masking the others inactive, then runs the other branch. This warp divergence is a major performance hazard in GPU code.
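A toy cycle-count model (not a hardware simulation) makes the serialization cost concrete: a diverged warp pays for both branch paths.

```python
# Toy SIMT model: a warp spends cycles on each distinct branch path that any
# lane takes, with non-participating lanes masked off. Divergence therefore
# serializes the paths: total cycles ~ sum of taken path lengths.
def warp_cycles(lane_values, then_len, else_len):
    took_then = any(v > 0 for v in lane_values)   # e.g. the if (x > 0) body
    took_else = any(v <= 0 for v in lane_values)  # the other path
    return then_len * took_then + else_len * took_else

uniform  = warp_cycles([1] * 32, then_len=10, else_len=10)   # no divergence
diverged = warp_cycles([1, -1] * 16, then_len=10, else_len=10)
print(uniform, diverged)  # 10 20 -> divergence doubles the cost here
```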

[Figure: warp execution — 32 threads (T0–T31) issue the same instruction each cycle. Top: all lanes active, no divergence. Bottom: a diverged warp with some lanes masked inactive.]

Warp Scheduling and Latency Hiding

Here's the GPU's killer feature for handling memory latency: when a warp issues a memory load (which takes hundreds of cycles to come back from VRAM), the warp scheduler doesn't stall the whole SM. It switches to another warp that's ready to execute. When the memory eventually returns, the original warp becomes eligible again and gets scheduled.

This is called latency hiding via warp switching. A Hopper SM can hold up to 64 resident warps (2,048 threads), with four schedulers picking among them every cycle. As long as at least one warp is always ready to execute, the arithmetic units stay busy and memory latency becomes invisible.
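A back-of-envelope sketch of how many warps it takes to hide a given memory latency. The cycle counts are illustrative assumptions, not datasheet values:

```python
# Latency-hiding model: while one warp waits L cycles on memory, the scheduler
# runs other warps that each have C cycles of ready arithmetic. To keep the
# execution units busy you need roughly 1 + ceil(L / C) warps in flight.
import math

def warps_to_hide(mem_latency_cycles, compute_cycles_per_warp):
    return 1 + math.ceil(mem_latency_cycles / compute_cycles_per_warp)

# Assumed numbers: a ~400-cycle HBM load, 20 cycles of independent math
# per warp between loads.
print(warps_to_hide(400, 20))  # 21 warps keep the SM busy
```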

Warp schedulers use strategies such as the following (the terminology comes from the GPU architecture literature; NVIDIA does not publicly document its exact policies):

  • Greedy Then Oldest (GTO) โ€” stick with the current warp until it stalls, then pick the oldest eligible warp. Minimizes cache thrash.
  • Round-Robin โ€” rotate through warps equally. Maximizes latency hiding but increases cache pressure.
  • Two-Level Scheduling โ€” groups warps into "fetch groups" that share L1 cache lines, balancing hiding and locality.

Occupancy is the measure of how many warps are active per SM simultaneously, relative to the maximum. Higher occupancy = better latency hiding = better throughput. Low occupancy (e.g., few warps per SM) means memory stalls become visible as idle time. This is why kernel tuning — adjusting thread block sizes, reducing register pressure, optimizing shared memory usage — matters so much for GPU performance.
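A minimal occupancy sketch, assuming Hopper-like per-SM limits (64 warps, 65,536 registers, 228KB shared memory; check your GPU's actual limits or NVIDIA's occupancy calculator for real numbers):

```python
# Occupancy model: resident warps per SM are capped by whichever resource
# runs out first -- warp slots, registers, or shared memory. The per-SM
# limits below are assumptions in the spirit of Hopper, not datasheet truth.
def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              max_warps=64, regs_per_sm=65536, smem_per_sm=228 * 1024):
    warps_per_block = threads_per_block // 32
    # How many whole blocks fit under each resource limit?
    by_warps = max_warps // warps_per_block
    by_regs = regs_per_sm // (regs_per_thread * threads_per_block)
    by_smem = smem_per_sm // smem_per_block if smem_per_block else by_warps
    blocks = min(by_warps, by_regs, by_smem)
    return (blocks * warps_per_block) / max_warps

# 256-thread blocks, 64 registers/thread, 48KB shared memory per block:
# registers and shared memory cap us at 4 blocks = 32 warps = 50% occupancy.
print(occupancy(256, 64, 48 * 1024))  # 0.5
```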

Tensor Cores: The Secret Weapon

CUDA cores are general-purpose arithmetic units: they multiply and add FP32 values one at a time. Tensor Cores are something fundamentally different: they perform an entire matrix multiply-accumulate (MMA) operation as a single hardware instruction.

Specifically, a Tensor Core computes:

D = A × B + C

Where A, B, C, and D are small matrices (e.g., 4×4 for older generations, 8×8 or larger for newer). The entire operation — tens to hundreds of individual multiply-accumulate steps — happens in one hardware instruction, not a loop of scalar operations. This is orders of magnitude more efficient than implementing matrix multiply with CUDA cores.
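The MMA primitive can be written out explicitly for one small tile, which is exactly the work the hardware fuses into a single instruction (plain Python, purely illustrative):

```python
# Sketch of the Tensor Core primitive D = A x B + C on one small tile
# (4x4, as in the first generations). Hardware executes the whole tile as
# one MMA instruction; a CUDA-core loop would need 4*4*4 = 64 separate FMAs.
def mma_tile(A, B, C, n=4):
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = C[i][j]                  # start from the accumulator input
            for k in range(n):
                acc += A[i][k] * B[k][j]   # one fused multiply-add per step
            D[i][j] = acc
    return D

I = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
C = [[1.0] * 4 for _ in range(4)]
print(mma_tile(I, I, C))  # identity plus ones: 2.0 on the diagonal, 1.0 elsewhere
```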

Why Transformers Depend on Tensor Cores

The transformer attention mechanism is built from matrix multiplications: Q × Kᵀ to get attention scores, softmax, then multiplying by V. The feed-forward network (FFN) layers are also matrix multiplications. In fact, >90% of the FLOP count in a transformer forward pass is matrix multiply. Tensor Cores handle all of it.

Without Tensor Cores, an H100 would deliver ~67 TFLOPS (FP32 CUDA core throughput). With Tensor Cores at FP16, it delivers ~989 dense TFLOPS — a ~14× increase. INT8 and FP8 roughly double that again to ~1,979 dense TOPS (NVIDIA's headline ~3,958 figure assumes 2:4 structured sparsity).

Key insight: When people say "a GPU has X TFLOPS," they almost always mean Tensor Core throughput, not CUDA core throughput — and sometimes the sparsity-doubled figure. The CUDA core number is often 10–15× lower. Verify which number you're comparing.

Tensor Core Generation History

  • 1st gen — Volta (V100), 2017. Precisions: FP16. First Tensor Cores; introduced the MMA concept.
  • 2nd gen — Turing (T4, RTX 20xx), 2018. Precisions: FP16, INT8, INT4. INT8/INT4 support for inference workloads.
  • 3rd gen — Ampere (A100, RTX 30xx), 2020. Precisions: FP16, BF16, INT8, TF32. BF16 support; Sparse Tensor Cores (2× speedup on 50%-sparse weights).
  • 4th gen — Hopper (H100), 2022. Precisions: FP8, FP16, BF16, INT8, TF32. FP8 (up to 4× faster than FP16); Transformer Engine auto-selects precision.
  • 5th gen — Blackwell (B100/B200, RTX 50xx), 2024–25. Precisions: FP4, FP6, FP8, FP16, BF16, INT8. FP4 support; 2nd-gen Transformer Engine; NVLink 5.0.

Precision Cascade: Why Lower Bits Matter

Each step down in precision roughly doubles Tensor Core throughput, because you can pack more operands into the same hardware lanes:

  • FP32 (CUDA cores): 67 TFLOPS on H100
  • BF16/FP16 (Tensor Cores): ~989 TFLOPS on H100
  • INT8: ~1979 TOPS on H100
  • FP8: ~3958 TOPS on H100
  • FP4 (Blackwell): ~9500 TOPS on B200

This precision cascade is the hardware reason why quantization (INT4, INT8) matters for inference speed โ€” beyond just fitting models into VRAM, lower precision enables higher throughput on Tensor Cores.

Memory: The Real Bottleneck

Here's the dirty secret of GPU performance for LLM inference: FLOPS don't matter. Memory bandwidth does.

GPU memory is organized in a strict hierarchy, ordered from fastest to slowest:

  • Register file — ~20 TB/s. Per-thread; ~256KB per SM. The fastest storage on the chip; effectively zero latency when the compiler keeps variables in registers.
  • Shared memory / L1 — ~15 TB/s. Shared across all threads in a block on one SM; 48KB–228KB configurable. Used for tiling in matrix ops.
  • L2 cache — ~5 TB/s. Shared across all SMs; ~50MB on H100. The last on-chip cache level before memory traffic goes off-die.
  • HBM / GDDR6X — 0.9–4.8 TB/s. The main GPU memory (VRAM), where model weights and KV cache live. This is the bottleneck for LLM inference.
  • PCIe / CPU RAM — ~60 GB/s. System RAM reached over the PCIe bus; 50–80× slower than VRAM. Crossing this bus kills inference performance.

Arithmetic Intensity: The Key Concept

Arithmetic intensity is the ratio of FLOPs performed to bytes of memory accessed:

Arithmetic Intensity = FLOPs performed / Bytes accessed from DRAM

A workload is compute-bound when its arithmetic intensity is high — it does lots of math per byte loaded. It's memory-bandwidth-bound when its arithmetic intensity is low — it loads data and does little math on it before needing more data.

The threshold between the two depends on the GPU's ratio of compute throughput to memory bandwidth, called the ridge point. For the H100 SXM5:

Ridge Point = 989 TFLOPS / 3.35 TB/s ≈ 295 FLOPs/byte

To saturate H100's FLOPS, a workload needs to perform at least 295 FLOPs per byte of data fetched from HBM. Large batch matrix multiplies easily exceed this. LLM inference during autoregressive decode... does not.
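The roofline logic is two lines of arithmetic; here it is as a sketch using the H100 numbers from the text:

```python
# Roofline model (Williams et al.): attainable throughput is the lesser of
# peak compute and bandwidth x arithmetic intensity. H100 SXM5 numbers from
# the text: ~989 TFLOPS FP16 Tensor Core peak, 3.35 TB/s HBM bandwidth.
PEAK_TFLOPS = 989.0
BANDWIDTH_TBS = 3.35

def attainable_tflops(flops_per_byte):
    return min(PEAK_TFLOPS, BANDWIDTH_TBS * flops_per_byte)

ridge = PEAK_TFLOPS / BANDWIDTH_TBS
print(round(ridge))              # 295 FLOPs/byte
print(attainable_tflops(2))      # 6.7 -> decode-like workload, memory-bound
print(attainable_tflops(500))    # 989.0 -> big GEMM, compute-bound
```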

Why LLM Inference Is Memory-Bandwidth-Bound

During autoregressive token generation, the model generates one token at a time. Each token generation requires a forward pass through all transformer layers. In each layer, the weight matrices (billions of parameters) must be loaded from VRAM into the SM's shared memory and registers. But the computation against those weights is tiny: for batch size 1, you're multiplying a single [1 × d_model] vector against each [d_model × d_ffn] weight matrix. That's d_model × d_ffn multiplications per matrix — very few FLOPs relative to the bytes you had to load.

LLaMA-3 70B: ~70 billion parameters × 2 bytes (BF16) = ~140 GB to load per token. At H100's 3.35 TB/s bandwidth: ~42ms per token ≈ ~24 tokens/sec as a theoretical ceiling (real engines land lower).

This is why the H200 — with 4.8 TB/s memory bandwidth vs the H100's 3.35 TB/s — generates tokens faster even though its FLOP count is similar. More bandwidth → faster weight loading → more tokens/sec.

The inference bottleneck in plain English: Every single token you generate requires loading essentially the entire model from VRAM into the compute units. The speed of that load — not the speed of the math — determines your tokens-per-second. This is why memory bandwidth is the number that matters most for LLM inference benchmarking.
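The same arithmetic gives a quick decode-speed ceiling estimator (an upper bound that ignores KV-cache traffic and kernel overheads; real throughput comes in lower):

```python
# Bandwidth-bound decode ceiling: every generated token streams all weight
# bytes from VRAM once, so tokens/sec <= bandwidth / model_bytes.
def decode_ceiling_tps(n_params, bytes_per_param, bandwidth_bytes_per_s):
    model_bytes = n_params * bytes_per_param
    return bandwidth_bytes_per_s / model_bytes

# 70B model at BF16 on an H100 (3.35 TB/s)
tps = decode_ceiling_tps(70e9, 2, 3.35e12)
print(round(tps, 1))  # 23.9 tokens/sec upper bound

# Same model at 4-bit (~0.5 bytes/param): the ceiling quadruples
print(round(decode_ceiling_tps(70e9, 0.5, 3.35e12), 1))  # 95.7
```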

HBM vs GDDR6X vs Unified Memory

HBM (High Bandwidth Memory) — used in the H100, H200, A100, and MI300X — achieves high bandwidth by stacking memory dies vertically and connecting them to the GPU via a wide interposer with thousands of connections. HBM3e in the H200 achieves 4.8 TB/s. The tradeoff: HBM is expensive and physically limited in capacity (80GB on the H100).

GDDR6X — used in consumer GPUs like the RTX 3090 and RTX 4090 — is cheaper and offers more capacity per dollar, but lower bandwidth (936 GB/s on the RTX 3090, up to 1,008 GB/s on the RTX 4090). GDDR7 in the RTX 5090 reaches 1,792 GB/s, narrowing the gap significantly.

Apple Unified Memory — used in M-series Macs — is architecturally different: the CPU and GPU share the same physical memory pool. There's no PCIe transfer overhead. The bandwidth (~800 GB/s on the M3 Ultra) is lower than H100 HBM, but the 192GB capacity at that bandwidth makes it uniquely capable for running very large models locally.

How LLMs Actually Use the GPU

Armed with hardware knowledge, let's trace exactly how a transformer inference run flows through the GPU.

Prefill vs Decode: Two Different Compute Regimes

LLM inference has two distinct phases with completely different performance characteristics:

Prefill — processing the entire input prompt in one shot. If your prompt is 2,048 tokens long, the GPU processes all 2,048 tokens simultaneously. The attention layer computes Q, K, V projections for all tokens in parallel, then computes the full 2048×2048 attention score matrix. This is compute-intensive: large matrix multiplies, high arithmetic intensity, Tensor Cores fully utilized. Prefill is compute-bound on a well-batched system.

Decode — generating tokens one at a time. Each decode step processes exactly one new token. The attention layer must attend over the full context, but now we're computing scores for one query vector against all cached K,V pairs. The matrix operations collapse to vector-matrix products. Arithmetic intensity plummets. Decode is memory-bandwidth-bound.

Why batching helps decode: If you process 32 users simultaneously (batch size 32), each decode step generates 32 tokens from the same weight load. You amortize the weight-loading cost over 32 results instead of 1. This is why serving systems like vLLM and TensorRT-LLM work hard to maximize decode batch size — it's the key lever for improving GPU utilization during inference.
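A toy throughput model shows the amortization effect. The 40 ms weight-stream time and 1 ms per-sequence cost are made-up illustrative numbers:

```python
# Batching model for decode: one step streams the weights once (weight_load_s)
# but produces `batch` tokens; per-sequence compute adds a small linear cost.
# Per-token cost falls with batch size until compute or KV traffic dominates.
def tokens_per_sec(batch, weight_load_s, per_seq_compute_s):
    step_time = weight_load_s + batch * per_seq_compute_s
    return batch / step_time

# Assume weights take 40 ms to stream and each sequence adds 1 ms of work
for b in (1, 8, 32):
    print(b, round(tokens_per_sec(b, 0.040, 0.001), 1))
```

Batch 1 yields ~24 tokens/sec total; batch 32 yields ~444, an ~18× gain from the same hardware.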

The KV Cache

During decode, each new token needs to attend to all previous tokens. Rather than recompute Key and Value projections for all past tokens at every step, inference engines cache them: the KV cache. Each transformer layer stores K and V tensors of shape [n_kv_heads × seq_len × d_head].

For a LLaMA-3 70B model serving a 32K context window:

  • 80 layers × 8 KV heads × 32K seq_len × 128 d_head × 2 bytes (BF16) × 2 (K+V) ≈ 10.7 GB of KV cache per sequence

The 70B weights alone are ~140GB at BF16 — already more than an H100's 80GB — so the model must be quantized regardless, and every concurrent 32K-context request then adds another ~10.7 GB of KV cache on top. Managing this tradeoff is the core engineering challenge in LLM serving.
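The cache-size arithmetic as a small calculator (the dimensions are the GQA figures quoted for LLaMA-3 70B; the math works out to ~10.7 GB for one 32K sequence):

```python
# Per-sequence KV-cache size: K and V tensors for every layer, using the
# GQA dimensions quoted in the text for LLaMA-3 70B (80 layers, 8 KV heads,
# head dim 128, 32K context, BF16).
def kv_cache_bytes(n_layers, n_kv_heads, seq_len, d_head, bytes_per_elt=2):
    # Final x2: one tensor for K, one for V
    return n_layers * n_kv_heads * seq_len * d_head * bytes_per_elt * 2

gb = kv_cache_bytes(80, 8, 32 * 1024, 128) / 1e9
print(round(gb, 1))  # 10.7 GB per 32K-context sequence
```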

FlashAttention: A Hardware-Aware Solution

Standard attention computes the full N×N attention score matrix and stores it in HBM before applying softmax. For a 32K context, that's 32K × 32K × 2 bytes ≈ 2 GB per attention head, per layer — and it gets read and written multiple times. This is catastrophically bandwidth-inefficient.

FlashAttention (Dao et al., 2022) restructures the attention computation to avoid ever materializing the full matrix in HBM. It tiles the computation into blocks that fit in the SM's shared memory (SRAM), computes attention in chunks, and accumulates the result. Total HBM reads/writes drop by ~5–10× for long contexts. FlashAttention isn't mathematically different — it computes exact attention, identical up to floating-point rounding — but it's dramatically faster because it respects the memory hierarchy.
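The core trick is an online softmax over tiles that never materializes the full score vector. It can be sketched in pure Python for a single query (the decode case), with toy dimensions and an arbitrary block size; this makes no claim to FlashAttention's actual tiling or kernel structure:

```python
import math

def naive_attention(q, K, V):
    # Full materialization: all scores, then softmax, then a weighted sum.
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    out = [0.0] * len(V[0])
    for w, v in zip(weights, V):
        for j, vj in enumerate(v):
            out[j] += (w / z) * vj
    return out

def flash_attention(q, K, V, block=2):
    # Blockwise pass: keep a running max m, denominator l, and accumulator,
    # rescaling past contributions whenever a new block raises the max.
    # Scores are only ever held one block at a time.
    d = len(q)
    m, l = float("-inf"), 0.0
    acc = [0.0] * len(V[0])
    for start in range(0, len(K), block):
        kb, vb = K[start:start + block], V[start:start + block]
        s = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in kb]
        m_new = max(m, max(s))
        scale = math.exp(m - m_new)            # rescale old contributions
        l = l * scale + sum(math.exp(si - m_new) for si in s)
        acc = [a * scale for a in acc]
        for si, v in zip(s, vb):
            w = math.exp(si - m_new)
            for j, vj in enumerate(v):
                acc[j] += w * vj
        m = m_new
    return [a / l for a in acc]

q = [0.1, 0.2, 0.3]
K = [[0.5, 0.1, 0.0], [0.2, 0.4, 0.6], [0.9, 0.9, 0.9], [0.3, 0.1, 0.2]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
print(flash_attention(q, K, V))  # matches naive_attention(q, K, V)
```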

FlashAttention-2 and FlashAttention-3 further optimize for modern hardware, achieving near-theoretical peak efficiency on H100 Tensor Cores by overlapping compute and memory transfers.

PagedAttention: Managing the KV Cache

PagedAttention (Kwon et al., 2023), introduced in vLLM, treats the KV cache like virtual memory. Instead of allocating a contiguous block for each request's KV cache (which leads to fragmentation when request lengths vary), PagedAttention manages a pool of fixed-size "pages" and allocates them on demand. This allows serving systems to use near-100% of available VRAM for KV cache rather than reserving worst-case contiguous buffers, dramatically improving throughput for variable-length batches.
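The paging idea can be sketched as a toy allocator: fixed-size pages from a shared pool, a per-sequence page table, and pages recycled on completion (class and method names are invented for illustration; this is not vLLM's API):

```python
# Toy paged KV-cache allocator in the spirit of PagedAttention: a shared pool
# of fixed-size pages is handed out on demand as sequences grow, then
# returned to the free list when a request finishes.
class PagedKVCache:
    def __init__(self, total_pages, tokens_per_page=16):
        self.free = list(range(total_pages))   # indices of free physical pages
        self.tokens_per_page = tokens_per_page
        self.pages = {}    # seq_id -> list of physical page indices
        self.tokens = {}   # seq_id -> tokens written so far

    def append_token(self, seq_id):
        n = self.tokens.get(seq_id, 0)
        if n % self.tokens_per_page == 0:      # current page full (or first token)
            if not self.free:
                raise MemoryError("KV pool exhausted")
            self.pages.setdefault(seq_id, []).append(self.free.pop())
        self.tokens[seq_id] = n + 1

    def release(self, seq_id):
        self.free.extend(self.pages.pop(seq_id, []))
        self.tokens.pop(seq_id, None)

cache = PagedKVCache(total_pages=4, tokens_per_page=16)
for _ in range(20):
    cache.append_token("req-A")                # 20 tokens -> 2 pages
print(len(cache.pages["req-A"]), len(cache.free))  # 2 2
cache.release("req-A")
print(len(cache.free))  # 4 -> pages back in the pool for other requests
```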

Choosing Your GPU for Local AI

With the hardware model in mind, how do you evaluate GPUs for local LLM inference? Two dimensions dominate: VRAM capacity (determines which models fit) and memory bandwidth (determines tokens/sec).

  • RTX 3090 — 24 GB GDDR6X · 936 GB/s · 10,496 CUDA cores · 328 Tensor Cores (3rd gen) · ~$700–900 used. Best value pick.
  • RTX 4090 — 24 GB GDDR6X · 1,008 GB/s · 16,384 CUDA cores · 512 Tensor Cores (4th gen) · ~$1,599 new. Consumer king.
  • RTX 5090 — 32 GB GDDR7 · 1,792 GB/s · 21,760 CUDA cores · 680 Tensor Cores (5th gen) · ~$1,999 new. New consumer king.
  • H100 SXM5 — 80 GB HBM3 · 3,350 GB/s · 16,896 CUDA cores · 528 Tensor Cores (4th gen) · ~$25,000+ new. Data-center reference.
  • M3 Ultra (Mac Studio) — 192 GB unified memory · 800 GB/s · 60-core Apple Silicon GPU · ~$3,999+ new. Unique architecture.

What Fits Where: VRAM Capacity

A model's VRAM requirement depends on its size and quantization level:

Each entry below gives the model's footprint at BF16 (2 bytes/param), Q8_0 (1 byte/param), and Q4_K_M (~0.5 bytes/param), and whether the Q4 build fits a 24GB RTX 3090:

  • LLaMA-3.1 8B — BF16: 16 GB · Q8_0: 8 GB · Q4_K_M: ~4.7 GB → ✅ fits easily
  • LLaMA-3.1 70B — BF16: 140 GB · Q8_0: 70 GB · Q4_K_M: ~39 GB → ❌ needs 2× RTX 3090 (48GB) or an 80GB-class card
  • Qwen3 32B — BF16: 64 GB · Q8_0: 32 GB · Q4_K_M: ~18 GB → ✅ fits, with headroom
  • LLaMA-3.1 405B — BF16: 810 GB · Q8_0: 405 GB · Q4_K_M: ~228 GB → ❌ needs 8+ H100s or a very-high-capacity unified-memory machine
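A quick fit-check sketch; the 1.2× overhead factor for runtime buffers is an assumption, so tune it for your stack:

```python
# Rough VRAM fit check: weight bytes = params (billions) x bytes/param,
# inflated by an assumed 1.2x overhead for activations and runtime buffers.
def fits(n_params_b, bytes_per_param, vram_gb, overhead=1.2):
    need_gb = n_params_b * bytes_per_param * overhead
    return need_gb <= vram_gb, round(need_gb, 1)

print(fits(8, 0.5, 24))    # (True, 4.8)   8B at Q4 on a 24GB card
print(fits(70, 0.5, 24))   # (False, 42.0) 70B at Q4 needs ~42 GB
print(fits(70, 0.5, 48))   # (True, 42.0)  fits across 2x 24GB cards
```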

Bandwidth vs Capacity: The Real Tradeoff

The M3 Ultra is a fascinating case study. With 192GB of unified memory at ~800 GB/s, it can hold models that no single consumer GPU can touch, and higher-memory configurations reach even LLaMA-3.1 405B at 4-bit quantization (~228 GB). But its bandwidth is less than a quarter of the H100's 3.35 TB/s, so token generation is slower for models that fit in both.

For a 70B model at Q4:

  • M3 Ultra: ~39GB model fits, ~800 GB/s bandwidth → ~12–18 tokens/sec with MLX
  • H100: ~39GB model fits, ~3.35 TB/s bandwidth → ~45–65 tokens/sec
  • 2× RTX 3090: ~39GB split across both cards with tensor parallelism, ~1.87 TB/s aggregate → ~25–35 tokens/sec

Quantization as the Equalizer

Quantization deserves special attention here because it's the primary lever for fitting large models into consumer hardware. The key insight from quantization research is that 4-bit quantization achieves ~4× model-size reduction with less than 1% quality degradation on most benchmarks. This means:

  • A model that requires 80GB at BF16 fits comfortably in 24GB at Q4
  • Running Q4 instead of BF16 on the same GPU is ~4× faster in the memory-bound decode phase (you load 4× fewer bytes from VRAM per token)
  • The sweet spot for most users: Q4_K_M or Q6_K quantization via GGUF with llama.cpp

Practical recommendation: For local AI, maximize VRAM capacity first, then bandwidth second. A 24GB RTX 3090 at $700–900 (used) running Q4_K_M models outperforms an 8GB card running smaller models. If you can afford the RTX 5090 at 32GB + 1.79 TB/s, it's the consumer king for 2025–2026.

What Comes Next

GPU architecture is evolving rapidly, driven almost entirely by AI workloads. Several trends are worth watching:

Blackwell GB200 NVLink and Rack-Scale Systems

NVIDIA's Blackwell GB200 NVL72 rack connects 72 Blackwell GPUs with NVLink 5.0, presenting them as a single unified memory space of up to 13.5TB of HBM3e at 130 TB/s aggregate NVLink bandwidth. This eliminates the PCIe and inter-node bottlenecks that fragment large model deployments today. GB200 systems enable running trillion-parameter models in a single addressable memory space — a qualitative shift in what's possible without complex pipeline parallelism.

Unified Memory: Convergence of Apple and NVIDIA

Apple's unified memory architecture — where CPU and GPU share the same physical DRAM — eliminates the PCIe transfer overhead that slows heterogeneous CPU+GPU systems. NVIDIA is moving in the same direction with NVLink-C2C (Chip-to-Chip), which connects the Grace CPU and Hopper/Blackwell GPU with a coherent, high-bandwidth interconnect (900 GB/s on the Grace Hopper Superchip). The GH200 and GB200 NVL systems use this architecture, effectively creating a CPU+GPU unified memory pool at server scale.

Rust ML and New Hardware Abstractions

The emergence of ML frameworks in Rust (including Psionic and the Candle framework from HuggingFace) signals that the era of Python-only ML is ending. Rust's zero-cost abstractions and memory safety allow lower-overhead kernel dispatch and more predictable performance across heterogeneous hardware (CUDA, Metal, WebGPU). This matters because as ML moves to edge devices with varied GPU architectures, having portable low-level primitives becomes critical.

Photonic Computing: Beyond Silicon

Companies like Lightmatter and Celestial AI are developing photonic processors that perform matrix multiplications using light rather than electrons. Photonic matrix multiply is orders of magnitude more energy-efficient and theoretically immune to the memory bandwidth bottleneck (optical data transfer operates at the speed of light with minimal heat generation). Commercial photonic ML accelerators are still in early stages, but they represent a genuine alternative trajectory to the silicon GPU roadmap for AI inference at scale.

🔮 The Key Takeaways

Understanding GPU architecture changes how you think about every optimization decision. Memory bandwidth is the bottleneck — not FLOPS. Warp scheduling hides latency — keep occupancy high. Tensor Cores are ~14× faster than CUDA cores for matrix math — use FP16/BF16 or lower. FlashAttention works because it respects the memory hierarchy — avoid materializing large tensors in HBM.

For local AI: more VRAM = more model. More bandwidth = more tokens/sec. Quantization = both. The RTX 5090 at 32GB/1.79 TB/s and the Mac Studio M3 Ultra at 192GB/800 GB/s represent two different design philosophies — the former wins on throughput, the latter wins on model capacity.

The GPU is no longer a black box. It's the most important piece of hardware in the AI stack, and knowing how it works makes you a better engineer.

References

  1. NVIDIA. (2022). NVIDIA H100 Tensor Core GPU Architecture. NVIDIA Hopper Architecture Whitepaper
  2. NVIDIA. (2024). NVIDIA Blackwell Architecture Technical Brief. nvidia.com/blackwell
  3. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arxiv.org/abs/2205.14135
  4. Dao, T. (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arxiv.org/abs/2307.08691
  5. Kwon, W., et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. arxiv.org/abs/2309.06180
  6. Williams, S., Waterman, A., & Patterson, D. (2009). Roofline: An Insightful Visual Performance Model for Multicore Architectures. Communications of the ACM.
  7. Lindholm, E., Nickolls, J., Oberman, S., & Montrym, J. (2008). NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro.
  8. NVIDIA. (2020). NVIDIA A100 Tensor Core GPU Architecture. Ampere Architecture Whitepaper
  9. Shah, J., et al. (2024). FlashAttention-3: Fast and Accurate Attention for H100 GPUs. arxiv.org/abs/2407.08608
  10. Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2022). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arxiv.org/abs/2210.17323

Published March 29, 2026. Part of the ThinkSmart.Life deep-dive series on AI fundamentals.