Interconnects

Compute Interconnects: From On-Die to Datacenter-Scale

Master reference for the wires, fabrics, protocols, and software stacks that move data between transistors, chips, boards, racks, and datacenters. Covers the full electrical/optical/protocol stack from sub-nanosecond on-die NoCs to multi-millisecond WAN, with a focus on what matters for high-performance database, AI/ML, HPC, and storage systems in 2024-2026.


Table of Contents

  1. Mental Model and 7-Tier Latency Cheat Sheet
  2. Tier 1 — On-Die / Chiplet
  3. Tier 2 — Board / Internal Server
  4. Tier 3 — Storage Fabrics
  5. Tier 4 — Datacenter Network
  6. Tier 5 — HPC Fabrics
  7. Tier 6 — Software Stacks
  8. Tier 7 — Optical / Future
  9. RDMA Semantics Deep Dive
  10. Lossless Fabric Tuning
  11. Tail Latency Pathology
  12. Topology-Aware Collective Scheduling
  13. Cache Coherence on Fabric
  14. Bandwidth Math, Bisection BW, Oversubscription
  15. Power and Cost at Scale
  16. Security
  17. Mental Models — Decision Framework
  18. Practical Skills — Commands and Benchmarks
  19. Further Reading

1. Mental Model and 7-Tier Latency Cheat Sheet

The interconnect stack spans roughly 8 orders of magnitude in latency (~1 ns on-die to ~200 ms across the WAN) and 4 orders of magnitude in bandwidth. Each tier up in latency forces a different protocol, encoding, error model, and software paradigm. Keep this picture in your head:

   ON-DIE        PACKAGE        BOARD          RACK          ROW       DC-WIDE       WAN
    ~1 ns        ~5-10 ns      ~50-200 ns    ~500ns-1µs   ~1-3 µs   ~5-20 µs     1-200 ms
   1 TB/s        2-4 TB/s      32-128 GB/s   25-400 GB/s  ...        ...         1-100 Gb/s
  cache-          coherent     coherent or   non-coherent  packet    packet      TCP/QUIC
  coherent       coherent      non-coherent  (RDMA verbs)  (RDMA/     (RDMA/
  (snoopy        (UCIe/IFOP/                              UDP/IB)    Ethernet)
   /MESI)        NVLink-C2C)

| Tier | Distance | Typical latency | Typical BW (point-to-point) | Coherent? | Example tech | Software paradigm |
|---|---|---|---|---|---|---|
| 1. On-die | < 20 mm | 1-3 ns | 1-10 TB/s | yes | AXI/CHI, ring/mesh NoC | load/store |
| 1b. Chiplet (in-package) | 1-30 mm | 3-10 ns | 1-4 TB/s | yes (mostly) | UCIe, IFOP, NVLink-C2C, AIB, BoW | load/store, coherent DMA |
| 2. Board | 5-30 cm | 50-200 ns | 32-128 GB/s | yes (CXL) or no (PCIe) | PCIe gen5/6/7, CXL 2/3, NVLink, GMI3, xGMI | mmio, DMA, CXL.mem |
| 3. Rack (HPC interconnect, NVL72) | 1-3 m | 200 ns - 1 µs | 100-1800 GB/s | yes/no (NVLink Scale-Up coherent up to 72 GPUs) | NVLink+NVSwitch (NVL72), UALink, ICI, Slingshot | NCCL, MPI, verbs |
| 4. Row / ToR | 10 m | 1-3 µs | 25-400 Gb/s | no | InfiniBand HDR/NDR, RoCE v2, Slingshot 11 | verbs, libfabric, NCCL |
| 5. DC-wide | 100-500 m | 5-20 µs | 25-800 Gb/s | no | Ethernet 100/400/800G + dragonfly/Clos | gRPC, RDMA WRITE, NVMe-oF |
| 6. Cross-DC (metro) | 1-100 km | 0.1-2 ms | 10-1600 Gb/s | no | 400ZR, dark fiber, MACsec | async replication |
| 7. WAN | 100-15000 km | 5-200 ms | 1-100 Gb/s | no | submarine cables, QUIC/BBR | eventual consistency |

Bandwidth scaling law (rough rule): Per-lane SerDes signaling has roughly doubled every 3 years for two decades (1 → 2 → 4 → 10 → 25 → 50 → 100 → 200 Gb/s per lane, with PAM4 taking over from NRZ at the top rates). PAM4 carries 2 bits per symbol, so a 200 Gb/s lane runs at roughly 100-113 GBaud, not 200 GBaud. When per-lane SerDes hits practical limits (~224 Gb/s per lane is the current frontier for electrical), the only way to scale bandwidth is more pins (UCIe Advanced has thousands of bumps per mm²) or optics (CPO).

Latency floor: Light in fiber travels ~5 ns/m (n ≈ 1.5). 100 m of fiber = 500 ns one-way, irreducible. 100 km of metro fiber = 0.5 ms one-way. Speed of light is the budget; software hops within a datacenter spend most of their time at NICs and switches, not on the wire.
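The arithmetic behind the latency floor can be checked with a minimal sketch. It assumes a refractive index of 1.5 as stated above; the function name `fiber_one_way_ns` is made up for illustration.

```python
# Propagation delay in optical fiber (illustrative sketch, not a link-budget tool).
C_VACUUM_M_PER_S = 299_792_458   # speed of light in vacuum
REFRACTIVE_INDEX = 1.5           # approximate value for silica fiber, as in the text


def fiber_one_way_ns(distance_m: float, n: float = REFRACTIVE_INDEX) -> float:
    """One-way propagation delay in nanoseconds over distance_m metres of fiber."""
    return distance_m * n / C_VACUUM_M_PER_S * 1e9


for label, metres in [("1 m", 1), ("100 m", 100), ("100 km", 100_000)]:
    print(f"{label:>7}: {fiber_one_way_ns(metres):,.1f} ns one-way")
```

Anything a measured in-datacenter round trip shows beyond twice this figure is time spent in NICs, switches, and software rather than on the wire.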

Coherent-vs-non-coherent boundary: Cache coherence has historically lived inside the package (CPU socket, GPU complex). With CXL 2.0+ and NVLink generation 5, coherence now spans up to a tray (CXL.mem pooled across a 2-3 m chassis) and up to 72 GPUs (NVL72 NVLink fabric coherent loads/stores). Beyond that, the cost of snooping (or directory traffic) exceeds the value: latency, jitter, and tail propagation make it impractical. The software paradigm therefore shifts: coherent shared memory (snoopy/MESI-style load/store) inside the rack; explicit RDMA verbs / message passing across racks.

Why interconnects matter for AI workloads: A B200 GPU has ~10 TFLOPS FP64, ~2.25 PFLOPS FP8 dense (4.5 sparse), and 8 TB/s HBM3E. To keep it fed during LLM training (gradient AllReduce on tens of GB of optimizer state), the GPU must talk to its 71 NVL72 peers at 1.8 TB/s (NVLink5 bidir) and to remote racks at 400 Gb/s (ConnectX-7) or 800 Gb/s (ConnectX-8). The fabric, not the FLOPs, sets the ceiling for training throughput on models > 100B params.


2. Tier 1 — On-Die / Chiplet

2.1 AMBA Family (Arm)

The Arm AMBA family is the de-facto on-chip interconnect spec for Arm-based SoCs and is also widely licensed in non-Arm designs (FPGA fabrics, RISC-V SoCs, GPU chiplets). Seven specifications matter today, in roughly increasing order of capability:

| Spec | Year | Use case | Coherence | Notes |
|---|---|---|---|---|
| APB | 1995 | Low-bandwidth peripherals (UART, GPIO) | no | Single 32-bit data, simple handshake |
| AHB | 1999 | Mid-range memory, ROM | no | Pipelined, multi-master, single-cycle |
| AXI3 | 2003 | Mainstream high-bandwidth | no | Five independent channels |
| AXI4 | 2010 | Mainstream | no | Up to 256 beat bursts, QoS signals |
| AXI5 | 2017 | Mainstream + IO coherence | partial (ACE-Lite) | Atomic transactions, unique-ID interleave |
| ACE / ACE-Lite | 2011 | CPU cache coherence | yes (snoopy MOESI) | Adds Snoop channels AC/CR/CD |
| CHI | 2014 | Mesh/ring, server-class coherence | yes (directory or snoopy) | Packet-based, scales to hundreds of nodes |

AXI4 channels (most-used variant on FPGA / DMA / accelerator):

Master                                Slave
  │                                     │
  │ AW (write address)  ──────────────► │
  │ W  (write data, burst, last) ─────► │
  │ B  (write response, OKAY/SLVERR) ◄─ │
  │                                     │
  │ AR (read address) ────────────────► │
  │ R  (read data + last + RESP) ◄───── │

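The five channels are independent (separate VALID/READY handshakes), allowing reads and writes to proceed concurrently. Transaction IDs let a master keep many transactions in flight and match responses to requests: BID matches AWID for writes, and RID matches ARID for reads. Responses carrying different IDs may return out of order; responses carrying the same ID return in issue order.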
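The ID-matching idea can be illustrated with a toy master-side tracker for reads. This is a sketch only: the class `ReadTracker` and its methods are hypothetical bookkeeping, not part of any AMBA deliverable, and it models nothing about the handshake signals themselves.

```python
from collections import defaultdict, deque


class ReadTracker:
    """Toy bookkeeping for outstanding AXI reads, keyed by transaction ID (hypothetical)."""

    def __init__(self) -> None:
        # ARID -> FIFO of addresses; same-ID reads complete in issue order.
        self._pending = defaultdict(deque)

    def issue(self, arid: int, addr: int) -> None:
        """Record a read request sent on the AR channel."""
        self._pending[arid].append(addr)

    def complete(self, rid: int) -> int:
        """Match a response seen on the R channel; return the address it answers."""
        queue = self._pending.get(rid)
        if not queue:
            raise KeyError(f"unexpected response for RID {rid}")
        return queue.popleft()

    def outstanding(self) -> int:
        return sum(len(q) for q in self._pending.values())


tracker = ReadTracker()
tracker.issue(arid=0, addr=0x1000)
tracker.issue(arid=1, addr=0x2000)
tracker.issue(arid=0, addr=0x3000)

# The ID 1 response may overtake the earlier ID 0 request.
print(hex(tracker.complete(rid=1)))
print(hex(tracker.complete(rid=0)))
print(tracker.outstanding())
```

Using one FIFO per ID captures both halves of the rule: different IDs are free to reorder, while a single ID behaves like an ordered stream.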

CHI (Coherent Hub Interface): CHI is what Arm-based server chips (Neoverse N1/N2/V1/V2/V3, Ampere Altra, AWS Graviton 3/4) use as the on-die fabric. Key differences from ACE:

  • Packet-based, not channel-based. Requests and snoops fly as packets on a routed mesh, not on signal-channel-per-direction wires.
  • Four logical channel types (REQ request, RSP response, SNP snoop, DAT data), each layered over a physical NoC (mesh, ring).
  • Directory-based or snoopy. CHI-A is snoopy (broadcast); CHI-B/C/D/E add directory support, multi-chiplet fabrics, atomic transactions, persistent CMO (cache maintenance), trace, and Realm Management Extension (CCA).
  • Hierarchical coherent gateways. CMN-700 (Neoverse mesh) scales to 256 cores per die and supports multi-die coherence via Coherent Mesh Gateways (CMG) and CCIX/CXL bridges.

ACE-Lite: A reduced form of ACE for non-cacheable masters (DMA engines, accelerators) that still need to participate in system-level coherence (snoop the CPU caches). Used heavily for GPU/NPU integration on mobile SoCs.

2.2 Intel UPI / Predecessors

UPI (Ultra Path Interconnect) is Intel's coherent inter-socket and inter-die interconnect, introduced with Skylake-SP (Xeon Scalable v1, 2017) and evolved through Sapphire Rapids, Emerald Rapids, Granite Rapids. Predecessor: QuickPath Interconnect (QPI, Nehalem 2008 through Broadwell).

| Generation | Released | Speed (GT/s) | Per-link BW (one direction) | Used in |
|---|---|---|---|---|
| QPI 1.0 | 2008 (Nehalem) | 6.4 | 12.8 GB/s | Xeon 5500-5600 |
| QPI 1.1 | 2011 (Sandy Bridge-EP) | 8.0 | 16 GB/s | Xeon E5 |
| QPI 1.2 | 2014 (Haswell-EP) | 9.6 | 19.2 GB/s | Xeon E5/E7 v3-v4 |
| UPI 1.0 | 2017 (Skylake-SP) | 10.4 | 20.8 GB/s | Xeon SP gen1/2/3 |
| UPI 2.0 | 2023 (Sapphire Rapids) | 16.0 | 32 GB/s | Xeon SP gen4/5 |
| UPI 2.0 (GNR) | 2024 (Granite Rapids) | 24.0 | 48 GB/s | Xeon 6 |

MESIF protocol: Intel's MESI extension adds a Forward (F) state. Exactly one cached copy of a shared line holds F state; that cache is responsible for responding to read requests, eliminating wasteful "all sockets respond simultaneously" traffic. The home agent maintains a directory; the F-state holder is the designated forwarder. Compare with AMD MOESI which uses an Owned (O) state to allow shared-dirty caching.

HitMe cache: UPI's directory-based coherence is augmented by a "HitMe" cache at the home agent — a small (~1 MB per channel) directory cache holding recently-snooped line metadata to skip the (slow) DDR directory bit lookup. Hit on HitMe = snoop only the relevant agent; miss = broadcast snoop (with directory consult to filter).

Flit layer: UPI uses 192-bit (24 byte) flits with 8 bytes of header/CRC and 16 bytes of payload. Three message classes (Request, Snoop, Response) share the link with credit-based flow control.

2.3 AMD Infinity Fabric

Infinity Fabric (IF) is AMD's umbrella term for its coherent on-package and inter-socket interconnect, spanning from Zen 1 (2017) through Zen 5 (2024) and the MI300 series. It is built on the HyperTransport 3.x electrical layer with a custom AMD-defined protocol layer.

| Variant | Scope | Generation | Lanes / link | Per-link BW | Notes |
|---|---|---|---|---|---|
| IFOP (On-Package) | CCD ↔ IOD | Zen 2/3: 16 GT/s, Zen 4: 32 GT/s, Zen 5: 36 GT/s | 32 lanes | 64 GB/s (Zen 4) | One read + one write link per CCD; IOD is the central crossbar |
| IFIS (Inter-Socket) | Socket ↔ Socket | xGMI gen3 (Zen 3): 18 GT/s, gen4 (Zen 4): 32 GT/s, gen5 (Zen 5): 32 GT/s | 16 lanes per link, 3-4 links per socket | 64 GB/s/link | Used in 2P EPYC; can re-purpose as PCIe lanes |
| GMI / GMI3 (Global Memory Interconnect) | CCD ↔ IOD on EPYC | GMI3: 36 GT/s, narrow (Zen 4); GMI3-Wide (Zen 4 SP5 64-core+): 32 lanes 36 GT/s | up to 32 lanes | ~36-72 GB/s | Replaces IFOP in datacenter EPYC; coherent |
| xGMI | EPYC ↔ EPYC (2P), MI300 ↔ MI300 | Up to 32 GT/s (gen4), 36 GT/s (gen5) | 16-32 lanes | up to ~144 GB/s per pair | MI300X uses 7 xGMI links = ~896 GB/s per GPU |
| NVLink-style on MI300 | MI300 cluster (8 GPUs) | xGMI | 7 links × 16 lanes | 896 GB/s aggregate per GPU | All-to-all in 8-GPU cube; basis of MI300X reference platform |

Coherence: MOESI base protocol. Each CCD has a private L3 (Zen 2/3: 16 MB; Zen 4: 32 MB; Zen 5: 32 MB; X3D: +64 MB). Cross-CCD lines are snooped through the IOD. The IOD also houses the memory controllers, PCIe root complex, and Infinity Fabric Switch.

SCF (System Coherent Fabric): The data fabric inside the IOD that routes between CCDs, memory controllers, PCIe, and IO. SCF clock (FCLK) is independent of memory clock (MCLK); UCLK = unified memory controller clock. Crossing FCLK/MCLK domains incurs a ~10 ns penalty per crossing. For best memory latency, FCLK = MCLK = UCLK (1:1:1 ratio); for high memory speed past 6000 MT/s on Zen 4, expect 2:1 desync.

2.4 NVLink-C2C

NVLink-C2C is NVIDIA's chip-to-chip variant of NVLink, used in:

  • Grace-Hopper (Grace CPU ↔ H100/H200 GPU): 900 GB/s bidirectional, coherent
  • Grace-Blackwell GB200 (Grace CPU ↔ 2× B200 GPUs): 900 GB/s per CPU-GPU link
  • Custom partner chips via NVIDIA's NVLink-C2C IP

It uses the same 100G/lane PAM4 electrical signaling as NVLink-5 but is optimized for short on-board / on-substrate distances. Critically: it is fully cache-coherent, which means GPU code can issue LD/ST against host LPDDR5X memory directly — no explicit cudaMemcpy needed. The CPU and GPU see a single 624 GB unified address space (Grace: 480 GB LPDDR5X + Hopper: 144 GB HBM3e on H200).

The trade-off: coherence at this bandwidth requires aggressive directory traffic; latency is ~250 ns CPU→GPU, vs ~80-100 ns same-socket DRAM. Use NVLink-C2C for working sets too large to fit in HBM (KV cache spillover, parameter offload), not for inner-loop bandwidth-bound kernels.

2.5 UCIe (Universal Chiplet Interconnect Express)

UCIe is the open standard for die-to-die (D2D) chiplet interconnect, published in March 2022 by an industry consortium (Intel, AMD, Arm, Google, Microsoft, Qualcomm, Samsung, TSMC, Meta, with NVIDIA joining later). Goal: a PCIe-like ecosystem of interoperable chiplets, where an integrator can mix and match dies from any vendor in one package.

| Spec | Released | PHY data rate | Bump pitch | Reach | Per-mm shoreline BW (Advanced) | Notes |
|---|---|---|---|---|---|---|
| UCIe 1.0 | Mar 2022 | 4-32 GT/s | Std 110-100 µm, Adv 45 µm (later 25 µm) | <2 mm Adv, <25 mm Std | ~10-32 GB/s/mm Std, up to ~165 GB/s/mm Adv | PCIe + CXL protocol; retimer support |
| UCIe 1.1 | Aug 2023 | 4-32 GT/s | Adv 45/25/36/55 µm | same | same | Automotive + manageability + raw streaming |
| UCIe 2.0 | Aug 2024 | 4-32 GT/s | Adv adds 25/55 µm | same | same | 3D stacking, system architecture (D2D Manageability Architecture), Memory + Raw streaming for HBM-on-die |
| UCIe 3.0 | Aug 2025 | up to 64 GT/s | Adv | same | up to ~331 GB/s/mm | Doubles the top data rate (48 and 64 GT/s); refined for AI clusters |

Layered architecture:

+---------------------------------------------+
|  Protocol Layer (PCIe, CXL, Streaming Raw)  |
+---------------------------------------------+
|  D2D Adapter (CRC, retry, link state mgmt)  |
+---------------------------------------------+
|  PHY (sideband + main band, lane training)  |
+---------------------------------------------+
|  Bumps / Interposer                          |
+---------------------------------------------+

Standard vs Advanced packaging:

  • Standard Package: Organic substrate (PCB-style routing, 110-100 µm bump pitch), reach up to 25 mm, requires retimers for longer distances. Cheaper, looser tolerances.
  • Advanced Package: Silicon interposer or embedded bridge (CoWoS, EMIB, InFO), 45 µm pitch (UCIe 1.0), tightening to 25 µm (UCIe 2.0+), reach < 2 mm. Multi-thousand-bump shoreline, sub-pJ/bit energy.

Protocol mapping: UCIe transports PCIe and CXL natively (so a chiplet that speaks PCIe can plug into a UCIe link transparently), plus a "streaming raw" mode for custom protocols. CXL.io, .cache, and .mem all map cleanly.

KP4 BoB: The KP4 Bunch-of-Bumps test vehicle defined by Ayar Labs/Eliyan/others is a common reference physical layout used in early UCIe interop demos (2023-2024). It standardizes ~512-1024 bumps in a regular grid for D2D testing.

2.6 TileLink

TileLink is an open chip-coherence protocol developed at UC Berkeley (Asanović et al., 2014+) and the standard fabric on SiFive/Chipyard RISC-V cores (Rocket, BOOM, NaxRiscv). Three conformance levels:

| Level | Capability | Use case |
|---|---|---|
| TL-UL (Uncached Lightweight) | Single in-flight transaction per channel, simple R/W | Low-bandwidth peripherals (UART, SPI) |
| TL-UH (Uncached Heavyweight) | Multiple outstanding, burst, atomics, hints | Memory controllers, DMA |
| TL-C (Cached) | Full MOESI-like coherence with probe/grant channels | L1↔L2↔LLC, multi-core |

TileLink has 5 channels (A, B, C, D, E) carrying Acquire/Probe/Release/Grant messages — analogous to AMBA CHI's REQ/SNP/RSP but with explicit probe channels for snooping. Open spec, used in OpenTitan, BOOM, and many academic chips.

2.7 BoW, AIB, OpenHBI (Pre-UCIe Chiplet Standards)

Before UCIe consolidated the chiplet market, several alternatives competed:

| Standard | Backer | Status | Notes |
|---|---|---|---|
| BoW (Bunch of Wires) | OCP / OIF / Marvell | Largely subsumed by UCIe but still used in some custom designs | Targeted simple parallel wires across 2 mm; data rates up to 16 GT/s |
| AIB (Advanced Interface Bus) | Intel | Open-sourced 2020, used in EMIB-based Intel chiplets (Sapphire Rapids HBM-tile interconnect, Ponte Vecchio); influenced UCIe Advanced | First widely-deployed advanced-pkg D2D; 1024 wires per "channel" |
| OpenHBI | OCP HBI workgroup | Largely overlapping with HBM PHY; targeted memory-class D2D | Defines link to HBM-like memory dies |
| XSR (Extra Short Reach) | OIF | Survives for high-end CPO/optical engine interfaces | 56-112 GT/s SerDes for <2 cm reach |

UCIe 1.0+ has consolidated the mainstream chiplet ecosystem; AIB, BoW, OpenHBI remain in legacy/specialized designs.

2.8 HBM3, HBM3E, HBM4 as On-Package Interconnect

HBM (High-Bandwidth Memory) is technically DRAM, but the JEDEC-defined HBM PHY also acts as an interconnect to the host die (CPU/GPU/accelerator). Pin count and per-pin rate matter as much as bandwidth.

| Standard | Released | Stack height (typical) | Bus width per stack | Per-pin rate | Per-stack BW | Notes |
|---|---|---|---|---|---|---|
| HBM2 | 2016 | 4-8 high | 1024 bits | 2 Gbps | 256 GB/s | Used in V100, Vega 20 |
| HBM2E | 2020 | 8 high | 1024 bits | 3.6 Gbps | 460 GB/s | A100 (5 stacks = 2 TB/s) |
| HBM3 | 2022 | 8-12 high | 1024 bits | 6.4 Gbps | 819 GB/s | H100 (5 stacks active = 3.35 TB/s) |
| HBM3E | 2024 | 12 high | 1024 bits | 9.2 Gbps | 1.18 TB/s | H200 (6 stacks = 4.8 TB/s), B200 (8 stacks = 8 TB/s) |
| HBM4 | 2026 (sampling) | 12-16 high | 2048 bits (doubled!) | 8 Gbps | ~2.0 TB/s | MI400, Rubin generation; doubling bus width is the big change |
| HBM4E | ~2028 | 16 high | 2048 bits | 12 Gbps | ~3.0 TB/s | Proposed |

The HBM4 bus-width doubling matters: Per-pin signaling has hit thermal/electrical walls (the HBM3E 9.2 Gbps is already aggressive). HBM4 doubles the parallel bus width from 1024 to 2048 bits while keeping per-pin rate moderate (8 Gbps). This forces ~2× the bumps between memory stack and host die — driving demand for advanced packaging (CoWoS-S/L, Intel Foveros, Samsung X-Cube).
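The per-stack figures in the HBM table follow from bus width times per-pin rate. A minimal sketch, using the widths and pin rates quoted above as inputs (the helper name `hbm_stack_bw_gbs` is made up for illustration):

```python
def hbm_stack_bw_gbs(bus_width_bits: int, pin_rate_gbps: float) -> float:
    """Peak per-stack bandwidth in GB/s: data pins x per-pin Gb/s / 8 bits per byte."""
    return bus_width_bits * pin_rate_gbps / 8


# (bus width in bits, per-pin rate in Gb/s), taken from the table in this section.
generations = {
    "HBM2": (1024, 2.0),
    "HBM2E": (1024, 3.6),
    "HBM3": (1024, 6.4),
    "HBM3E": (1024, 9.2),
    "HBM4": (2048, 8.0),
}

for name, (width, rate) in generations.items():
    print(f"{name:6s} {hbm_stack_bw_gbs(width, rate):8.1f} GB/s per stack")
```

The HBM4 row shows the trade: a lower pin rate than HBM3E still yields more bandwidth per stack because the bus is twice as wide.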

Implication for accelerator design: A B200 with 8 HBM3E stacks consumes ~8 × 1024 = 8192 wires just for the HBM interface (plus command/address). With HBM4 at 2× width, this becomes ~16384 wires. Combined with NVLink (18 links × ~100 wires) and UCIe to neighbor chiplets, shoreline (bumps per mm of die edge) becomes the critical scaling constraint, not transistor density. This drives the move to 3D stacking (logic die underneath HBM stack) where the interconnect goes vertical rather than horizontal.


3. Tier 2 — Board / Internal Server

3.1 PCI Express Summary

(For exhaustive PCIe coverage, see pcie_internals.md.) Single-table reference:

| Gen | Year ratified | Encoding | Raw GT/s/lane | Useful GB/s/lane | x4 BW | x16 BW (one direction) | x16 bidir | Status |
|---|---|---|---|---|---|---|---|---|
| 1.0 | 2003 | 8b/10b | 2.5 | 0.250 | 1 GB/s | 4 GB/s | 8 GB/s | Legacy |
| 2.0 | 2007 | 8b/10b | 5.0 | 0.500 | 2 GB/s | 8 GB/s | 16 GB/s | Legacy |
| 3.0 | 2010 | 128b/130b | 8.0 | 0.985 | 3.94 GB/s | 15.75 GB/s | 31.5 GB/s | Mainstream / legacy server |
| 4.0 | 2017 | 128b/130b | 16 | 1.97 | 7.88 GB/s | 31.5 GB/s | 63 GB/s | Mainstream |
| 5.0 | 2019 | 128b/130b | 32 | 3.94 | 15.75 GB/s | 63 GB/s | 126 GB/s | Datacenter standard 2024 (including Granite Rapids, Turin) |
| 6.0 | 2022 | PAM4 + FLIT 256B + FEC | 64 | 7.56 | 30.25 GB/s | 121 GB/s | 242 GB/s | Shipping 2025-2026 |
| 7.0 | 2025 | PAM4 + FLIT 256B + FEC | 128 | 15.13 | 60.5 GB/s | 242 GB/s | 484 GB/s | Spec released; first silicon 2026-2027 |

Key shift at PCIe 6.0: PAM4 signaling (4 levels = 2 bits/symbol) replaces NRZ, and 128b/130b encoding is replaced by 256-byte FLIT mode with forward error correction (FEC). FEC adds ~2 ns of latency for the FLIT roundtrip but is essential at PAM4 SNR margins. CXL 3.0+ uses the same FLIT mode.
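The "useful GB/s per lane" column can be reproduced from the raw rate and an encoding efficiency. In the sketch below, the 8b/10b and 128b/130b efficiencies follow directly from the encodings; the 242/256 factor for FLIT mode is an assumption chosen because it reproduces the 7.56 GB/s figure in the table, not a number quoted from the spec.

```python
# (raw GT/s per lane, encoding efficiency) per generation.
PCIE_GENS = {
    "1.0": (2.5, 8 / 10),
    "2.0": (5.0, 8 / 10),
    "3.0": (8.0, 128 / 130),
    "4.0": (16.0, 128 / 130),
    "5.0": (32.0, 128 / 130),
    "6.0": (64.0, 242 / 256),   # assumed FLIT efficiency, see lead-in
    "7.0": (128.0, 242 / 256),  # same assumption
}


def lane_gbs(gen: str) -> float:
    """Useful one-direction bandwidth of a single lane in GB/s."""
    raw_gt_s, efficiency = PCIE_GENS[gen]
    return raw_gt_s * efficiency / 8


def link_gbs(gen: str, lanes: int = 16) -> float:
    """Useful one-direction bandwidth of an xN link in GB/s."""
    return lane_gbs(gen) * lanes


for gen in PCIE_GENS:
    print(f"PCIe {gen}: {lane_gbs(gen):6.3f} GB/s/lane, x16 = {link_gbs(gen):6.1f} GB/s")
```

These are link-layer ceilings; TLP headers and flow-control overhead reduce achievable payload throughput further.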

3.2 CXL (Compute Express Link)

CXL is the cache-coherent, low-latency interconnect built on PCIe physical and link layers. The protocol layer multiplexes three sub-protocols:

| Sub-protocol | What it transports | Used by |
|---|---|---|
| CXL.io | PCIe TLPs (discovery, configuration, BAR, MMIO) | All CXL devices |
| CXL.cache | Coherent caching requests from device to host caches | Type 1, Type 2 |
| CXL.mem | Host-issued loads/stores to device memory | Type 2, Type 3 |

Device types:

| Type | Description | Protocols | Example |
|---|---|---|---|
| Type 1 | Accelerator with its own caches, no device-side memory | .io + .cache | SmartNICs, FPGA caches |
| Type 2 | Accelerator with caches AND attached memory; both protocols | .io + .cache + .mem | GPUs (future), AI inference accelerators |
| Type 3 | Memory expander (no caches, just bulk memory exposed via .mem) | .io + .mem | Samsung CMM, SK Hynix CMM, Micron CZ120, Astera Leo |

Version timeline:

| Version | Year | Topology | Pool | Switching | Notable additions |
|---|---|---|---|---|---|
| CXL 1.0 | Mar 2019 | Direct attach 1 host - 1 device | no | no | First public release; runs on PCIe 5 PHY |
| CXL 1.1 | Jun 2019 | Same | no | no | Compliance + small fixes; the first widely-implemented spec |
| CXL 2.0 | Nov 2020 | Switched | yes (multi-LD) | single-level switch | Memory pooling across up to 16 hosts; CXL switching; persistence flush; IDE encryption |
| CXL 3.0 | Aug 2022 | Fabric (multi-host, multi-switch) | yes | Multi-level switching | Doubles to 64 GT/s (PCIe 6 PHY, PAM4); 256B FLIT; peer-to-peer (P2P) device-to-device; GFAM (Global Fabric Attached Memory); HDM-DB (Device-managed back-invalidation); coherence over fabric |
| CXL 3.1 | Nov 2023 | Same fabric | yes | + scale-out via PBR | TEE Security Protocol (TSP) on top of TDISP; Port-Based Routing (PBR) for large fabrics; GFAM enhancements; Global Integrated Memory (GIM) attachment |
| CXL 3.2 | Dec 2024 | Same | yes | optimized PBR | Optimized fabric management; CCI (Component Command Interface) updates; post-quantum considerations in IDE; sysfs management attestation enhancements |

HDM-DB (Host-managed Device Memory — Device-managed coherence): Critical concept in CXL 3.0+. In HDM-H (Host-managed coherence, CXL 2.0 default), the host CPU owns the coherence directory; every cache line in CXL.mem space is tracked by the host. This scales poorly past ~256 GB of pooled memory.

HDM-DB lets the device manage coherence — device caches a line, device tracks which hosts cached it, device issues back-invalidations (BI) to evict from host caches when needed. The host's only coherence obligation is to respond to BI messages. This decouples coherence directory size from host LLC size and is essential for fabric-attached memory >1 TB.
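The device-managed flow described above can be sketched as a toy sharer directory. Everything here is illustrative: `DeviceDirectory` and its methods are hypothetical names, hosts are plain integers, and no actual CXL.mem message formats or states are modelled.

```python
class DeviceDirectory:
    """Toy device-side directory: tracks which hosts cache each line (hypothetical)."""

    def __init__(self) -> None:
        self.sharers: dict[int, set[int]] = {}  # line address -> host ids caching it

    def host_read(self, host: int, addr: int) -> None:
        """A host caches the line; the device records it as a sharer."""
        self.sharers.setdefault(addr, set()).add(host)

    def host_write(self, host: int, addr: int) -> list[int]:
        """Grant exclusive access; return the hosts that must get a back-invalidate."""
        current = self.sharers.get(addr, set())
        to_invalidate = sorted(current - {host})
        self.sharers[addr] = {host}  # the writer is now the only cached copy
        return to_invalidate


directory = DeviceDirectory()
directory.host_read(host=0, addr=0x40)
directory.host_read(host=1, addr=0x40)

# Host 2 wants to write: the device, not any host, decides who gets invalidated.
print(directory.host_write(host=2, addr=0x40))
```

The point of the sketch is where the state lives: the sharer sets sit on the device, so their size scales with device memory rather than with any host's cache hierarchy.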

Fabric mode (CXL 3.0+): The biggest architectural shift. Up to 4096 nodes (hosts + devices) in a single coherent fabric. Multiple switching layers, port-based routing (so routes don't need full Tree-based hierarchical IDs), peer-to-peer DMA between devices through the fabric, and Global Fabric Attached Memory (GFAM) — pooled memory accessible from any host with sub-microsecond latency.

GFAM: Memory devices that sit in the fabric and serve any host as a shared memory pool. Imagine 8 TB of pooled DRAM accessible from 32 hosts; each host sees it as a transparent memory region. Use cases: large database buffer pools (shared across nodes), in-memory caches (Redis-like), AI checkpoint storage. Reference designs: Samsung Memory Expander Modules, Astera Leo Gen 2.

IDE (Integrity and Data Encryption): CXL line-level encryption (AES-GCM) on every flit. Configurable per virtual channel. Adds <3 ns latency in modern controllers.

TDISP (TEE Device Interface Security Protocol): PCIe spec adopted by CXL for attesting confidential devices. Lets a TEE (Intel TDX, AMD SEV-SNP) verify that a CXL device is genuine and operating in a trusted mode before mapping its memory. Required for confidential AI workloads on cloud.

Vendors and parts:

| Vendor | Product | Type | Capacity | Notes |
|---|---|---|---|---|
| Samsung | CMM-D (CXL Memory Module DRAM) | Type 3 | 128/256 GB | First mass-market CXL 2.0 module |
| SK Hynix | CMM-DDR5 / CMM-2LM | Type 3 | 96/256 GB | Used in Tier-2 hot data offload |
| Micron | CZ120 | Type 3 | 128 GB | E3.S form factor, PCIe 5 |
| Astera Labs | Leo | Type 3 + retimers | up to 2 TB per module | Leading independent CXL memory IC supplier (Aries retimers, Leo controllers, Scorpio fabric switches) |
| Marvell | Structera CXL-X | Type 3 | up to 240 GB+ | Disaggregated memory + cache acceleration |
| Microchip | SMC 2000 | Type 3 | 128/256/512 GB | High-end DDR5 controller |
| Panmnesia | CXL 3.1 switch | switch | 64 lanes | First CXL 3.1 switch demoed late 2024 |

3.3 NVLink and NVSwitch

NVLink is NVIDIA's proprietary high-bandwidth GPU-to-GPU (and now GPU-to-CPU via NVLink-C2C, plus GPU-to-NVSwitch) interconnect.

| Gen | GPU debut | Year | Per-link bidir BW | Links per GPU | Total per GPU (bidir) | Notes |
|---|---|---|---|---|---|---|
| 1 | P100 | 2016 | 40 GB/s | 4 | 160 GB/s | First NVLink |
| 2 | V100 | 2017 | 50 GB/s | 6 | 300 GB/s | NVSwitch 1.0 introduced in DGX-2 (16-GPU all-to-all) |
| 3 | A100 | 2020 | 50 GB/s | 12 | 600 GB/s | NVSwitch 2.0 |
| 4 | H100 | 2022 | 50 GB/s | 18 | 900 GB/s | NVSwitch 3.0 with NVLink Sharp (NVLS); 50 GBaud PAM4 (100 Gb/s) per lane |
| 5 | B200 / GB200 | 2024 | 100 GB/s | 18 | 1.8 TB/s | 100 GBaud PAM4 (200 Gb/s) per lane; NVLink Switch tray; NVL72 enables 72-GPU coherent domain |
| 6 (Rubin) | R100 (expected) | 2026-2027 | 200 GB/s | 18+ | ~3.6 TB/s | 200G per lane |

NVSwitch generations:

| Switch gen | GPU gen | Per-switch BW | Total switches per node/rack | Used in |
|---|---|---|---|---|
| NVSwitch 1.0 | V100 | 50 GB/s × 18 ports = 900 GB/s | 6 switches per HGX-2 | DGX-2 (16 GPUs all-to-all) |
| NVSwitch 2.0 | A100 | 1.6 TB/s aggregate | 6 per HGX-A100 | DGX A100 (8 GPUs) |
| NVSwitch 3.0 | H100 | 3.2 TB/s aggregate, with NVLS in-switch reduction | 4 per HGX-H100 (8 GPU); 9 trays for NVL72 | DGX H100, HGX H100/H200, NVL72 |
| NVSwitch 4.0 | B200/B300 | ~7.2 TB/s aggregate, supports 72-GPU fabric | 9 NVSwitch trays per NVL72 | GB200 NVL72, GB300 NVL72 |

NVLink Sharp (NVLS): In-switch reduction. Instead of every GPU sending data to a root and reducing serially, NVSwitch 3.0+ has dedicated reduction ALUs inside the switch silicon. AllReduce moves from O(N) message exchanges per GPU to O(log N) with the switch doing the math. For an N-GPU ring AllReduce on M bytes, the time model goes from 2(N-1)/N × M/B (ring) to M/B + α log N (NVLS / tree). On 72-GPU NVL72 doing FP8 AllReduce on 64 GB of tensors, NVLS halves AllReduce time (and frees compute streams from waiting).
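The two time models quoted above can be compared numerically. This is a bandwidth-term sketch only: the 900 GB/s per-direction NVLink5 figure, the 1 µs per-step latency α, and the base-2 logarithm are assumptions picked for illustration, and the ring model omits its own latency term.

```python
import math


def ring_allreduce_s(n_gpus: int, m_bytes: float, bw_bytes_s: float) -> float:
    """Ring AllReduce bandwidth term: 2(N-1)/N x M/B."""
    return 2 * (n_gpus - 1) / n_gpus * m_bytes / bw_bytes_s


def nvls_allreduce_s(n_gpus: int, m_bytes: float, bw_bytes_s: float,
                     alpha_s: float) -> float:
    """In-switch reduction model: M/B + alpha x log2(N)."""
    return m_bytes / bw_bytes_s + alpha_s * math.log2(n_gpus)


N = 72        # GPUs in one NVL72 domain
M = 64e9      # 64 GB of tensors, as in the example above
B = 900e9     # assumed per-direction NVLink5 bandwidth per GPU, bytes/s
ALPHA = 1e-6  # assumed per-step latency, seconds

ring = ring_allreduce_s(N, M, B)
nvls = nvls_allreduce_s(N, M, B, ALPHA)
print(f"ring: {ring * 1e3:.1f} ms, NVLS: {nvls * 1e3:.1f} ms, speedup {ring / nvls:.2f}x")
```

For large messages the α term is negligible, so the speedup approaches 2(N-1)/N, which is just under 2 for N = 72 and matches the "halves AllReduce time" claim.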

NVL72 — the 72-GPU rack-scale architecture:

NVL72 Rack (single coherent NVLink domain — 72 B200 GPUs, 130 TB/s aggregate):
+============================================================================+
| Spine (NVLink interconnect — copper backplane, ~5000 cables, water cooled) |
+============================================================================+
| NVSwitch tray 9                                                              |
| NVSwitch tray 8                                                              |
| NVSwitch tray 7                                                              |
| NVSwitch tray 6                                                              |
| NVSwitch tray 5                                                              |
| NVSwitch tray 4   ← 9 NVSwitch trays in middle of rack                       |
| NVSwitch tray 3                                                              |
| NVSwitch tray 2                                                              |
| NVSwitch tray 1                                                              |
+============================================================================+
| Compute tray 18  (4 B200 GPUs + 2 Grace CPUs)                                |
| Compute tray 17                                                              |
| Compute tray 16                                                              |
| Compute tray 15                                                              |
| Compute tray 14                                                              |
| ...                                                                          |
| Compute tray 1  (4 B200 + 2 Grace via NVLink-C2C; 18 trays × 4 GPU = 72 GPU) |
+============================================================================+
| Power shelf + management                                                    |
+============================================================================+
  • 18 compute trays, each with 2 Grace CPUs + 4 B200 GPUs (or 2 GB200 Superchips = 2 Grace + 4 B200)
  • 9 NVSwitch trays at the middle of the rack, providing 130 TB/s aggregate NVLink bandwidth (72 GPUs × 1.8 TB/s)
  • All 72 GPUs in a single NVLink fabric: each GPU has 1.8 TB/s of NVLink bandwidth and reaches every peer in one switch hop
  • Compute-to-switch: copper backplane (called the "NVLink spine") with ~5000 individual NVLink cables totaling >2 miles
  • Power: 120 kW peak per rack; liquid-cooled (cold plates on every GPU and CPU)

For inference (large-context LLM serving on 1T+ params), NVL72 enables tensor-parallel + pipeline-parallel mapping with NVLink-only communication; no IB/Ethernet step is needed for in-rack tokens. For training, NVL72 acts as a fast "scale-up" domain; multiple NVL72 racks connect via 400G/800G InfiniBand (NDR/XDR) into the "scale-out" cluster.

3.4 AMD xGMI and UALink

MI300X 8-GPU node: Each MI300X has 7 xGMI links of 64 GB/s per direction (128 GB/s bidirectional), organized as a fully-connected 8-GPU graph (each pair has 1 direct xGMI link). Reference HGX-MI300X compute board mirrors NVIDIA HGX. Per-GPU peer BW: 7 × 128 GB/s = 896 GB/s bidirectional. There is no external switch tier yet, so the limit is 8 GPUs in one domain.

UALink (Ultra Accelerator Link): Consortium launched 2024 by AMD, Broadcom, Cisco, Google, Intel, Meta, Microsoft, Hewlett Packard Enterprise — explicitly to create an open NVLink alternative. Spec 1.0 released Apr 2025. Targets:

  • Scale to 1024 GPUs in one coherent fabric (vs 72 for NVL72)
  • 200 Gb/s per lane
  • Memory semantics (load/store) — not just message passing
  • UALink switches will be ASICs from Broadcom, Astera, Cornelis, etc.

UALink uses an Ethernet PHY (so it can reuse 200 Gbps SerDes IP), but the protocol layer is a custom coherent protocol (not Ethernet, not CXL, not NVLink). The first UALink chips are expected in late 2026.

3.5 Google ICI — TPU Pod Interconnect

Google's TPUs use a custom ICI (Inter-Chip Interconnect) in a 2D or 3D torus topology depending on generation, with optical reconfiguration on the 3D-torus generations from v4 onward.

| Gen | Topology | BW per chip | Total chips per pod | Notes |
|---|---|---|---|---|
| TPU v2 | 2D torus | ~600 GB/s aggregate | 256 | First public ICI |
| TPU v3 | 2D torus | ~900 GB/s | 1024 | Liquid-cooled |
| TPU v4 | 3D torus, OCS-reconfigurable | ~1200 GB/s | 4096 chips per pod | Optical Circuit Switch (Palomar/Apollo) enables runtime topology reshape per job; ISCA 2023 paper Jouppi et al. |
| TPU v5e | 2D torus | ~1200 GB/s | 256 (single pod) | Cost-optimized |
| TPU v5p | 3D torus + OCS | ~3600 GB/s aggregate | 8960 chips | Larger pod, similar topology to v4 |
| TPU v6e (Trillium) | 3D torus + OCS | ~1800 GB/s per chip | 256 per "cube" | Energy-optimized; matches H100 perf at 1/3 the power |
| TPU v7 (Ironwood, 2025) | Same | ~3600 GB/s? | 9216 (Ironwood pod) | Targets inference scaleout; FP8 + integer formats |

3D torus topology is preferred at TPU scale because it has constant per-chip wire count (6 neighbors) regardless of pod size, vs Clos which scales links per chip with the fanout. Bisection bandwidth scales only as N^(2/3) (the cross-section of a 3D torus), but for the dominant collective patterns (AllReduce, AllGather on tensor-parallel groups) torus is a natural fit.

OCS reconfiguration: An optical circuit switch lets the cube be reshaped per job: a 3D torus can be split into multiple 2D tori, or reorganized as 2×4×8 vs 4×4×4. This is critical when a TPU pod must run many parallel jobs with different shapes.
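The 6-neighbor and N^(2/3) properties of a 3D torus can be checked with a short sketch. The helper names are made up for illustration, every dimension is assumed to be even and at least 3 (so each ring has two distinct neighbors and halves cleanly), and the bisection is counted in links rather than GB/s.

```python
import math


def torus_neighbors(coord: tuple, dims: tuple) -> list:
    """Wraparound neighbors of a chip: two per dimension (+1 and -1 along each axis)."""
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            moved = list(coord)
            moved[axis] = (moved[axis] + step) % size
            neighbors.append(tuple(moved))
    return neighbors


def bisection_links(dims: tuple) -> int:
    """Links cut when the torus is halved across its longest dimension."""
    cross_section = math.prod(dims) // max(dims)
    return 2 * cross_section  # each ring crosses the cut twice because of wraparound


print(len(set(torus_neighbors((0, 0, 0), (4, 4, 4)))))  # neighbor count per chip
for dims in [(4, 4, 4), (8, 8, 8), (16, 16, 16)]:
    n = math.prod(dims)
    print(f"{dims}: N={n}, bisection links={bisection_links(dims)}, "
          f"2*N^(2/3)={2 * round(n ** (2 / 3))}")
```

Going from 64 to 4096 chips multiplies N by 64 but the bisection by only 16, which is the N^(2/3) scaling noted above, while the per-chip link count stays at 6.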

3.6 AWS NeuronLink

AWS Trainium2 (2024) and Trainium3 (2025) use NeuronLink-v3, a proprietary interconnect for scaling 16-64 Trainium2 chips in a single "UltraServer". Per-chip aggregate NeuronLink BW: ~12 Tbps (1.5 TB/s). Topology: hypercube or modified Clos depending on UltraServer size. Designed to match NVIDIA NVL72 economics for inference at scale, used in Project Rainier (Anthropic's training cluster) for Claude-family training.


4. Tier 3 — Storage Fabrics

Storage networking has converged on NVMe-oF (NVMe over Fabrics) as the modern standard, replacing earlier protocols.

4.1 NVMe-oF — NVMe over Fabrics

NVMe-oF lets a host issue NVMe commands over a network fabric instead of PCIe. It keeps the same NVMe submission/completion queue semantics and swaps in a fabric-specific transport.

| Transport | Wire | Latency overhead vs PCIe NVMe | Notes |
|---|---|---|---|
| NVMe/RDMA (RoCE v2) | UDP/IP over Ethernet w/ RDMA | +5-10 µs | Most common; uses verbs; needs lossless network (PFC) |
| NVMe/RDMA (InfiniBand) | IB transport | +3-5 µs | Lower latency than RoCE; IB clusters |
| NVMe/FC (FC-NVMe) | Fibre Channel | +20-50 µs | Drop-in replacement for SCSI/FC in enterprise SANs |
| NVMe/TCP | TCP/IP | +30-80 µs (CPU-bound) | Most portable, runs over any IP network; CPU-heavy without offload |

Capsule semantics: Every NVMe-oF command is wrapped in a "capsule" containing the NVMe command opcode/parameters plus inline or referenced data. For small commands and writes, the data is inlined with the capsule (single round-trip). For larger I/O, the data is fetched via RDMA READ from the host's buffer (for writes) or pushed via RDMA WRITE (for reads). The capsule contains:

  • 64-byte NVMe Submission Queue Entry (SQE)
  • Optional payload (inline data)
  • For transfers that are not inlined: an SGL descriptor referencing the host buffer (an RDMA-registered MR); NVMe-oF uses SGLs rather than PRPs
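The inline-versus-referenced decision for a write can be sketched as follows. This is a simplified illustration: the 4096-byte in-capsule limit is a hypothetical value (real targets advertise their own), the SQE packing fills only a few fields and leaves the data pointer zeroed, and no RDMA calls are made.

```python
import struct

SQE_SIZE = 64
IN_CAPSULE_LIMIT = 4096  # hypothetical limit; real targets advertise their own
NVME_CMD_WRITE = 0x01    # NVM command set write opcode


def build_sqe(opcode: int, command_id: int, nsid: int, slba: int, nblocks: int) -> bytes:
    """Pack a simplified 64-byte submission queue entry (most fields left zero)."""
    sqe = struct.pack("<BBHI", opcode, 0, command_id, nsid)  # CDW0 + NSID (8 bytes)
    sqe += bytes(32)                                         # CDW2-3, MPTR, DPTR zeroed
    sqe += struct.pack("<QI", slba, nblocks - 1)             # CDW10-11 SLBA, CDW12 NLB (0-based)
    sqe += bytes(12)                                         # CDW13-15
    assert len(sqe) == SQE_SIZE
    return sqe


def build_write_capsule(sqe: bytes, data: bytes) -> tuple:
    """Return (capsule bytes, mode): inline small payloads, reference large ones."""
    if len(data) <= IN_CAPSULE_LIMIT:
        return sqe + data, "inline"  # one round trip: command and data travel together
    return sqe, "rdma-read"          # target would fetch the payload with RDMA READ


sqe = build_sqe(NVME_CMD_WRITE, command_id=1, nsid=1, slba=0, nblocks=8)
small, mode_small = build_write_capsule(sqe, bytes(512))
large, mode_large = build_write_capsule(sqe, bytes(1 << 20))
print(len(small), mode_small)
print(len(large), mode_large)
```

The size threshold is the tuning knob: inlining saves a round trip for small writes, while large writes avoid bloating the capsule and let the target pull data at its own pace.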

Queue mapping: NVMe has admin + I/O queues. Over fabrics, each I/O queue is mapped to a single QP (Queue Pair) in RDMA or a single TCP connection in NVMe/TCP. Multi-queue NVMe-oF therefore uses many QPs; CPU pinning of queues to cores matters significantly for performance.

NVMe/TCP optimizations 2024-2026:

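  • TCP-DDP/zero-copy: Linux kernel 6.x supports zero-copy receive for NVMe/TCP, eliminating the buffer copy out of the sk_buff. Note that MSG_ZEROCOPY is the send-side socket flag; receive-side zero copy relies on TCP_ZEROCOPY_RECEIVE or on NIC direct data placement (DDP).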
  • kTLS offload: Encryption offloaded to NIC for secure NVMe/TCP.
  • io_uring submission: Hybrid NVMe/TCP via io_uring is the modern path and is competitive with NVMe/RDMA on light loads.

4.2 Fibre Channel

Despite predictions of its demise, FC remains entrenched in enterprise SANs. Speeds: 8/16/32/64/128 GFC (Gigabit Fibre Channel). 64GFC = 64 Gbps per port (~6.4 GB/s after encoding).

FCP (Fibre Channel Protocol): The SCSI-over-FC mapping. Largely replaced by NVMe-oF/FC for new deployments.

NPIV (N_Port ID Virtualization): Lets multiple "virtual" FC ports share one physical HBA — essential for VM passthrough on FC SANs.

Zoning:

  • Soft zoning: Name-server enforced; the FC switch's name service hides devices in other zones. Bypassable if attacker has hard-coded WWPNs.
  • Hard zoning: ASIC-enforced; switch hardware drops frames that violate zone rules.

FCoE (Fibre Channel over Ethernet): Was supposed to be the unified-fabric answer (FC over lossless Ethernet, with DCB extensions). Largely deprecated in favor of iSCSI and NVMe-oF; few new FCoE deployments since ~2018.

4.3 iSCSI

iSCSI (SCSI over TCP) is the long-time low-end alternative to FC. Still in widespread use for general-purpose SANs on commodity Ethernet. Latency: ~100-500 µs (TCP + SCSI translation). Increasingly displaced by NVMe/TCP, which has lower latency and the same simplicity.

4.4 SAS / SATA (Drive-Local Fabrics)

These are direct-attach drive interfaces, not network fabrics, but worth a note:

| Standard | Generation | Per-drive BW | Latency | Use case |
|---|---|---|---|---|
| SATA III | 6 Gbps | 600 MB/s | µs | Boot SSDs, legacy HDDs |
| SAS-4 | 22.5 Gbps | 2.25 GB/s | 10s of µs | Enterprise SAS HDDs |
| NVMe (U.2/U.3/E1.S/E3.S) | PCIe 4/5/6 | up to 16 GB/s (PCIe 5 x4) | <10 µs | All modern NVMe SSDs |

SAS/SATA traffic to a JBOD enclosure flows over SAS expanders (12 Gbps SAS-3 or 22.5 Gbps SAS-4). NVMe-oF + E1.S/E3.S enclosures (EDSFF form factors) are replacing SAS JBODs for high-density flash deployments.


5. Tier 4 — Datacenter Network

5.1 Ethernet Evolution

Ethernet has scaled from 10 Mbps in 1983 to 1.6 Tbps in 2025 — 5 orders of magnitude in 42 years. The driver since ~2010 has been SerDes per lane × lane count.

| Speed | First standard (IEEE) | Year | Lanes × per-lane | Modulation | FEC | Reach | Status 2026 |
|---|---|---|---|---|---|---|---|
| 1 GbE | 802.3z | 1998 | 1 × 1 Gbps | NRZ | none | SX/LX/CX | Legacy |
| 10 GbE | 802.3ae | 2002 | 1 × 10 Gbps | NRZ | none | SR/LR/ER | Legacy |
| 25 GbE | 802.3by | 2016 | 1 × 25 Gbps | NRZ | RS(528,514) opt. | SR/LR | Mainstream edge |
| 40 GbE | 802.3ba | 2010 | 4 × 10 Gbps | NRZ | none | SR4/LR4 | Largely deprecated |
| 50 GbE | 802.3cd | 2018 | 1 × 50 Gbps | PAM4 | RS(544,514) | SR/FR/LR | Server NIC |
| 100 GbE | 802.3ba; 4 × 25G electrical lanes in 802.3bj/bm | 2010; 2014-2015 | 4 × 25 Gbps (NRZ); later 2 × 50 Gbps (PAM4) | NRZ → PAM4 | RS(528,514) "KR4" / RS(544,514) | SR4/LR4/CR4 | Mainstream server NIC |
| 200 GbE | 802.3bs | 2017 | 4 × 50 Gbps PAM4 | PAM4 | RS(544,514) | SR4/DR4/FR4/LR4 | Common |
| 400 GbE | 802.3bs | 2017 | 8 × 50 Gbps PAM4 (early); 4 × 100 Gbps PAM4 (2022+) | PAM4 | RS(544,514) | SR4/DR4/FR4/LR4/ZR | Mainstream AI cluster spine 2023-2025 |
| 800 GbE | 802.3df | 2024 | 8 × 100 Gbps PAM4 | PAM4 | RS(544,514) | SR8/DR8/FR8/2xFR4 + ZR | Latest AI cluster spine |
| 1.6 TbE | 802.3dj (project) | 2026 | 8 × 200 Gbps PAM4 | PAM4 | RS(544,514) or "concatenated" | SR8/DR8 (3-5m DAC), VR8, CPO | Next gen; sampling 2025-2026 |
| 3.2 TbE | Future | 2028+ | 8 × 400 Gbps PAM6 or coherent | likely PAM6 or coherent | new | mostly optical | Roadmap |

SerDes generations (per-lane signaling):

  • NRZ, ~10 Gbaud (10 Gbps per lane): through ~2014
  • NRZ, ~25 Gbaud (25 Gbps per lane): through 2018; backbone of 100GbE 4×25
  • PAM4, ~26.5 Gbaud (50 Gbps per lane): 2018-2022; 100GbE 2×50 and 400GbE 8×50
  • PAM4, ~53 Gbaud (100 Gbps per lane): 2022-2025; 800GbE 8×100, ConnectX-7
  • PAM4, ~106 Gbaud (200 Gbps per lane): 2024-2027; 1.6TbE 8×200, ConnectX-8 (200 Gbps per lane, 800 Gb/s port)

PAM4 carries 2 bits per symbol, so the symbol rate in Gbaud is half the per-lane bit rate. Above ~224 Gbps per lane (~112 Gbaud PAM4), electrical SerDes hits SNR walls (PCB loss, connector reflections). The frontier beyond this is co-packaged optics (CPO): placing the optical engines next to the switch ASIC so that SerDes signals traverse only millimeters of substrate, not centimeters of PCB.
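The relationship between per-lane bit rate and symbol rate is easy to get wrong, so here is a small worked sketch. The line rates used are the nominal Ethernet lane signaling rates including encoding/FEC overhead (an assumption of this example, not figures from the text above); `symbol_rate_gbaud` is an illustrative helper.

```python
import math


def symbol_rate_gbaud(line_rate_gbps: float, levels: int) -> float:
    """Symbol rate = bit rate / bits per symbol; NRZ has 2 levels, PAM4 has 4."""
    bits_per_symbol = math.log2(levels)
    return line_rate_gbps / bits_per_symbol


# (label, nominal lane line rate in Gbps, modulation levels)
LANES = [
    ("25G NRZ", 25.78125, 2),
    ("50G PAM4", 53.125, 4),
    ("100G PAM4", 106.25, 4),
    ("200G PAM4", 212.5, 4),
]

if __name__ == "__main__":
    for label, gbps, levels in LANES:
        print(f"{label:10s} {gbps:9.5f} Gbps -> {symbol_rate_gbaud(gbps, levels):8.4f} Gbaud")
```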

FEC (Forward Error Correction):

  • KR4: The common name for RS(528,514), introduced for 100GbE backplane and copper links (100GBASE-KR4/CR4) in 802.3bj.
  • RS(528,514): Reed-Solomon code over 10-bit symbols, 528-symbol codeword with 514 data + 14 parity (BER from ~1e-5 to ~1e-15). Used in 25GbE (optional) and 4 × 25G NRZ 100GbE.
  • RS(544,514): Also called KP4. 544 symbols, 30 parity. Used on PAM4 lanes from 50GbE through 800GbE; the stronger code is needed for PAM4's lower per-symbol SNR.
  • Concatenated FEC: For 1.6TbE, an additional outer code may be layered on top of RS(544,514) for the most challenging PAM4 channels.

FEC adds latency — typically 100-200 ns of switching latency at 100 GbE PAM4. For latency-sensitive HPC, this is significant; for general DC traffic, it's invisible.
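The code parameters translate directly into correction strength and a latency floor. The sketch below assumes 10-bit symbols and uses the standard Reed-Solomon property that n-k parity symbols correct up to (n-k)/2 symbol errors; the helper names are illustrative. The codeword time is only a lower bound on FEC latency, since the decoder must receive a full codeword before it can correct it.

```python
def rs_properties(n: int, k: int, symbol_bits: int = 10) -> dict:
    """Basic properties of an RS(n, k) code over symbol_bits-bit symbols."""
    parity = n - k
    return {
        "parity_symbols": parity,
        "correctable_symbols": parity // 2,   # t = (n - k) / 2
        "overhead_fraction": n / k - 1,       # extra bits relative to payload
        "codeword_bits": n * symbol_bits,
    }


def codeword_time_ns(n: int, rate_gbps: float, symbol_bits: int = 10) -> float:
    """Time to serialize one codeword; 1 Gbps equals 1 bit per nanosecond."""
    return n * symbol_bits / rate_gbps


if __name__ == "__main__":
    for n, k in [(528, 514), (544, 514)]:
        print(f"RS({n},{k})", rs_properties(n, k))
    print("RS(544,514) codeword at 100 Gbps:", codeword_time_ns(544, 100.0), "ns")
```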

5.2 Data Center Bridging (DCB) Stack

For lossless Ethernet (required by RoCE v1/v2 and FCoE), IEEE 802.1 added four extensions, collectively called DCB:

| Standard | Name | What it does |
|---|---|---|
| 802.1Qbb (PFC) | Priority Flow Control | Per-class pause: instead of pausing all traffic, pause only one of 8 traffic classes (CoS) |
| 802.1Qaz (ETS) | Enhanced Transmission Selection | Bandwidth allocation: assign min/max % of link to each class group |
| 802.1Qau (QCN) | Quantized Congestion Notification | Switch sends explicit congestion feedback to source (rarely used today, superseded by ECN) |
| 802.1Qaz (DCBX) | DCB Exchange protocol | LLDP-based exchange of DCB capabilities/config between switch and endpoint |

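PFC is the workhorse. A receiving switch port whose buffer crosses its pause threshold sends a PFC frame (a MAC control frame) back to the upstream sender. The frame carries a class-enable vector selecting which of the 8 priorities must stop, plus a 16-bit pause timer per priority, expressed in quanta of 512 bit times. The sender pauses each listed class for the encoded time, or until a frame with a zero timer resumes it. PFC pause storms and deadlocks (the latter from cyclic buffer dependencies) are the bane of large lossless networks (see §11).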
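To get a feel for the pause timer, here is a short calculation. It assumes the standard pause quantum of 512 bit times and a 16-bit timer field; `pause_duration_us` is an illustrative helper.

```python
PAUSE_QUANTUM_BITS = 512   # one pause quantum = 512 bit times on the link
MAX_QUANTA = 0xFFFF        # largest value of the 16-bit per-priority timer


def pause_duration_us(quanta: int, link_gbps: float) -> float:
    """Wall-clock pause time for a given quanta count at a given link speed."""
    bits = quanta * PAUSE_QUANTUM_BITS
    bits_per_us = link_gbps * 1e3   # 1 Gbps = 1000 bits per microsecond
    return bits / bits_per_us


if __name__ == "__main__":
    for speed in (10, 100, 400):
        print(f"{speed:4d} Gbps: max pause {pause_duration_us(MAX_QUANTA, speed):10.3f} µs")
```

The maximum pause shrinks as links get faster, which is why switches refresh the pause with repeated frames while congestion persists.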

ETS assigns bandwidth groups: e.g., 60% for RDMA storage traffic, 30% for compute RDMA, 10% for management. Within a group, classes share bandwidth proportionally.

5.3 InfiniBand

InfiniBand (IB) is the gold standard for HPC/AI fabrics: lower latency, higher BW, and richer semantics than Ethernet. Maintained by the InfiniBand Trade Association (IBTA). Vendors: NVIDIA (Mellanox) is the dominant supplier (~80%+). Intel's competing Omni-Path fabric was IB-derived rather than IBTA InfiniBand; Intel exited that business and Cornelis Networks now develops it (see §6.4).

| Spec | Year | Per-lane signaling | x4 link BW (one direction) | Notes |
|---|---|---|---|---|
| SDR | 2003 | 2.5 Gbps NRZ | 8 Gbps (1 GB/s) | First gen; 8b/10b encoding (x4 figure is the data rate) |
| DDR | 2005 | 5 Gbps NRZ | 16 Gbps | 8b/10b encoding |
| QDR | 2008 | 10 Gbps NRZ | 32 Gbps | 8b/10b encoding |
| FDR | 2011 | 14 Gbps NRZ | 56 Gbps | First w/ 64b/66b encoding |
| EDR | 2014 | 25 Gbps NRZ | 100 Gbps | ConnectX-4 |
| HDR | 2017 | 50 Gbps PAM4 | 200 Gbps | ConnectX-6 / Quantum HDR switches |
| NDR | 2021 | 100 Gbps PAM4 | 400 Gbps | ConnectX-7 / Quantum-2 switches; dominant in 2024 AI clusters |
| XDR | 2024 | 200 Gbps PAM4 | 800 Gbps | ConnectX-8 / Quantum-X800 switches; AI scale 2025-2026 |
| GDR | Future | 400 Gbps PAM4 or coherent | 1.6 Tbps | Planned ~2027-2028 |

IB queue pair types:

| QP type | Reliable | Connected | Best for | Used by |
|---|---|---|---|---|
| RC (Reliable Connection) | yes | yes (1-to-1) | Bulk RDMA WRITE/READ, latency-sensitive | Most apps; MPI; NCCL; RDMA storage |
| UC (Unreliable Connection) | no (drops silently) | yes (1-to-1) | Streaming where loss is OK | Rare in modern code |
| UD (Unreliable Datagram) | no | no (1-to-many) | Multicast, discovery, low-msg-rate broadcasts | OpenSM SA queries, MPI bootstrap |
| XRC (Extended Reliable Connection) | yes | semi-connected; one QP serves many remote processes per node | Many-process MPI to reduce QP scaling | Mellanox MPI variants |
| DCT (Dynamically Connected Transport) | yes | dynamic (connect on demand) | Scale: 10000s of processes without 10000s of QPs | NVIDIA stack; UCX |

For an N-process MPI job using RC, each rank needs N-1 QPs — at N=10000 that's 100M QPs in the cluster, blowing through NIC QP-context memory. DCT solves this by reusing a small pool of QPs that get dynamically rewired to peers as messages arrive.
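The quadratic growth is worth seeing in numbers. A minimal sketch, assuming a fully connected job where every rank holds one RC QP per peer (function names are illustrative):

```python
def rc_qps_per_rank(n_ranks: int) -> int:
    """Each rank needs one RC QP for every other rank."""
    return n_ranks - 1


def rc_qps_cluster(n_ranks: int) -> int:
    """Total QP endpoints across the job: N * (N - 1)."""
    return n_ranks * rc_qps_per_rank(n_ranks)


if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        print(f"N={n:6d}: {rc_qps_per_rank(n):6d} QPs/rank, {rc_qps_cluster(n):12,d} QPs total")
```

Per-rank state grows linearly with N, but the cluster-wide total grows quadratically, which is what a pooled transport like DCT avoids.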

Verbs basics (covered in §9):

  • MR (Memory Region): a registered, pinned, IOMMU-mapped region. Has lkey (local) and rkey (remote) used in RDMA WRITEs/READs.
  • CQ (Completion Queue): receives completions when WRs finish.
  • WR (Work Request): a single SEND/RECV/READ/WRITE/ATOMIC posted to a QP's send or receive queue.
  • Doorbells: MMIO writes that tell the NIC "new work posted" — kernel-bypass.

OpenSM (Open Subnet Manager): Software that runs on one fabric-attached server and:

  • Discovers all switches, routers, and end-nodes (HCAs)
  • Assigns LIDs (16-bit local routing IDs)
  • Computes routing tables (per-switch, per-destination)
  • Configures partitions, QoS, SL-to-VL mappings
  • Monitors fabric health

A typical NDR cluster runs one master subnet manager plus one or more hot standbys; if the master fails, a standby takes over fabric management.

Partitions (P_Keys): IB equivalent of VLANs. A 16-bit P_Key tags every packet; switches/HCAs enforce isolation. Used in multi-tenant clusters where job A must not see job B's traffic.

Topology: IB clusters at scale almost always use a fat-tree (Charles Leiserson, 1985), built in practice as a folded Clos network in which aggregate link bandwidth is preserved at every level toward the spine, giving full bisection bandwidth. Variants:

  • Full fat tree: Every leaf has full BW upward — most expensive.
  • Tapered (2:1 oversubscribed): Leaf has 2× downlinks vs uplinks. Common in cost-sensitive deployments.
  • Dragonfly+ / Dragonfly: Used for some very large fabrics (Dragonfly+ on NVIDIA/Mellanox InfiniBand, dragonfly on HPE Slingshot).
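The taper of a leaf switch is just the ratio of downlink to uplink capacity. A minimal sketch, with port counts and speeds chosen purely for illustration:

```python
def oversubscription(downlinks: int, down_gbps: float,
                     uplinks: int, up_gbps: float) -> float:
    """Leaf oversubscription ratio: server-facing capacity / spine-facing capacity."""
    return (downlinks * down_gbps) / (uplinks * up_gbps)


if __name__ == "__main__":
    # Illustrative port splits for a 64-port leaf at 400 Gbps per port.
    print("full fat tree :", oversubscription(32, 400, 32, 400))   # 1.0
    print("2:1 tapered   :", oversubscription(32, 400, 16, 400))   # 2.0
```

A ratio of 1.0 means every server can drive its full link rate toward the spine at once; 2.0 means at most half of that under worst-case traffic.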

5.4 RoCE — RDMA over Converged Ethernet

RoCE brings IB verbs to Ethernet. Two versions:

| Version | Layer | Encapsulation | Routable? |
|---|---|---|---|
| RoCEv1 | L2 only | Ethertype 0x8915 directly in Ethernet frame | No (single broadcast domain) |
| RoCEv2 | L3 | UDP/IP encap, UDP port 4791 | Yes (runs over any IP network) |

RoCEv1 is dead — every modern deployment is RoCEv2.

Requirements:

  • Lossless fabric (PFC enabled) — RoCE inherits IB's no-drop assumption. A packet drop forces a go-back-N retransmit, killing throughput.
  • DCQCN congestion control (see §5.5): without it, sustained congestion falls through to PFC, and the resulting pauses cause head-of-line blocking and congestion spreading.
  • ECN marking on switches (set CE bit at congestion).

Tuning the DCB triangle:

  1. Configure PFC on the RDMA priority (typically priority 3).
  2. Enable ECN with watermarks (Kmin ~10-15% buffer, Kmax ~80%) so most congestion is signaled via ECN before PFC fires.
  3. Run DCQCN at the endpoint to react to ECN by rate-throttling.

When tuned right: ECN does 99% of the congestion management, PFC is a safety net for rare bursts.
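The ECN watermarks in step 2 define a RED-style marking curve on the switch queue. The sketch below assumes linear marking between Kmin and Kmax up to a maximum probability `p_max`, and certain marking above Kmax; the 0.1 default for `p_max` and the buffer size are example values, not recommendations.

```python
def ecn_mark_probability(queue_bytes: int, kmin: int, kmax: int,
                         p_max: float = 0.1) -> float:
    """Probability that an arriving packet gets its ECN CE bit set."""
    if queue_bytes <= kmin:
        return 0.0                      # short queue: never mark
    if queue_bytes >= kmax:
        return 1.0                      # long queue: mark everything
    # Linear ramp from 0 at Kmin to p_max at Kmax.
    return p_max * (queue_bytes - kmin) / (kmax - kmin)


if __name__ == "__main__":
    buffer_bytes = 1_000_000            # example per-queue buffer
    kmin = buffer_bytes * 10 // 100     # ~10% of buffer
    kmax = buffer_bytes * 80 // 100     # ~80% of buffer
    for q in (50_000, 100_000, 450_000, 800_000, 950_000):
        print(f"queue={q:7d} B -> p_mark={ecn_mark_probability(q, kmin, kmax):.3f}")
```

The PFC pause threshold should sit above Kmax so that ECN has room to act first.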

Soft-RoCE (rxe): A pure-software RoCE implementation in the Linux kernel. Useful for development/test on hardware without RoCE NICs (any Ethernet NIC works). Performance is poor (verbs over UDP without offload), but the API surface matches real hardware.

5.5 Datacenter Congestion Control — A Deep Dive

Datacenter congestion control is its own subfield. The fundamental tension: low latency requires small buffers / short queues, while high throughput requires near-100% link utilization. Solving both at line rate, across thousands of concurrent flows, is hard.

DCTCP (Alizadeh, Greenberg, Maltz, Padhye, Patel, Prabhakar, Sengupta, Sridharan, SIGCOMM 2010): Uses ECN with fractional marking. The receiver echoes ECN marks in its ACKs, and the sender maintains a moving average α ← (1-g)α + g × F, where F is the fraction of packets ECN-marked in the last window. The sender then scales cwnd by (1 - α/2), so the cut ranges from almost nothing up to TCP's 50%. Works on commodity Ethernet with ECN.
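The DCTCP update can be sketched in a few lines. The gain g = 1/16 follows the value suggested in the paper; the function names are illustrative, and a real stack applies the window cut at most once per window of data.

```python
def dctcp_update_alpha(alpha: float, marked: int, acked: int, g: float = 1 / 16) -> float:
    """Moving average of the fraction F of ECN-marked packets in the last window."""
    f = marked / acked if acked else 0.0
    return (1 - g) * alpha + g * f


def dctcp_reduce_cwnd(cwnd: float, alpha: float) -> float:
    """Scale the window by (1 - alpha/2): gentle for light marking, 50% when alpha = 1."""
    return cwnd * (1 - alpha / 2)


if __name__ == "__main__":
    alpha, cwnd = 0.0, 100.0
    # Three windows in which 20% of packets were marked.
    for _ in range(3):
        alpha = dctcp_update_alpha(alpha, marked=20, acked=100)
        cwnd = dctcp_reduce_cwnd(cwnd, alpha)
        print(f"alpha={alpha:.4f} cwnd={cwnd:.2f}")
```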

DCQCN (Zhu, Eran, Firestone, Guo, Lipshteyn, Liron, Padhye, Raindel, Yahia, Zhang, SIGCOMM 2015): The Microsoft Azure solution for RoCEv2.

  • Receiver-side: when receiving an ECN-marked packet, sends a CNP (Congestion Notification Packet) to the sender.
  • Sender adjusts a "target rate" and "current rate" based on CNP feedback.
  • Parameters: Kmin, Kmax, Pmax (switch ECN marking thresholds), the α gain g, and the timers and byte counters that drive fast recovery, additive increase, and hyper increase.
  • Default settings are notoriously hard to tune; Microsoft's experience report (Guo et al., SIGCOMM 2016, "RDMA over Commodity Ethernet at Scale") documents painful real-world deployment.
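A sketch of the sender-side reaction, following the update rules described in the DCQCN paper: on a CNP the target rate remembers the current rate, the current rate is cut by α/2, and α rises; without CNPs α decays and the rate climbs back toward the target. Timers, byte counters, and the additive/hyper increase stages are omitted, and the class and method names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class DcqcnSender:
    rate_current: float          # Rc, e.g. in Gbps
    rate_target: float           # Rt
    alpha: float = 1.0           # congestion estimate, starts at 1
    g: float = 1 / 256           # alpha gain

    def on_cnp(self) -> None:
        """Congestion Notification Packet received: cut rate, raise alpha."""
        self.rate_target = self.rate_current
        self.rate_current *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_alpha_timer(self) -> None:
        """No CNP during the last alpha-update interval: decay alpha."""
        self.alpha = (1 - self.g) * self.alpha

    def fast_recovery_step(self) -> None:
        """Move the current rate halfway back toward the target rate."""
        self.rate_current = (self.rate_target + self.rate_current) / 2


if __name__ == "__main__":
    s = DcqcnSender(rate_current=100.0, rate_target=100.0)
    s.on_cnp()
    print("after CNP:", s)
    for _ in range(3):
        s.fast_recovery_step()
    print("after 3 recovery steps:", s)
```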

TIMELY (Mittal, Lam, Dukkipati, Blem, Wassel, Ghobadi, Vahdat, Wang, Wetherall, Zats, SIGCOMM 2015): RTT-based, not ECN-based. The sender measures fine-grained RTT (NIC timestamps) and adjusts its rate from the RTT gradient, with low and high RTT thresholds as guard rails. Works without ECN-aware switches but requires precise NIC timestamps. Used at Google in the pre-Swift era.

HPCC (Li, Miao, Liu, Zhuang, Feng, Tang, Cao, Zhang, Kelly, Alizadeh, Yu, SIGCOMM 2019): Uses In-band Network Telemetry (INT): switches embed per-hop queue depth, transmitted bytes, and timestamps into packet headers. The sender computes precise per-hop utilization U and adjusts its window. The paper reports flow completion times up to 95% shorter than DCQCN and TIMELY. Deployed in Alibaba's RDMA networks.

Swift (Kumar, Dukkipati, Jang, Wassel, Wu, Montazeri, Wang, Springborn, Alfeld, Ryan, Wetherall, Vahdat, SIGCOMM 2020): Google's evolution of TIMELY. Decouples fabric delay (network RTT) from endpoint delay (NIC + host stack). Two-loop control: one for fabric congestion, one for endpoint congestion. Production protocol at Google for both TCP and RDMA-like traffic.

PowerTCP (Addanki, Michel, Schmid, NSDI 2022): Combines the two classic signal types: absolute state (queue length or delay, the "voltage") and its rate of change (the "current"). Their product, "power", is the congestion signal, so the sender reacts both to how large a queue is and to how fast it is building. Especially good for short flows that don't get many RTT samples.

EQDS (Olteanu, Agache, Voinescu, Raiciu, NSDI 2022): Receiver-driven scheduling. Senders announce demand; receivers issue per-packet "credits" controlling who sends when. This moves queuing out of the network core to the edge and largely removes receiver-side (incast) congestion; well-suited for AI training traffic where one receiver is the hotspot (e.g., a parameter server). Reportedly explored in NVIDIA BlueField stack experiments.

IRN (Improved RoCE NIC; Mittal, Shpiner, Panda, Zahavi, Krishnamurthy, Ratnasamy, Shenker, SIGCOMM 2018): Replaces go-back-N with selective-ACK + bitmap retransmit for RoCEv2. Allows running RoCE on lossy fabric (no PFC needed), trading off some throughput for elimination of PFC pause storms.

Loom (Stephens, Akella, Swift, NSDI 2019): Per-flow scheduling at the host via flexible NIC packet-scheduling primitives; a complement to switch-side CC.

5.6 iWARP

iWARP (Internet Wide Area RDMA Protocol) is RDMA layered over TCP, not UDP. Three protocol layers stack:

  • RDMAP — RDMA verbs (above DDP)
  • DDP (Direct Data Placement) — handles segmentation and reassembly into pre-registered buffers
  • MPA (Marker PDU Aligned framing) — frames DDP segments and adds CRCs

Pros: Runs over any IP network. Works in WAN. No special PFC tuning. Cons: TCP overhead (slower start, complex congestion control) limits throughput vs RoCE. NIC implementations are rare today; Chelsio is the main vendor. Largely a niche choice in 2026.

5.7 Ultra Ethernet Consortium (UEC) 1.0

UEC is a Linux Foundation project (launched July 2023) explicitly chartered to build a lossy, packet-spraying, modern transport for AI workloads that beats InfiniBand on scale-out cost while matching its latency. Spec 1.0 released June 2025; member companies include AMD, Arista, Broadcom, Cisco, Eviden, HPE, Intel, Meta, Microsoft, NVIDIA (joined late 2024).

Key innovations:

| Innovation | What it does | Vs traditional |
|---|---|---|
| RUD / RUDI (Reliable Unordered Delivery; RUDI is the variant for idempotent operations) | Transport delivers packets to the NIC in any order; the receiving NIC places or reorders them in hardware | vs RC's strict in-order delivery |
| Packet spraying | Packets of one flow are spread across many paths (per-packet ECMP) | vs traditional 5-tuple-hashed ECMP, which pins a flow to one path |
| Out-of-order delivery | NIC + transport handle reorder; sender doesn't pace per path | Eliminates head-of-line blocking |
| Ephemeral connections | Connection state set up at first message, torn down after idle; no persistent QPs | vs RC's persistent QPs |
| Modernized CC | Built-in HPCC/Swift-style signaling | vs DCQCN tuning headaches |
| libfabric provider | Software accessed via OFI providers | Familiar APIs |

The goal is to use commodity Ethernet switches (which can ECMP per-packet via load balancing on packet hashes) to achieve near-100% utilization without the IB premium. AMD (Pensando NICs) and Broadcom (Tomahawk switches) are leading hardware deployment.
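The difference between flow-hashed ECMP and packet spraying can be shown with a toy hash. This sketch assumes a switch that hashes the 5-tuple plus a per-packet entropy value; CRC32 stands in for the switch hash, and `ecmp_path` is an illustrative helper, not a UEC or vendor API.

```python
import zlib


def ecmp_path(five_tuple: tuple, n_paths: int, entropy: int = 0) -> int:
    """Pick an uplink by hashing the 5-tuple plus an optional entropy value."""
    key = "|".join(map(str, (*five_tuple, entropy))).encode()
    return zlib.crc32(key) % n_paths


# One RoCEv2-style flow: (src IP, dst IP, protocol, src port, dst port)
FLOW = ("10.0.0.1", "10.0.1.7", 17, 49152, 4791)
N_PATHS = 8

# Classic ECMP: the hash input never changes, so every packet takes one path.
static_paths = {ecmp_path(FLOW, N_PATHS) for _ in range(64)}

# Spraying: the sender varies the entropy value per packet, spreading the flow.
sprayed_paths = {ecmp_path(FLOW, N_PATHS, entropy=seq) for seq in range(64)}

if __name__ == "__main__":
    print("flow-hashed ECMP uses paths:", sorted(static_paths))
    print("packet spraying uses paths :", sorted(sprayed_paths))
```

Spraying removes the hash-collision hotspots of flow-level ECMP, at the price of out-of-order arrival that the receiving NIC must tolerate.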

UEC is the industry's bet that the next generation of AI scale-out fabrics will run on Ethernet, not IB.

5.8 Topologies — ASCII Diagrams

Two-tier Clos (Leaf-Spine) — typical DC topology:

            ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐
  Spine     │ S1   │ │ S2   │ │ S3   │ │ S4   │
            └─┬─┬──┘ └─┬─┬──┘ └─┬─┬──┘ └─┬─┬──┘
              │ │      │ │      │ │      │ │
       ╔══════╪═╪══════╪═╪══════╪═╪══════╪═╪═════╗
       ║      │ │      │ │      │ │      │ │     ║   Each leaf
       ║    every leaf has 4 uplinks (1 to each spine) ║   ────────
       ╚══════╪═╪══════╪═╪══════╪═╪══════╪═╪═════╝   ─ 32-64 server
              │ │      │ │      │ │      │ │            downlinks
            ┌─┴─┴──┐ ┌─┴─┴──┐ ┌─┴─┴──┐ ┌─┴─┴──┐         ─ 4-16 spine
  Leaf      │ L1   │ │ L2   │ │ L3   │ │ L4   │           uplinks
            └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘
               │        │        │        │
            servers  servers   servers   servers
            (~32)    (~32)     (~32)     (~32)

Three-tier fat-tree (Charles Leiserson 1985):

                  Core (Super-spine)
                ┌──┬──┬──┬──┬──┐
                │  │  │  │  │  │
        ┌───────┴──┴──┴──┴──┴──┴───────┐
        │          full bisection      │
   ┌────┴────┐  ┌────┴────┐  ┌────┴────┐
   │ Spine 1 │  │ Spine 2 │  │ Spine 3 │
   └─┬─┬─┬─┬─┘  └─┬─┬─┬─┬─┘  └─┬─┬─┬─┬─┘
     │ │ │ │      │ │ │ │      │ │ │ │
   ┌─┴─┐         ┌─┴─┐         ┌─┴─┐
   │L1 │  ...    │L9 │  ...    │L17│  ...
   └─┬─┘         └─┬─┘         └─┬─┘
   servers       servers       servers

Dragonfly (HPE Slingshot, Cray Aries; NVIDIA Quantum-2 InfiniBand offers the Dragonfly+ variant):

        ┌─────────── group 1 ──────────┐    ┌─────── group 2 ──────┐
        │  ┌───┐  ┌───┐  ┌───┐  ┌───┐  │    │  ┌───┐ ┌───┐ ┌───┐   │
        │  │S1 ├──┤S2 ├──┤S3 ├──┤S4 │  ◄════►  │S1 ├─┤S2 ├─┤S3 │.. │
        │  └─┬─┘  └─┬─┘  └─┬─┘  └─┬─┘  │    │  └─┬─┘ └─┬─┘ └─┬─┘   │
        │    │      │      │      │     │    │    │     │     │     │
        │  servers  servers ...           │    │  ...                 │
        │   (1 link to every other        │    │                      │
        │    switch in same group)        │    │                      │
        └─────────────────────────────────┘    └──────────────────────┘

Topology: 3-tier
  1. Within a group, switches fully meshed (1 hop)
  2. Between groups, fewer long links (typically 1-4 per pair of groups, called "global links")
  3. To reach a faraway server: src_sw → src_group_egress_sw → dst_group_ingress_sw → dst_sw  (3 hops max)

Dragonfly's advantage over fat-tree: ~30% fewer optical links for the same bisection. Disadvantage: it requires adaptive routing (picking which global link to use based on congestion) to keep traffic from concentrating on a few global links. HPE's Rosetta switch (Slingshot) and NVIDIA's Quantum-2 Dragonfly+ mode both implement adaptive routing in switch silicon.

Rail-optimized topology (critical for AI):

    GPU 0       GPU 1       GPU 2       GPU 3       (per server)
   ┌──┴──┐    ┌──┴──┐    ┌──┴──┐    ┌──┴──┐
   │NIC0 │    │NIC1 │    │NIC2 │    │NIC3 │
   └──┬──┘    └──┬──┘    └──┬──┘    └──┬──┘
      │          │          │          │
   ┌──┴─────────────────────────────────┴──┐
   │   Rail 0:  Leaf "Rail-0" connects     │
   │   ALL servers' NIC0 to same leaf      │  ← rail-optimized leaf-spine
   │   Rail 1:  Leaf "Rail-1" connects all NIC1
   │   Rail 2:  Leaf "Rail-2" connects all NIC2
   │   Rail 3:  Leaf "Rail-3" connects all NIC3
   └────────────────────────────────────────┘

In rail-optimized topology, GPU-i on every server connects to the same leaf in "rail i". For AllReduce, where every GPU communicates only with the same rank on other servers (ring reduction stays within a rail), traffic never crosses rails — eliminating cross-rail congestion. NCCL with NCCL_IB_HCA set per-rail uses this naturally.
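The rail property can be checked with a toy model. This sketch assumes one NIC per GPU and names leaves after the GPU's local index; `rail_leaf`, `ring_peers`, and `crosses_rails` are illustrative helpers.

```python
def rail_leaf(local_gpu: int) -> str:
    """GPU i (and its NIC i) on every server attaches to the leaf of rail i."""
    return f"rail-{local_gpu}"


def ring_peers(local_gpu: int, n_servers: int) -> list:
    """Ring over the GPUs with the same local index: (server, gpu) -> next server."""
    return [((s, local_gpu), ((s + 1) % n_servers, local_gpu))
            for s in range(n_servers)]


def crosses_rails(src: tuple, dst: tuple) -> bool:
    """True if traffic between two (server, gpu) endpoints must leave its rail."""
    return rail_leaf(src[1]) != rail_leaf(dst[1])


if __name__ == "__main__":
    ring = ring_peers(local_gpu=2, n_servers=16)
    print("same-rank ring crosses rails:", any(crosses_rails(a, b) for a, b in ring))
    print("GPU0 -> GPU3 crosses rails  :", crosses_rails((0, 0), (1, 3)))
```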

5.9 Other DC Network Topologies

  • HyperX (Ahn et al., SC 2009): Generalized hypercube; trade-off between dragonfly and fat-tree.
  • 3D Torus (Tofu-D, BlueGene): Each switch connects to 6 neighbors. Used in supercomputers but rare in commodity DCs.
  • Jellyfish (Singla et al., NSDI 2012): Random regular graph topology. Higher throughput than fat-tree for a given switch budget, but routing is harder. Academic mostly.
  • F10 (Liu et al., NSDI 2013): Fault-resilient symmetric fat-tree variant.

6. Tier 5 — HPC Fabrics

HPC fabrics target the largest supercomputers and AI training clusters where commodity Ethernet/IB still leave performance on the table. Today, three live ecosystems matter:

6.1 HPE Slingshot

Slingshot is the interconnect used in HPE Cray EX supercomputers (Frontier, El Capitan, Aurora). It is built on the Cassini NIC ASIC and the Rosetta switch ASIC.

| Generation | Year | Per-port BW | Topology | Deployed in |
|---|---|---|---|---|
| Slingshot 11 | 2022 | 200 Gbps Ethernet | dragonfly | Frontier (9408 nodes), Adastra, LUMI, El Capitan, Aurora |
| Slingshot 400 | 2024 (announced) | 400 Gbps Ethernet | dragonfly | next-gen Cray EX |

Cassini NIC: Cray-designed RDMA-capable NIC (Cray was acquired by HPE in 2019) with:

  • Adaptive routing (per-packet)
  • Selective congestion management (small-flow priority)
  • HPC-specific extensions over Ethernet: source routing, in-network telemetry, on-NIC reductions for collectives (similar to SHARP)

Slingshot adds HPC features to Ethernet, including:

  • Adaptive routing in the dragonfly to avoid hot-spot global links
  • Fine-grained per-flow buffer credits
  • Custom congestion control (not stock DCQCN)
  • Ethernet compatibility: still speaks 200/400 GbE to commodity NICs (so a Slingshot cluster can also host generic ML pods)

Deployed at scale in Frontier (first exascale system, Oak Ridge, 9408 EPYC nodes × 4 MI250X each), El Capitan (LLNL, ~11000 MI300A nodes), and others.

6.2 Fujitsu Tofu Interconnect D (TofuD)

Fujitsu's TofuD is the proprietary interconnect of Fugaku (RIKEN supercomputer, 158k A64FX nodes, #1 on the TOP500 from June 2020 through November 2021). 6D mesh/torus topology, no central switch.

Key features:

  • 6D structure: Each node has 10 links in a "TofuD unit" of 12 nodes (A64FX chips); units stack into a 6D mesh
  • 28 Gbps per lane × 2 lanes per link (6.8 GB/s per link), 10 links per node → 68 GB/s aggregate link bandwidth per node
  • Virtual 2D/3D mapping: Applications request a logical 2D or 3D subdomain; the OS maps onto the 6D physical topology to minimize hops
  • HW collectives: barrier and AllReduce-style reduction primitives implemented in the on-chip network interface (there are no standalone switches)
  • Multi-rail in software: MPI rank-to-link assignment optimizable per phase

TofuD's 6D torus has a much larger diameter than a fat-tree: worst-case node pairs in a 158k-node system are tens of hops apart, versus a handful of switch hops in a fat-tree. In exchange it needs no central switches and no expensive long optical cabling between groups. A great fit for stencil computations (CFD, weather modeling) where neighbors-only communication dominates.

6.3 Cray Aries, Gemini (Legacy)

  • Aries (Cray XC-series, 2013-2020): Dragonfly topology; first widely-deployed dragonfly. Used in Piz Daint, Cori, Theta, Trinity.
  • Gemini (Cray XE/XK-series, 2010-2014): 3D torus. Used in Titan (Oak Ridge), Hopper (NERSC).

Both are retired from production (last large Gemini system: Blue Waters, decommissioned 2021). Slingshot replaced Aries.

6.4 Intel Omni-Path → Cornelis Networks CN5000

Intel acquired QLogic's TrueScale InfiniBand business (including its PSM software stack) in 2012, evolved it into Omni-Path (OPA), but exited the business in 2019. The IP was acquired by Cornelis Networks (founded 2020 by Omni-Path veterans), which continues development as:

  • Omni-Path Express (CN5000): 400 Gbps per port, deployed in some DOE labs (LLNL, ANL) and HPC academic clusters.
  • Features: OPX libfabric provider (the successor to the PSM2 messaging stack), low-overhead RDMA.
  • Niche but active in HPC; not a major AI play.

6.5 IBM BlueGene Tree + Torus (Historical)

The BlueGene family (L/P/Q, 2004-2018) at LLNL used a 3D (BG/L, BG/P) or 5D (BG/Q) torus for nearest-neighbor traffic, plus a separate collective tree for reductions/broadcasts and a global interrupt/barrier network. Three physical networks for three traffic patterns. This pattern (separate network per traffic class) was efficient but expensive — modern systems consolidate via virtual channels on a unified fabric. BlueGene retired ~2019 (Sequoia decommissioned).

6.6 Anton 2 / Anton 3 (DE Shaw Research)

The Anton series are ASIC-based molecular-dynamics machines built by D. E. Shaw Research. Each Anton 2 chip integrates 64 specialized processing tiles plus a dedicated 3D-torus interconnect that runs molecular-dynamics specific kernels (PME-like FFTs, bond/non-bond force computations) at low latency. Per-link BW: ~5 Gbps × 6 directions per chip. Total system: 512 chips in 3D torus.

Anton 3 (announced 2021, SC21 paper): 64 tiles per ASIC at 7 nm, faster torus links, simulates 100+ µs of MD per day on multi-million-atom systems, far beyond any GPU cluster for this specific workload. The lesson is that for a fixed compute pattern (MD), a custom ASIC + custom topology beats general-purpose hardware by 50-100×.


7. Tier 6 — Software Stacks

7.1 libibverbs (Verbs API)

The core RDMA API on Linux, originally Mellanox/QLogic, now in rdma-core (https://github.com/linux-rdma/rdma-core). Header: <infiniband/verbs.h>. Object lifecycle:

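#include <infiniband/verbs.h>
#include <stdint.h>

// Open the first RDMA device (error checks omitted for brevity;
// buf, len, depth, remote_addr, remote_rkey are supplied by the caller)
struct ibv_device **devs = ibv_get_device_list(NULL);
struct ibv_context *ctx = ibv_open_device(devs[0]);
struct ibv_pd *pd = ibv_alloc_pd(ctx);
struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE |
                                              IBV_ACCESS_REMOTE_READ |
                                              IBV_ACCESS_REMOTE_WRITE);
struct ibv_cq *cq = ibv_create_cq(ctx, depth, NULL, NULL, 0);
struct ibv_qp_init_attr init_attr = {
    .send_cq = cq, .recv_cq = cq,
    .cap = { .max_send_wr = 64, .max_recv_wr = 64,
             .max_send_sge = 4, .max_recv_sge = 4 },
    .qp_type = IBV_QPT_RC,
};
struct ibv_qp *qp = ibv_create_qp(pd, &init_attr);

// transition: RESET → INIT → RTR (ready to receive) → RTS (ready to send)
struct ibv_qp_attr qp_attr = {
    .qp_state = IBV_QPS_INIT,
    .pkey_index = 0,
    .port_num = 1,
    .qp_access_flags = IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_WRITE,
};
ibv_modify_qp(qp, &qp_attr,
              IBV_QP_STATE | IBV_QP_PKEY_INDEX | IBV_QP_PORT | IBV_QP_ACCESS_FLAGS);
// ... (RTR / RTS transitions with remote_qpn, remote_lid)

// Describe the local source buffer, then post an RDMA WRITE
struct ibv_sge sge = {
    .addr = (uintptr_t)buf, .length = (uint32_t)len, .lkey = mr->lkey,
};
struct ibv_send_wr swr = {
    .wr_id = 1,
    .opcode = IBV_WR_RDMA_WRITE,
    .send_flags = IBV_SEND_SIGNALED,
    .wr.rdma.remote_addr = remote_addr,
    .wr.rdma.rkey = remote_rkey,
    .sg_list = &sge, .num_sge = 1,
};
struct ibv_send_wr *bad;
ibv_post_send(qp, &swr, &bad);

// Poll completion (ibv_poll_cq returns <0 on error, 0 if the CQ is empty)
struct ibv_wc wc;
int n;
while ((n = ibv_poll_cq(cq, 1, &wc)) == 0) { /* spin */ }
if (n < 0 || wc.status != IBV_WC_SUCCESS) { /* handle error */ }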

Cost of MR registration: ibv_reg_mr() pins pages, builds NIC translation tables, and (with a strict IOMMU) sets up IOMMU mappings. For a 1 GB region, this can take 10s of milliseconds. Hot-path apps register large MRs at startup and reuse them. ODP (On-Demand Paging) avoids the pinning by letting the NIC take a page fault on first access; the host driver resolves the fault and updates the NIC's translation tables. Latency penalty: at least ~5-10 µs per fault, and often far more in practice. Use for sparse access patterns over very large regions.

7.2 rdma-core, librdmacm

rdma-core: Single package containing libibverbs, librdmacm, libmlx5, etc. The reference RDMA userspace library.

librdmacm (RDMA Connection Manager): higher-level helpers for setting up RC connections. Modeled on BSD sockets: rdma_create_id(), rdma_resolve_addr(), rdma_connect(), rdma_accept(). Translates IP addresses to QP numbers + RDMA-specific routing. Used by NVMe-oF kernel target, Lustre, GPFS, ceph rdma-msgr, etc.

7.3 UCX — Unified Communication X

UCX (https://openucx.org/) is the unified communication stack used by most modern MPI implementations and available to NCCL through a network plugin.

Layers:

┌─────────────────────────────────┐
│ Applications  (MPI, NCCL, ...)  │
└─────────────────────────────────┘
            │
┌─────────────────────────────────┐
│ UCP   protocol layer            │  ← per-message protocol selection (eager/rendezvous), tag matching
└─────────────────────────────────┘
            │
┌─────────────────────────────────┐
│ UCT   transport layer           │  ← per-transport (verbs/RC, verbs/UD, DC, RDMA-CM, CUDA, ROCm, TCP)
└─────────────────────────────────┘
            │
┌─────────────────────────────────┐
│ UCS   services (config, mempool)│
└─────────────────────────────────┘

UCX automatically picks the fastest transport per peer (RC for a small number of peers, DC for many peers, CUDA IPC / GPU-Direct for GPU memory). UCX is used by:

  • NCCL: optional, via the UCX network plugin (NCCL's built-in IB verbs transport is the default)
  • Open MPI (4.x+) — via mca pml ucx
  • MVAPICH2-X
  • HPC-X (Mellanox stack)
  • Charm++, Legion, HPX

7.4 libfabric / OFI Providers

libfabric (OFI = OpenFabrics Interfaces) is an alternative high-level RDMA API maintained by the OpenFabrics Alliance. Different abstraction from UCX — more API-focused, less protocol-driven.

Providers (transports):

  • verbs — generic libibverbs
  • efa — AWS Elastic Fabric Adapter (custom RDMA-like transport over AWS-specific NIC)
  • psm2 / psm3 — Performance Scaled Messaging: psm2 for Omni-Path, psm3 for Intel Ethernet (RoCEv2)
  • cxi — Cray/HPE Slingshot (Cassini)
  • tcp — sockets fallback
  • sockets — UDP-based testing
  • shm — shared-memory (intra-node)
  • opx — Omni-Path Express
  • ucx — libfabric over UCX (interop)

libfabric is preferred by AWS, Cray/HPE, Intel stacks. UCX dominates NVIDIA + Mellanox stacks. The two interoperate, but in practice the choice is dictated by the vendor stack.

7.5 MPI — Comparison Table

| Implementation | Vendor | Primary transport plugin | Strengths | Weaknesses |
|---|---|---|---|---|
| Open MPI | Open consortium | UCX, libfabric, BTL | Most portable; works everywhere | Tuning surface; defaults rarely optimal |
| MPICH | ANL | libfabric (CH4), ch3 (legacy) | Reference impl; many forks (MVAPICH, Intel) | Less rich collective lib than NCCL |
| Intel MPI | Intel | libfabric | Tight x86 + OPX integration | Less common on AMD/ARM |
| MVAPICH2-X | OSU/NSF | UCX + verbs | InfiniBand specialist; GPU-aware | Less Ethernet/cloud support |
| HPC-X / NVIDIA Mellanox MPI | NVIDIA | UCX, SHARP, NCCL | Top performance on IB + NVLink | Vendor-tied |
| Cray MPI (MPICH-derived) | HPE | OFI/cxi for Slingshot | Slingshot specialist | Tied to Cray EX |

Collective algorithms (key MPI primitives):

| Collective | Naive | Better | Used when |
|---|---|---|---|
| Broadcast | flat (root → all, N msgs) | binomial tree O(log N) | Always (default) |
| AllReduce | flat (gather + scatter) | recursive doubling, Rabenseifner (split big msgs) | depends on msg size |
| AllGather | flat ring | Bruck (log N steps, longer per step) | small msgs |
| AllToAll | spread / Bruck | Bruck for small msgs, pairwise exchange for large | always tuned |
| Reduce_scatter | recursive halving | Rabenseifner | medium-large msgs |

Modern MPI implementations include adaptive selection: choose algorithm per (msg size, ranks, topology). MPI-4 added persistent collectives (MPI_Bcast_init + MPI_Start) — re-using a pre-planned schedule for repeated collectives, eliminating planning overhead.
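As one concrete instance of the "better" column, here is a sketch of a binomial-tree broadcast schedule rooted at rank 0. In each round, every rank that already holds the data forwards it to one rank that does not, so the set of holders doubles. This is an illustrative schedule generator, not code from any MPI implementation.

```python
def binomial_bcast_schedule(n_ranks: int) -> list:
    """Return a list of rounds; each round is a list of (sender, receiver) pairs."""
    rounds = []
    have = 1                      # ranks 0 .. have-1 currently hold the data
    while have < n_ranks:
        sends = [(src, src + have) for src in range(have) if src + have < n_ranks]
        rounds.append(sends)
        have *= 2                 # the set of holders doubles every round
    return rounds


if __name__ == "__main__":
    for rnd, sends in enumerate(binomial_bcast_schedule(8)):
        print(f"round {rnd}: {sends}")
```

A flat broadcast needs N-1 sequential sends from the root; the binomial tree finishes in ceil(log2 N) rounds.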

7.6 NCCL — NVIDIA Collective Communications Library

NCCL (https://github.com/NVIDIA/nccl) is the dominant GPU collective library for AI training. Key concepts:

  • AllReduce algorithm choice:
    • Ring: All-Reduce in 2(N-1) steps, each GPU sends M/N data per step. Time: 2(N-1)/N × M/B. Used for medium-to-large messages where bisection BW dominates.
    • Tree: Reduce up a binary tree (log N depth), broadcast down (log N depth). Total: 2 log N hops. Time: 2 log N × α + 2 M/B. Used for small messages where latency α dominates.
    • NVLS (NVLink SHARP): In-network reduction via NVSwitch ALUs. Each GPU sends its data once and receives the reduced result once, so time is approximately M/B, about half the ring's 2M/B. Used for HGX-H100+ and NVL72 where NVSwitch 3+ is present.
    • SHARP (Mellanox InfiniBand): Switch-side reduction in IB switches. Same idea as NVLS but on IB. Used in HGX nodes + IB scale-out.

NCCL dynamically picks the algorithm per (message size, topology, ranks).
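The ring and tree cost formulas above can be compared with a simple α-β model. This sketch adds a per-step latency term α to the ring (2(N-1) steps), which is an assumption of the example; the bandwidth and latency values are placeholders, not measurements.

```python
import math


def ring_allreduce_time(m_bytes: float, n: int, bw: float, alpha: float) -> float:
    """Ring: 2(N-1) steps, each moving M/N bytes per GPU."""
    return 2 * (n - 1) * alpha + 2 * (n - 1) / n * m_bytes / bw


def tree_allreduce_time(m_bytes: float, n: int, bw: float, alpha: float) -> float:
    """Tree: reduce up and broadcast down, 2*log2(N) hops, about 2M/B of data time."""
    return 2 * math.log2(n) * alpha + 2 * m_bytes / bw


def pick_algorithm(m_bytes: float, n: int, bw: float, alpha: float) -> str:
    ring = ring_allreduce_time(m_bytes, n, bw, alpha)
    tree = tree_allreduce_time(m_bytes, n, bw, alpha)
    return "ring" if ring < tree else "tree"


if __name__ == "__main__":
    N, BW, ALPHA = 64, 50e9, 5e-6      # 64 GPUs, 50 GB/s, 5 µs per hop (placeholders)
    for size in (1 << 10, 1 << 20, 1 << 30, 8 << 30):
        print(f"{size:>12d} B -> {pick_algorithm(size, N, BW, ALPHA)}")
```

Small messages favor the tree because the latency term dominates; large messages favor the ring because of its (N-1)/N bandwidth factor.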

  • Channels: Each AllReduce runs over several parallel "channels", which are independent ring/tree paths through the topology. More channels = more concurrent NVLink/IB flows = higher BW. Default 8-16; tuned via NCCL_MIN_NCHANNELS / NCCL_MAX_NCHANNELS.

  • Topology detection: At init, NCCL probes the system topology (PCIe layout, NVLink topology, NIC binding) and builds a tree representation. NCCL_TOPO_DUMP_FILE=topo.xml writes the detected topology to inspect.

  • Critical environment variables:

| Variable | Purpose |
|---|---|
| NCCL_DEBUG=INFO | Verbose logging including topology decisions |
| NCCL_DEBUG_SUBSYS=ALL | Per-subsystem logs (INIT, COLL, P2P, NET, GRAPH) |
| NCCL_TOPO_DUMP_FILE=topo.xml | Dump system topology XML |
| NCCL_IB_HCA=mlx5_0,mlx5_1 | Restrict NCCL to specific IB HCAs (e.g., rail-aware) |
| NCCL_IB_GID_INDEX=3 | RoCE v2 GID (RDMA over Ethernet; match VLAN/network) |
| NCCL_NET_GDR_LEVEL=PHB | Enable GPU-Direct RDMA threshold (PHB = same PCIe host bridge) |
| NCCL_P2P_DISABLE=1 | Disable GPU peer-to-peer transport (NVLink and PCIe P2P); debug only |
| NCCL_COLLNET_ENABLE=1 | Enable SHARP (in-network reduction) |
| NCCL_ALGO=Tree / Ring / NVLS | Force algorithm selection |
| NCCL_NCHANNELS_PER_PEER=N | Channels per peer link |

NCCL-tests repo (https://github.com/NVIDIA/nccl-tests) provides standard benchmarks. Run all_reduce_perf -b 8 -e 8G -f 2 -g 8 to test AllReduce bandwidth from 8 B to 8 GB across 8 GPUs.
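A small wrapper for the benchmark run above, combining it with the debug variables from the table. The binary path `./build/all_reduce_perf` assumes the default nccl-tests build layout, and `nccl_allreduce_bench` is an illustrative helper; the command is only constructed here, not executed.

```python
import os
import shlex


def nccl_allreduce_bench(gpus: int = 8, min_bytes: str = "8", max_bytes: str = "8G",
                         binary: str = "./build/all_reduce_perf"):
    """Build the environment and argv for an nccl-tests AllReduce sweep."""
    env = dict(os.environ)
    env.update({
        "NCCL_DEBUG": "INFO",                 # log topology and algorithm decisions
        "NCCL_TOPO_DUMP_FILE": "topo.xml",    # write the detected topology
    })
    cmd = [binary, "-b", min_bytes, "-e", max_bytes, "-f", "2", "-g", str(gpus)]
    return env, cmd


if __name__ == "__main__":
    env, cmd = nccl_allreduce_bench()
    print("would run:", shlex.join(cmd))
    # To execute on a GPU node: subprocess.run(cmd, env=env, check=True)
```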

7.7 RCCL (AMD) and Gloo

RCCL (https://github.com/ROCm/rccl): AMD's NCCL-compatible reimplementation for ROCm/MI series. API-compatible with NCCL but uses xGMI/Infinity Fabric + HIP RDMA primitives.

Gloo (Facebook/Meta): CPU and GPU collective library originally built for PyTorch when NCCL was less mature. Still used as a CPU-only fallback (e.g., parameter sharding) and on networks where NCCL doesn't work (older Ethernet, mixed-vendor). Slower than NCCL on GPU clusters.

7.8 DPDK and XDP

DPDK (Data Plane Development Kit): User-space, poll-mode driver framework. Bypasses the Linux kernel completely; the NIC is unbound from the kernel driver and bound to vfio-pci. DPDK PMD (poll mode driver) constantly polls the NIC RX rings from a dedicated core, eliminating interrupts. Achieves 30-40 Mpps (million packets per second) per core for 64-byte packets — orders of magnitude beyond kernel networking.
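
To put the per-core packet rate in context, the sketch below computes the theoretical Ethernet line rate in packets per second and how many polling cores it would take to keep up. The function names are illustrative, and the 35 Mpps/core figure is simply the midpoint of the range quoted above, not a measured value.

```python
def line_rate_mpps(link_gbps: float, frame_bytes: int = 64) -> float:
    """Maximum Ethernet packet rate in Mpps for a given frame size."""
    # On the wire each frame also costs 7 B preamble + 1 B SFD + 12 B inter-frame gap.
    wire_bytes = frame_bytes + 20
    return link_gbps * 1e9 / (wire_bytes * 8) / 1e6


def cores_needed(link_gbps: float, per_core_mpps: float) -> float:
    """Polling cores required to absorb minimum-size frames at line rate."""
    return line_rate_mpps(link_gbps) / per_core_mpps


if __name__ == "__main__":
    print(f"100 GbE, 64 B frames: {line_rate_mpps(100):.1f} Mpps")
    print(f"cores at 35 Mpps/core: {cores_needed(100, 35):.2f}")
```

At 64-byte frames a 100 GbE port peaks near 148.8 Mpps, so even at the quoted per-core rate a single port needs several dedicated cores.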

Use cases: software switches (OVS-DPDK), NFV, Click-style routing, 5G UPF, AI ingress.

XDP (eXpress Data Path): In-kernel, eBPF-based programmable packet processing. Hooks at the NIC driver before sk_buff allocation. Key points:

  • The program returns a verdict per packet: XDP_DROP, XDP_PASS, XDP_TX (send back out the same NIC), or XDP_REDIRECT (to another NIC or to userspace via AF_XDP)
  • Can run at 100+ Mpps on modern NICs
  • Used by Cilium, Cloudflare load balancer, Meta's Katran

AF_XDP: Userspace socket type that lets userspace receive packets via XDP_REDIRECT — combines kernel safety with userspace performance.

7.9 io_uring, SPDK (Storage Side)

(See io_uring_internals.md, vfio_internals.md for full coverage.)

  • io_uring: Async I/O via SQ/CQ rings, optionally with SQPOLL (a kernel thread polls the submission queue so the app avoids syscalls). For NVMe-oF clients, gives near-SPDK performance with a mainline kernel.
  • SPDK (Storage Performance Development Kit): User-space NVMe driver framework, the storage analog of DPDK. Used to build high-performance NVMe targets (vhost-user-blk, NVMe-oF target, blobfs).

7.10 GPU-Direct: RDMA, Storage, Magnum IO

GPU-Direct RDMA (GDR): NIC writes directly into GPU HBM via PCIe peer-to-peer, no CPU/host-memory bounce. NIC must be on same PCIe root complex (PHB) as GPU; IOMMU must allow P2P (or be set to passthrough). NCCL uses this transparently for IB/RoCE transports.

GPU-Direct Storage (GDS): NVMe-oF (or local NVMe) reads/writes go directly to GPU HBM. Path: NVMe → PCIe → GPU. Used heavily for large LLM checkpoint load/save (e.g., load Llama 3 70B weights from NFS into GPUs in seconds, not minutes).

Magnum IO (NVIDIA umbrella SDK): GDR + GDS + UCX optimizations + DALI (data loader). Used to design end-to-end I/O paths in DGX clusters.

7.11 NIXL — NVIDIA Inference Transfer Library (2024-2025)

NIXL (NVIDIA Inference Transfer Library, late 2024 / 2025) is NVIDIA's new abstraction layer for disaggregated LLM inference — moving KV-cache, model partitions, and intermediate activations between GPUs/nodes for inference systems like Dynamo, vLLM, TensorRT-LLM Serving.

Use case: in LLM serving, prefill (compute KV cache) and decode (autoregressive) have very different compute profiles. Disaggregated inference puts prefill on one pool of GPUs and decode on another, and ships KV-cache between them via NIXL. NIXL supports:

  • Multiple transports (NVLink, RoCE, IB)
  • Async fire-and-forget transfers
  • Tensor partitioning / sharding semantics
  • Integration with KV-cache prefix-sharing systems (Mooncake, vLLM)

Released as open source in 2025 alongside NVIDIA Dynamo; rapidly becoming the standard transport layer for inference disaggregation.


8. Tier 7 — Optical / Future

8.1 Silicon Photonics Fundamentals

Silicon photonics builds integrated photonic circuits (modulators, waveguides, and photodetectors) on silicon or SiGe substrates using CMOS-compatible processes. The two dominant modulator topologies:

Mach-Zehnder Modulator (MZM):

                   ┌─── arm 1 (active phase shift) ───┐
   Light in ─────┤                                    ├──── Light out (intensity ∝ cos²(Δφ/2))
                   └─── arm 2 (reference) ────────────┘

A Mach-Zehnder modulator splits the input light into two paths, applies an electrically controlled phase shift Δφ to one arm, and recombines them. High-speed silicon MZMs get that phase shift from the plasma-dispersion effect (carrier depletion or injection in a PN/PIN junction); thermo-optic heaters are far too slow for data and are used only to set the bias point. Output intensity is proportional to cos²(Δφ/2), so Δφ = π gives full extinction. Bandwidth: 50-100+ GHz on modern Si photonics. Power: ~mW per modulator. Used in commercial 400ZR/800ZR pluggables.
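
The transfer function above is easy to check numerically. This is a minimal sketch for an ideal, lossless, perfectly balanced MZM; real devices have insertion loss and finite extinction ratio, which are ignored here, and the function name is illustrative.

```python
import math


def mzm_transmission(delta_phi: float) -> float:
    """Fraction of input optical power at the output of an ideal balanced MZM."""
    return math.cos(delta_phi / 2) ** 2


for phi in (0.0, math.pi / 2, math.pi):
    print(f"delta_phi = {phi:.3f} rad -> T = {mzm_transmission(phi):.3f}")
```

Driving the arm between Δφ = 0 and Δφ = π swings the output between full transmission and extinction, which is the on-off keying case; PAM4 uses four intermediate phase levels on the same curve.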

Microring Modulator:

   Light in ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Light out
                            │
                          (ring resonator: small ring waveguide,
                           heater on top, voltage shifts ring resonance,
                           which absorbs / transmits the input)

A small (10-50 µm diameter) ring resonator coupled to a straight waveguide. When the ring is at resonance, it absorbs light at that wavelength (notch filter). Electrically tuning the resonance modulates input transmission. Much smaller and lower-power than MZM (~10s of µW per modulator), but narrow wavelength range (sensitive to temperature, requires active tuning). Used in dense WDM photonic chiplets (Ayar Labs, Lightmatter).

WDM / DWDM (Wavelength Division Multiplexing): Multiple optical signals at different wavelengths share one fiber. Per fiber:

  • CWDM (Coarse): 8-18 wavelengths, 20 nm spacing
  • DWDM (Dense): up to 96 wavelengths at 0.4-0.8 nm spacing (50/100 GHz grid)

A DWDM 80-channel link at 100 Gbps per wavelength = 8 Tbps per fiber. Long-haul WAN fibers commonly carry 10-40 Tbps via DWDM.
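
The grid spacings and the per-fiber capacity above follow from two small formulas. The sketch assumes a 1550 nm center wavelength (C band) and uses the approximation Δλ ≈ λ² × Δf / c; the helper names are illustrative.

```python
C = 299_792_458.0  # speed of light in vacuum, m/s


def grid_spacing_nm(grid_ghz: float, center_nm: float = 1550.0) -> float:
    """Wavelength spacing of a frequency grid: d_lambda = lambda^2 * d_f / c."""
    lam_m = center_nm * 1e-9
    return lam_m ** 2 * grid_ghz * 1e9 / C * 1e9  # metres back to nm


def fiber_capacity_tbps(channels: int, gbps_per_channel: float) -> float:
    """Aggregate capacity of one fiber carrying `channels` wavelengths."""
    return channels * gbps_per_channel / 1000.0


print(f"50 GHz grid  ~ {grid_spacing_nm(50):.2f} nm")
print(f"100 GHz grid ~ {grid_spacing_nm(100):.2f} nm")
print(f"80 x 100G    = {fiber_capacity_tbps(80, 100):.1f} Tbps")
```

The 50 GHz and 100 GHz grids come out at about 0.40 nm and 0.80 nm, matching the spacing range quoted for DWDM.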

8.2 Co-Packaged Optics (CPO)

The premise: at 1.6+ Tbps per port, electrical SerDes can no longer reach across a PCB to a pluggable optical module (the QSFP-DD/OSFP cage). The solution is co-packaged optics — placing the optical engine (laser, modulator, photodetector) directly on the switch package, with millimeter-scale electrical hops only.

Broadcom Tomahawk 5 / 6:

  • Tomahawk 5: 51.2 Tbps switch ASIC (2023), 64 × 800 GbE ports. Pluggable optics standard; CPO variant in development.
  • Tomahawk 6: 102.4 Tbps switch ASIC (2024-2025). 64 × 1.6 TbE ports. CPO variants announced for 2026.

NVIDIA Quantum-X Photonics / Spectrum-X Photonics (announced GTC 2025): NVIDIA's first co-packaged-optics switches. Quantum-X Photonics is the InfiniBand variant and Spectrum-X Photonics the Ethernet/UEC variant. The Quantum-X Photonics switch has 144 × 800 Gbps ports = 115.2 Tbps; Spectrum-X Photonics was announced in its own port configurations. Both remove the separate pluggable transceivers, which cuts optical-module cost and power.

TSMC COUPE: TSMC's Compact Universal Photonic Engine — a packaged photonic engine reference design (announced 2024) targeted at 1.6T+ switching. Available to ASIC partners.

8.3 400ZR / 800ZR — Coherent Pluggables

For metro/regional optical transport, coherent optics replaces direct-detection IM-DD.

| Standard | Year | Per-port rate | Modulation | Reach (typical, amplified link) | Form factor |
| --- | --- | --- | --- | --- | --- |
| 100G-ZR | 2014 | 100 Gbps | DP-QPSK | 80 km | C Form Pluggable |
| 400ZR | 2020 | 400 Gbps | DP-16QAM | 120 km | QSFP-DD |
| 400ZR+ | 2021 | 400 Gbps | DP-16QAM with FEC enhancements | 500+ km | QSFP-DD |
| 800ZR | 2024 | 800 Gbps | DP-16QAM at higher baud or DP-64QAM | ~80-120 km | QSFP-DD800 / OSFP |
| 1.6T ZR | future | 1.6 Tbps | DP-64QAM or DP-256QAM probabilistically shaped | ~80-120 km | OSFP-XD |

DSP-based coherent: Modern coherent pluggables have an integrated DSP that performs:

  • Dispersion compensation (chromatic + polarization-mode)
  • Polarization tracking
  • Phase noise compensation
  • Soft-decision FEC (LDPC + outer staircase)

These DSPs are sophisticated ASICs (3-5 nm) consuming 10-20 W and contributing most of the pluggable's cost.

Use cases: 400ZR/800ZR replaces traditional "transponder + line card" architectures. Datacenter Interconnect (DCI) for metro DCs, cloud regions, hyperscale edge nodes. Some hyperscalers (Google, Microsoft, Meta) routinely use 400ZR for DCI between buildings within a metro region.

8.4 Optical Circuit Switching (OCS)

OCS = a switch that routes light entirely in the optical domain (no electrical conversion). Slow to reconfigure (1-100 ms), but an established circuit is transparent to data rate and modulation, so its bandwidth is limited only by the transceivers at the endpoints, and power per bit is very low.

Representative OCS systems (Google Apollo / Apollo 2, Microsoft Sirius, Lightning):

| Generation | Year | Switching tech | Reconfiguration time | Use case |
| --- | --- | --- | --- | --- |
| Apollo (Liu et al., SIGCOMM 2021) | 2018-2021 | MEMS mirrors | ~10 ms | Spine layer of Jupiter datacenter network |
| Apollo 2 | 2022 | MEMS, larger radix | ~10 ms | Jupiter Evolving (Poutievski et al., SIGCOMM 2022) |
| Sirius (Ballani et al., SIGCOMM 2020) | research | Tunable laser + AWG (passive optical) | sub-µs | Microsoft / academic prototype |
| Lightning / Lightning-2 | 2024 announcements | OCS for AI training | µs-scale | Specialized for AI workloads |

Why OCS in datacenters? Bursty traffic patterns: 80% of bytes flow between 20% of node pairs. If you can dynamically configure circuit-switched links to the hot pairs, you save 4-8× in spine bandwidth vs always-on packet-switched bisection. Google's Jupiter uses OCS at the spine to dynamically reroute capacity to where it's needed.

Google Apollo MEMS: Tiny micromirror arrays (~256-radix per OCS) steer light from input port to output port. Reconfiguration takes ~10 ms (mirror settling time). Apollo-class OCS chassis are deployed in tens-of-petabit-per-second Google networks.

Lightning is a newer OCS class targeting AI training topology reshaping — letting the same physical cabling host both a fat-tree (for one training job) and a dragonfly (for another) by reconfiguring OCS.

8.5 Optical Chiplets / Photonic Fabrics

The cutting edge: instead of co-packaging optics with a single switch ASIC, the goal is photonic chiplets that any chiplet vendor can drop in.

Lightmatter Passage:

  • An "active photonic interposer" — silicon substrate with both electrical chiplets (compute) and integrated photonic transceivers + waveguides on top.
  • Lets you place 8-16 compute chiplets on a single interposer with photonic interconnect between them (sub-pJ/bit energy, multi-Tbps per chiplet pair).
  • Targeted at AI training where chip-to-chip bandwidth bottlenecks scaling. Production samples mid-2024.

Ayar Labs TeraPHY:

  • CMOS chiplet that integrates WDM modulators and detectors (fed by an external laser source) at 8 × 256 Gbps = 2 Tbps per chiplet
  • UCIe-compatible electrical interface to the host die
  • Demoed in Intel Sapphire Rapids systems and Cornelis Networks switches.

Celestial AI Photonic Fabric:

  • Hierarchical photonic switch with explicit "Photonic Fabric" abstraction layer.
  • Targets AI training/inference at hyperscale; partnership with AMD, Samsung announced 2024-2025.

These are all in early production. By 2027-2028, expect photonic chiplets to be a normal part of high-end AI accelerator packaging — much as HBM became normal in 2020.


9. RDMA Semantics Deep Dive

RDMA verbs are the lingua franca of high-performance networking. Understanding the precise semantics is critical for correctness and performance.

9.1 Two-Sided: SEND / RECV

Like BSD sockets — both ends post WRs.

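// Sender:
struct ibv_send_wr swr = { .opcode = IBV_WR_SEND, /* ... */ };
struct ibv_send_wr *bad_swr = NULL;       // on failure, points at the WR that was rejected
if (ibv_post_send(qp, &swr, &bad_swr))    // returns 0 on success, an errno value on failure
    /* handle error */;

// Receiver: MUST have a RECV posted in advance:
struct ibv_recv_wr rwr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
struct ibv_recv_wr *bad_rwr = NULL;
if (ibv_post_recv(qp, &rwr, &bad_rwr))
    /* handle error */;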

If no RECV is posted when the SEND arrives → RNR NAK (Receiver Not Ready), and the sender retries after a backoff. Tune min_rnr_timer and rnr_retry in the QP attributes (retry_cnt is a different knob that covers lost packets and ACK timeouts).

Use for: message passing, control plane, RPC. Costs both ends a verb posting per message.

9.2 One-Sided: WRITE

Sender pushes data into receiver's pre-registered memory without involving the receiver's CPU.

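struct ibv_send_wr swr = {
    .opcode = IBV_WR_RDMA_WRITE,
    .wr.rdma.remote_addr = remote_buf_addr,  // peer's buffer address, exchanged out of band
    .wr.rdma.rkey = remote_rkey,             // peer's rkey, exchanged out of band
    .sg_list = &sge,
    .num_sge = 1,
    .send_flags = IBV_SEND_SIGNALED,
};
struct ibv_send_wr *bad_wr = NULL;           // set by the provider if posting fails
ibv_post_send(qp, &swr, &bad_wr);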

The receiver's CPU is unaware until it polls memory or receives an out-of-band signal. Use cases:

  • Distributed shared memory
  • KV-cache transfer in disaggregated inference
  • Log streaming (receiver polls a tail counter)
  • Cache eviction propagation

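WRITE_WITH_IMM is a variant that also carries a 32-bit "immediate" value delivered in a CQE on the receiver side, combining one-sided data placement with a notification. Unlike a plain WRITE, it consumes a posted RECV on the receiver, so the receiver must keep RECVs posted.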

9.3 One-Sided: READ

The requester pulls data from a remote registered memory region.

struct ibv_send_wr swr = { .opcode = IBV_WR_RDMA_READ, /* ... */ };
ibv_post_send(qp, &swr, NULL);

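READ needs a full round trip (request out, data back) before the requester has the data, whereas WRITE data lands at the target after a one-way trip, so for the same payload size WRITE delivers data in roughly half the time. But READ is sometimes the right semantic: "what is the current value at address X?"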

9.4 ATOMIC: FetchAdd / CmpSwap

8-byte atomic operations on remote memory:

// FetchAdd: atomically fetch *remote and add value
struct ibv_send_wr swr = {
    .opcode = IBV_WR_ATOMIC_FETCH_AND_ADD,
    .wr.atomic.remote_addr = addr,
    .wr.atomic.rkey = rkey,
    .wr.atomic.compare_add = value_to_add,
    // ...
};

// CmpSwap: atomically compare and swap *remote
swr.opcode = IBV_WR_ATOMIC_CMP_AND_SWP;
swr.wr.atomic.compare_add = expected;
swr.wr.atomic.swap = new_value;

On most NICs, atomics are slow. They take a separate path inside the NIC (vs the bulk WRITE/READ engine) and may serialize across the link. Throughput: ~1-10 M ops/s vs ~100 M ops/s for WRITEs. Used in lock-free distributed shared memory (FaRM, RAMCloud, LITE) carefully.

ConnectX-6/7 supports enhanced atomics with better throughput, but they're still not free.

9.5 Signaled vs Unsignaled Completions

When you post a SEND with IBV_SEND_SIGNALED, the NIC generates a CQE when the WR completes. The CQE consumes a CQ slot and a poll cycle.

For batching, you can post many WRs unsignaled (send_flags = 0) and only signal every Nth. The signaled CQE is a "synchronization point" — it confirms all prior unsignaled WRs also completed. Reduces CQ pressure by N×.

Modern apps signal every 16-64 WRs.

9.6 CQ Moderation

To reduce interrupt rate (or polling overhead), the NIC can batch CQEs:

  • CQ events (interrupt-driven): NIC generates an interrupt only after N CQEs or T µs.
  • CQ polling: App busy-polls with ibv_poll_cq() and sees each CQE as soon as the NIC writes it; moderation only changes how often completion events (interrupts) are raised.

Tune via ibv_modify_cq() with cq_count (CQEs per moderation) and cq_period (max µs between).

9.7 Memory Region Cost

Each ibv_reg_mr():

  1. Pins all pages of the region (get_user_pages())
  2. Builds NIC translation tables (Memory Translation Table, MTT) — one entry per 4 KB page
  3. Programs the IOMMU (if active)
  4. Returns lkey + rkey (32-bit each)

For a 1 GB MR with 4 KB pages: 256K MTT entries; takes 10-50 ms to register. At scale, never register in the hot path — pre-register all working memory at init.
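
The entry count above is a one-line calculation. The sketch below reproduces it and, as an assumption not stated in the text, shows how the count would shrink if the region were backed by 2 MiB huge pages and the NIC could map one entry per huge page.

```python
def mtt_entries(region_bytes: int, page_bytes: int = 4096) -> int:
    """Translation entries needed to map a region: one per page, rounded up."""
    return -(-region_bytes // page_bytes)  # ceiling division


GiB = 1 << 30
print(mtt_entries(1 * GiB))            # 4 KiB pages
print(mtt_entries(1 * GiB, 2 << 20))   # 2 MiB huge pages (assumed one entry each)
```

This prints 262144 (the 256K entries quoted above) and 512, which is one reason large registered buffers are often placed on huge pages.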

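ODP (On-Demand Paging): Replaces pinning with page faults. Configure with IBV_ACCESS_ON_DEMAND. The NIC keeps its own translation entries and fills them lazily: when it touches a page with no valid entry, it raises a page-fault event to the driver, which faults the page in through the kernel and updates the NIC's tables; when the kernel unmaps or migrates a page, an MMU notifier invalidates the NIC's entry. A fault is serviced in software and costs far more than a translation hit, so the first touch of each page is slow. Use for sparse access on large MRs (e.g., 1 TB sparse data); avoid for dense streaming.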

9.8 RNR NAK and Retry Tuning

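When the receiver doesn't have a posted RECV (or its RX buffer is exhausted), it sends an RNR NAK. The sender then waits for the min_rnr_timer value advertised by the receiver (encoding 0 = 655.36 ms!) and retries, up to rnr_retry times. An rnr_retry of 7 is special: it means retry indefinitely.

Misconfigured RNR causes multi-second connection stalls (for example, 6 retries × 655 ms ≈ 4 s). Always set min_rnr_timer = 12 (640 µs) or so, and rnr_retry = 7 (retry forever) if the application would rather wait than see the QP move to the error state.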

Network-loss retries are governed by retry_cnt (default 7) and timeout (default 14 = ~67 ms). Tune lower (8 ms) for low-latency apps.
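
The QP timer fields are encoded values rather than durations, which is easy to get wrong. The sketch below decodes the local ACK timeout (4.096 µs × 2^timeout, with 0 meaning wait forever) and estimates the worst-case RNR stall. Only the two min_rnr_timer encodings quoted in this section are included, and the stall estimate is an approximation (timer × retries); the helper names are illustrative.

```python
def local_ack_timeout_ms(timeout: int) -> float:
    """QP attr `timeout` encodes 4.096 us * 2**timeout; 0 means wait forever."""
    if timeout == 0:
        return float("inf")
    return 4.096e-3 * (2 ** timeout)


# Only the two min_rnr_timer encodings quoted in the text, in milliseconds.
MIN_RNR_TIMER_MS = {0: 655.36, 12: 0.64}


def worst_case_rnr_stall_ms(min_rnr_timer: int, rnr_retry: int) -> float:
    """Approximate time spent waiting on RNR NAKs before the QP errors out."""
    if rnr_retry == 7:  # 7 means retry indefinitely
        return float("inf")
    return MIN_RNR_TIMER_MS[min_rnr_timer] * rnr_retry


print(f"timeout=14 -> {local_ack_timeout_ms(14):.1f} ms")
print(f"timeout=11 -> {local_ack_timeout_ms(11):.1f} ms")
print(f"min_rnr_timer=0, rnr_retry=6 -> {worst_case_rnr_stall_ms(0, 6) / 1000:.1f} s")
```

An encoding of 11 gives roughly the 8 ms suggested above for low-latency applications, and the last line shows where multi-second RNR stalls come from.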

9.9 DCT (Dynamic Connected Transport) for Scale

Each RC QP holds full state (~ several KB of NIC SRAM per QP). With N peers, you need N-1 QPs per process. At 10,000 processes, that's 10⁸ QPs total in the cluster — blowing NIC QP-context memory and consuming hundreds of MB of host pages.

DCT keeps a small pool of "DC initiator" and "DC target" QPs on each NIC. When you want to send to a new peer, you don't allocate a new QP — you reuse an existing initiator QP, providing the target's DCT key + GID in the WR. The NIC dynamically re-targets the QP.

Trade-off: DCT has slightly higher per-message latency (additional state setup on first message to a new peer), but at scale it's the only way. UCX with the dc transport uses this by default.


10. Lossless Fabric Tuning

PFC + ECN tuning is dark magic. The fundamentals:

10.1 PFC Headroom

When a switch ingress port sends a PFC PAUSE upstream, traffic does not stop instantly: the port must still absorb everything already in flight on the wire plus whatever the upstream device transmits before it processes the PAUSE. The minimum headroom buffer reserved for this is:

   PFC_headroom = (max_packet_size) + (cable_RTT × link_rate / 8)
  • For a 100 m link at 100 Gbps: RTT ≈ 1 µs → headroom ≈ 12.5 KB + 1500 B MTU = ~14 KB.
  • For a 200 m fiber at 400 Gbps: RTT ≈ 2 µs → ~100 KB.

Per-port, per-priority. Multiply this across 8 priorities × 64 ports × 400 Gbps: a modern switch needs ~50-200 MB of buffer just for PFC headroom.
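
The headroom formula and the two examples above can be reproduced with a few lines. This sketch uses the simplified formula from this section only; production sizing also accounts for PAUSE frame processing delay and a second in-flight MTU, which are left out here. Function names are illustrative.

```python
def pfc_headroom_bytes(link_gbps: float, rtt_us: float, mtu_bytes: int = 1500) -> float:
    """Headroom = one max-size packet + bytes in flight during one cable RTT."""
    in_flight_bytes = link_gbps * rtt_us * 1000 / 8  # Gbps * us = kilobits
    return mtu_bytes + in_flight_bytes


def switch_headroom_mb(ports: int, priorities: int, link_gbps: float,
                       rtt_us: float, mtu_bytes: int = 1500) -> float:
    """Total headroom if every port reserves it for every priority."""
    per_queue = pfc_headroom_bytes(link_gbps, rtt_us, mtu_bytes)
    return ports * priorities * per_queue / 1e6


print(pfc_headroom_bytes(100, 1.0))          # 100 m at 100 Gbps
print(pfc_headroom_bytes(400, 2.0))          # 200 m at 400 Gbps
print(switch_headroom_mb(64, 8, 400, 2.0))   # 64 ports x 8 priorities
```

The last figure is about 52 MB, the low end of the range above. In practice most deployments enable PFC on only one or two priorities, which cuts the reservation proportionally.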

10.2 ECN Watermarks (Kmin, Kmax)

Switch marks ECN-CE on packets when queue length > Kmin (probability ramps from 0 at Kmin to 100% at Kmax). DCQCN at the endpoint then throttles based on ECN.

Rule of thumb:

  • Kmin = 10-15% of buffer
  • Kmax = 50-80% of buffer

The relationship (Kmax - Kmin) / link_rate defines the ECN sensitivity. Smaller window → faster reaction, more transient throughput loss. Larger → slower reaction, queueing latency rises.
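
The marking ramp described above can be written as a small function. This is a sketch of the RED-style curve only, not of any particular switch's implementation; the `pmax` parameter (the probability reached at Kmax) is an addition for illustration and defaults to 1.0 to match the text.

```python
def ecn_mark_probability(queue_bytes: float, kmin: float, kmax: float,
                         pmax: float = 1.0) -> float:
    """Probability that a packet is marked ECN-CE at a given queue depth."""
    if queue_bytes <= kmin:
        return 0.0
    if queue_bytes >= kmax:
        return 1.0  # above Kmax every packet is marked
    return pmax * (queue_bytes - kmin) / (kmax - kmin)


KB = 1000
for depth_kb in (50, 100, 300, 500, 600):
    p = ecn_mark_probability(depth_kb * KB, kmin=100 * KB, kmax=500 * KB)
    print(f"queue {depth_kb:>3} KB -> mark probability {p:.2f}")
```

Narrowing the gap between `kmin` and `kmax` steepens the ramp, which is the faster-but-twitchier reaction described above.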

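The DCQCN paper (Zhu et al., SIGCOMM 2015, from Microsoft) reports the thresholds used in its deployment: Kmin = 5 KB, Kmax = 200 KB, with a maximum marking probability of 1%, on 40 Gbps links. Faster links generally need proportionally larger thresholds.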

10.3 Buffer Architecture

  • Cut-through: Switch starts forwarding a packet as soon as the header is parsed (typically 96-128 bytes in). Lower latency (~few hundred ns). Used by IB, modern Ethernet HPC switches.

  • Store-and-forward: Switch buffers the entire packet, validates FCS, then forwards. Higher latency (depends on MTU). Used historically by some Ethernet for FCS check; modern switches usually do cut-through.

  • Shared buffer: All ports / priorities share one big buffer pool, partitioned dynamically.

  • Dedicated buffer: Each port / priority has a fixed slice.

Most modern switches (Broadcom Trident/Tomahawk, Cisco Silicon One, Mellanox Spectrum) use shared buffer with dynamic allocation; works well for bursty AI workloads.

10.4 PFC Pause Storms and Deadlock

Pause storms: Receiver pauses sender → sender's switch buffer fills → it pauses its own ingress → propagates back upstream. Single congested receiver can stall an entire pod.

Deadlock: Cyclic dependency where switch A pauses B, which is waiting on C, which is waiting on A. Real example: a fault on one server creates an unending PAUSE on its ingress; that PAUSE propagates back; the chain becomes a deadlocked cycle.

Mitigations:

  • PFC watchdog: Detect a port that's been PAUSEd for too long (>200 ms typically), drop packets on that priority/port until it clears.
  • Reduce reliance on PFC: Use IRN-style selective retransmit, or run the lossy fabric design (UEC).

The Guo et al. SIGCOMM 2016 "RDMA over Commodity Ethernet at Scale" paper from Microsoft is the canonical reference on PFC pain.

10.5 Victim Flows

When PFC fires for one congested flow, it pauses the entire priority — all flows in that priority class stop. Innocent flows get caught in the pause; they're "victim flows." Mitigations:

  • Run latency-sensitive flows in a separate, less-congested priority
  • Use multi-queue + per-flow scheduling (BlueField DPU)
  • Move to lossy + IRN / UEC

11. Tail Latency Pathology

11.1 Incast — The Synchronized Many-to-One Pattern

Classic pattern: N senders simultaneously respond to one receiver's request (typical of MapReduce, distributed indexes, AllReduce). N senders × M bytes each → N × M bytes arrive near-instantaneously at the switch egress port facing the receiver, and that port's buffer overflows. PFC fires; or worse, packets are dropped and TCP falls back to retransmission timeouts (Linux RTO_min = 200 ms).

Solutions:

  • DCTCP/DCQCN: ECN signaling spreads the burst over time.
  • Smaller request granularity: Send only 64 KB chunks, not 1 MB.
  • Application-level rate limiting: HDFS uses staggered requests.
  • Aggregator pattern: Tree-reduce instead of flat-reduce.

11.2 Microbursts

Bursts of 10s-100s of packets arriving in well under 1 ms, too short for ECN to react before buffer overflow. Caused by segmentation offload (TSO/GSO hands the NIC large segments that leave as back-to-back packets at line rate), MPI message bursts, etc.

Solutions: Deep switch buffers, faster ECN reaction (lower Kmin), pacing at the sender (Linux fq qdisc pacing, or NIC-level pacing in ConnectX-6+).

11.3 ECMP Hash Collisions

Traditional 5-tuple ECMP picks an output path by hashing (src_IP, dst_IP, src_port, dst_port, proto). If many flows happen to hash to the same uplink, you get load imbalance — half your spine links idle while others congest.
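
A small simulation makes the imbalance concrete. The sketch below hashes synthetic 5-tuples onto uplinks; CRC32 stands in for the switch ASIC's hash, which is vendor-specific, and the addresses, port range, and flow count are made up for illustration.

```python
import random
import zlib
from collections import Counter


def ecmp_uplink(flow: tuple, n_uplinks: int) -> int:
    """Pick an uplink by hashing the 5-tuple (CRC32 is a stand-in hash)."""
    key = "|".join(map(str, flow)).encode()
    return zlib.crc32(key) % n_uplinks


def uplink_load(n_flows: int, n_uplinks: int, seed: int = 0) -> Counter:
    """Count how many flows land on each uplink."""
    rng = random.Random(seed)
    load = Counter()
    for _ in range(n_flows):
        # Same src/dst pair and RoCEv2 dst port; only the source port varies.
        flow = ("10.0.0.1", "10.0.1.1", rng.randrange(1024, 65536), 4791, "udp")
        load[ecmp_uplink(flow, n_uplinks)] += 1
    return load


load = uplink_load(n_flows=16, n_uplinks=8)
print("flows per uplink:", [load[i] for i in range(8)])
print("busiest uplink:", max(load.values()), "flows; ideal is", 16 // 8)
```

With few, large flows (the typical AI training pattern) there is little statistical averaging, so some uplinks usually carry several flows while others sit idle.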

Solutions:

  • Adaptive routing (Slingshot, IB Quantum-2): switch picks output based on actual link utilization
  • Per-packet ECMP / packet spraying (UEC): every packet of a flow can take a different path; reorder at the endpoint
  • WCMP (weighted-cost multipath): paths are weighted by their available capacity, so asymmetric topologies or partial link failures do not overload the thinner paths
  • Symmetric hashing + multi-channel transport (UCX, NCCL channels): give each flow N "sub-flows" with different ports, spreading more uniformly

11.4 Head-of-Line Blocking

A single slow flow blocks others behind it in the same virtual channel / queue. IB uses service levels (SL) + virtual lanes (VL) to separate flows: each VL has its own buffer and its own credit-based flow-control state. Up to 16 VLs per port (VL0-VL14 for data, VL15 reserved for subnet management). SL-to-VL mapping is configurable.

11.5 Pause Propagation

Already discussed in §10.4. Practical advice: monitor the pause counters exposed by ethtool -S (on mlx5, for example, rx_prio<N>_pause, tx_prio<N>_pause, and the matching *_pause_duration counters; names vary by driver). Pause time > a few hundred ms per port = serious issue.

11.6 Solutions Summary

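| Pathology | Adaptive routing | Packet spraying | Receiver-driven CC | DCQCN tuning |
| --- | --- | --- | --- | --- |
| Incast | partial | yes | best | partial |
| Microbursts | yes | yes | partial | partial (slow) |
| ECMP collisions | best | best | n/a | n/a |
| HoL blocking | partial | yes | partial | n/a |
| Pause storms | n/a | n/a (no PFC) | best | n/a |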

12. Topology-Aware Collective Scheduling

12.1 Rail-Optimized AllReduce

In a rail-optimized network (NIC i on every node connects to leaf-i), an AllReduce across N nodes uses the same rank-i NIC on every node for all communication. The traffic stays entirely within "rail i" — never crosses rails. Benefits:

  • No cross-rail congestion
  • ECMP hash collisions impossible (one path per rail)
  • Failures in rail j don't impact rail i

NCCL detects rail topology via PCIe + IP/IB device info; set NCCL_IB_HCA=mlx5_0,mlx5_1,... per rail.

12.2 NCCL Channels and Hierarchical Reductions

For an AllReduce across 1024 nodes × 8 GPUs/node:

  1. Intra-node reduce: 8 GPUs per node reduce locally over NVLink (fast)
  2. Inter-node ring/tree: 1024 logical reducers across nodes via IB
  3. Intra-node broadcast: Result distributed back to 8 GPUs via NVLink

This 2-level hierarchy uses NVLink's massive BW for the 8-way reduction (cheap) and saves IB bandwidth for the 1024-way step (expensive). NCCL does this automatically.
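
The three steps above can be sketched on plain Python lists to show that the two-level scheme produces the same result as a flat sum. This is a toy model of the data flow only; it does not model NVLink, IB, or how NCCL actually chunks and pipelines the work.

```python
def hierarchical_allreduce(grads):
    """grads[node][gpu] is a list of floats; returns the same shape, fully summed."""
    # Step 1: intra-node reduce (NVLink in the text): one partial sum per node.
    node_sums = [[sum(vals) for vals in zip(*node)] for node in grads]
    # Step 2: inter-node reduce (IB in the text) across the per-node partials.
    total = [sum(vals) for vals in zip(*node_sums)]
    # Step 3: intra-node broadcast of the result back to every GPU.
    return [[list(total) for _ in node] for node in grads]


grads = [
    [[1.0, 2.0], [3.0, 4.0]],      # node 0, two GPUs
    [[10.0, 20.0], [30.0, 40.0]],  # node 1, two GPUs
]
out = hierarchical_allreduce(grads)
print(out[0][0])  # every GPU ends with the same vector
```

Only the per-node partial sums cross the inter-node step, so the slower fabric carries one vector per node instead of one per GPU.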

12.3 SHARP / NVLS — In-Network Reduction

SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) is Mellanox/NVIDIA's switch-side reduction for InfiniBand. The IB switch contains reduction ALUs that combine packets from N children into one output (sum, max, min, AND, OR, etc.). Used during AllReduce, gradient aggregation.

Bandwidth model for ring AllReduce: 2(N-1)/N × M/B. With SHARP/NVLS in-network reduction: M/B + α log N. For N=72 (NVL72), M=64 GB, B=900 GB/s (per-rank NVLink BW):

  • Ring: 2 × 71/72 × 64 GB / 900 GB/s ≈ 140 ms
  • NVLS: 64 GB / 900 GB/s + α log 72 ≈ 71 ms + tiny α

Halves AllReduce time. SHARP v3 and NVLS support multiple data types (FP16, BF16, FP32) and multiple ops; collectives are pre-compiled into switch flow tables.


13. Cache Coherence on Fabric

13.1 MESI / MOESI / MESIF Refresher

(Covered in superscalar_ooo_cpu.md §10.) Briefly:

  • MESI (Modified, Exclusive, Shared, Invalid): Default x86 / Arm coherence.
  • MOESI (adds Owned): AMD. The "Owned" state lets a dirty line be shared without writeback to memory — Owner is responsible for supplying data on miss.
  • MESIF (adds Forward): Intel. Designates a single "Forward" copy that supplies on miss; eliminates redundant supplies.

13.2 Snoop Filter vs Directory

  • Snoop filter (small directory): A small inclusive cache at the LLC remembers which lines might be present in any L1/L2 in the system. Eliminates broadcast snoops to coherent agents that obviously don't have the line. Used on Intel/AMD CCDs.
  • Directory (full): Per cache line, store a bitmap of which agents have a copy. Scales to thousands of agents (used in IBM POWER, large NUMA, CXL fabric mode). Memory overhead: O(N_agents bits per line).
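
The directory's memory overhead is simple to quantify. The sketch assumes a plain full bit-vector directory with 64-byte cache lines and ignores state bits; real designs usually compress the sharer list (coarse vectors, limited pointers), which this does not model.

```python
def directory_overhead(n_agents: int, line_bytes: int = 64) -> float:
    """Directory bits per data bit: one presence bit per agent per cache line."""
    return n_agents / (line_bytes * 8)


for agents in (8, 64, 512):
    print(f"{agents:>4} agents -> {directory_overhead(agents):.1%} overhead")
```

At 64 agents the bitmap already costs 12.5% of the tracked memory, and at 512 agents it equals the data it tracks, which is why large systems avoid a naive full bitmap.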

In CXL 3.0 fabric mode, the directory is distributed — each home agent (CXL switch tier) tracks the lines it owns. Snooping is targeted, not broadcast.

13.3 CXL.cache HDM-DB (Back-Invalidation)

In CXL 2.0 / HDM-H (host-managed coherence), the device's caches are tracked by the host. The host must allocate directory space proportional to total device cache.

In CXL 3.0 HDM-DB, the device tracks its own caches. When the device wants to write a line, it issues a Back-Invalidation (BI) to evict any cached copies on the host:

Host CPU                       CXL Type-2 Device
   │                                  │
   │  LD address X (.mem read) ──────►│  (line cached on host LLC)
   │                                  │
   │                            (device wants to write X)
   │ ◄────────────── BI: invalidate X │
   │ (drops host cached copy)         │
   │ ───────────── BI Ack ───────────►│
   │                                  │
   │                            (device writes X)
   │                                  │
   │  LD address X (.mem read) ──────►│  (returns new value)

This pattern works for any Type-2 device or fabric-attached coherent memory pool. Critical for CXL 3.0 disaggregated coherent memory pools (GFAM): the memory device tracks all host caches; hosts only react to BI messages.


14. Bandwidth Math, Bisection BW, Oversubscription

14.1 Bisection Bandwidth — Definition

Bisection BW of a network = minimum bandwidth across any cut that divides the network into two halves of equal size. It's the worst-case bandwidth for "half the nodes talk to the other half" patterns (which is what AllToAll, ring AllReduce on a partition, etc. require).

For a non-blocking Clos network with N leaves, S spines, and k leaf uplinks: bisection BW = N × k × link_rate / 2 (half of all uplinks cross any cut). A "full bisection" fat tree has total uplink BW = downlink BW at every tier.

14.2 Oversubscription Ratios

Most production datacenter networks are oversubscribed: a leaf has less uplink BW than downlink BW. Common ratios:

| Ratio | Use case |
| --- | --- |
| 1:1 (full bisection) | AI training, HPC |
| 2:1 | High-end DC, latency-sensitive |
| 3:1 | Cost-balanced DC (typical) |
| 4:1 - 8:1 | Cost-optimized, web-tier traffic |

A 3:1 oversubscription means cross-rack traffic gets 1/3 the bandwidth of intra-rack. AllReduce hitting that boundary suffers; rack-locality of training jobs is critical.
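
The ratio is just downlink capacity divided by uplink capacity at the leaf. The port counts below are a hypothetical leaf configuration chosen to land on 3:1, and the helper names are illustrative.

```python
def oversubscription(down_ports: int, down_gbps: float,
                     up_ports: int, up_gbps: float) -> float:
    """Leaf oversubscription ratio: total downlink BW / total uplink BW."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)


def cross_rack_share_gbps(server_gbps: float, ratio: float) -> float:
    """Per-server cross-rack bandwidth when every server sends off-rack at once."""
    return server_gbps / ratio


ratio = oversubscription(down_ports=48, down_gbps=100, up_ports=4, up_gbps=400)
print(f"oversubscription = {ratio:.1f}:1")
print(f"worst-case cross-rack BW per server = {cross_rack_share_gbps(100, ratio):.1f} Gbps")
```

With 48 × 100G down and 4 × 400G up, each server's 100 Gbps NIC is guaranteed only about a third of that across racks under full load.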

14.3 AllReduce Time Models

For a ring AllReduce on N nodes with message size M and per-node BW B:

   T_ring = 2(N-1)/N × M/B + (2N-2) × α

where α is per-hop latency. For large M, dominated by the bandwidth term ≈ 2 × M/B.

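For a tree AllReduce (log N depth), with the message pipelined in chunks up and down the tree:

   T_tree = 2 log_2(N) × α + 2 × M/B

The latency term grows as log N instead of N, so the tree beats the ring for small messages and large N; for large M the bandwidth term is about the same ≈ 2 × M/B.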

For SHARP / NVLS in-network reduction:

   T_sharp = M/B + α × log_2(N)

Approximately half the ring time at large M. The latency term dominates for small M.
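
The ring and in-network models are easy to compare numerically. The sketch below implements the two formulas exactly as written in this section and evaluates them for N = 72, M = 64 GB, B = 900 GB/s; the 2 µs per-hop latency in the second example is an assumed value for illustration.

```python
import math


def t_ring(n: int, m: float, b: float, alpha: float = 0.0) -> float:
    """Ring AllReduce time: 2(N-1)/N * M/B + (2N-2) * alpha."""
    return 2 * (n - 1) / n * m / b + (2 * n - 2) * alpha


def t_sharp(n: int, m: float, b: float, alpha: float = 0.0) -> float:
    """In-network reduction (SHARP / NVLS) time: M/B + alpha * log2(N)."""
    return m / b + alpha * math.log2(n)


GB = 1e9
print(f"ring, 64 GB : {t_ring(72, 64 * GB, 900 * GB) * 1e3:.1f} ms")
print(f"sharp, 64 GB: {t_sharp(72, 64 * GB, 900 * GB) * 1e3:.1f} ms")

# Small message with an assumed 2 us per-hop latency: the alpha term dominates.
print(f"ring, 1 MB  : {t_ring(72, 1e6, 900 * GB, alpha=2e-6) * 1e6:.0f} us")
print(f"sharp, 1 MB : {t_sharp(72, 1e6, 900 * GB, alpha=2e-6) * 1e6:.0f} us")
```

For the large message the two models give roughly 140 ms and 71 ms. For the small message the gap is much wider, because the ring pays 2N-2 latency hops while in-network reduction pays only log2(N).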

14.4 Clos Network Formula

For a 3-tier Clos (fat tree) with k-port switches at every tier, each switch below the core splits its ports evenly between down and up:

  • Tier-1 (leaf): k/2 servers down, k/2 uplinks up
  • Tier-2 (spine): k/2 leaves down, k/2 cores up
  • Tier-3 (core): k spines down (one per pod)
  • Total servers: k pods × k/2 leaves per pod × k/2 servers per leaf = k³ / 4

A k=64 Clos supports k³/4 = 65536 servers at full bisection. Beyond this, you go to 5-tier (super-spine), or you go to dragonfly.
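
The k³/4 figure can be derived by counting switches tier by tier. The sketch assumes the usual fat-tree construction, in which every leaf and spine switch splits its k ports evenly between down and up and the fabric is organized as k pods; the function name and dictionary keys are illustrative.

```python
def fat_tree(k: int) -> dict:
    """Element counts for a full-bisection 3-tier fat tree of k-port switches."""
    if k % 2:
        raise ValueError("k must be even")
    half = k // 2
    return {
        "pods": k,
        "leaves": k * half,            # k/2 leaves per pod
        "spines": k * half,            # k/2 spines per pod
        "cores": half * half,          # (k/2)^2 core switches
        "servers": k * half * half,    # k/2 servers per leaf = k^3 / 4
    }


for k in (4, 32, 64):
    t = fat_tree(k)
    print(f"k={k:>2}: {t['servers']:>6} servers, "
          f"{t['leaves'] + t['spines'] + t['cores']} switches")
```

For k = 64 this gives 65536 servers, at the cost of 5120 switches, which is part of why larger deployments look at dragonfly or rail-optimized designs.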


15. Power and Cost at Scale

15.1 Power per Gbps

Per-Gbps power has dropped 100× over 25 years but is now plateauing. Approximate W/Gbps for switching:

| Year | Speed | W/Gbps (typical) |
| --- | --- | --- |
| 2010 | 10 GbE | ~10 W |
| 2015 | 25 GbE | ~3 W |
| 2020 | 100 GbE | ~1 W |
| 2024 | 400 GbE | ~0.3-0.5 W |
| 2025 | 800 GbE | ~0.2-0.3 W |
| 2026 (projected) | 1.6 TbE | ~0.15-0.2 W (electrical), ~0.08-0.12 W (CPO) |

The electrical-to-optical crossover: Around 1.6 Tbps per port, the SerDes power required to drive electrical signals from switch ASIC across PCB to pluggable optics (a few dozen cm of board) becomes comparable to the optical engine power itself. Beyond that, CPO is cheaper and lower-power. This is why hyperscalers are aggressively pursuing CPO for 1.6T+.

15.2 Cost — $/Port and Cable Types

Approximate 2025 list pricing (often discounted 50%+ in volume):

| Optic / Cable | Reach | List $/port |
| --- | --- | --- |
| 100G DAC (copper) | < 3 m | $80-150 |
| 100G AOC | 3-30 m | $300-500 |
| 100G SR4 (multi-mode) | 100 m | $400-700 |
| 100G LR4 (single-mode) | 10 km | $1000-2000 |
| 400G DAC | < 2 m | $300-500 |
| 400G AOC | 5-30 m | $800-1500 |
| 400G DR4 (single-mode) | 500 m | $1500-3000 |
| 400G FR4 / LR4 | 2-10 km | $3000-7000 |
| 400ZR | 80-120 km | $5000-10000 |
| 800G AOC | 5-30 m | $1500-3000 |
| 800G DR8 | 500 m | $3000-6000 |
| 800G ZR | 80-120 km | $10000-20000 |
| 1.6T DR8 | 500 m | $5000-12000 (early 2026 pricing) |

DAC = Direct Attach Cable (copper, < 3 m, cheapest, lowest latency). AOC = Active Optical Cable (optics permanently embedded in cable, plug-and-play, but fixed length). SR / DR / FR / LR / ER / ZR are reach grades: SR = short reach (multi-mode, 100 m). DR = data center reach (single-mode, 500 m). FR = far reach (single-mode, 2 km). LR = long reach (10 km). ER = extended reach (40 km). ZR = the 80 km-and-beyond class, coherent at 400G and above.

For a 1024-GPU cluster, optical interconnect can easily account for 15-30% of total system cost. CPO is projected to cut this in half by 2027.

15.3 CPO Necessity for >1.6T

At 1.6 TbE per port (8 lanes × 200 Gbps PAM4, roughly 106 GBd per lane), PCB trace loss + connector reflections become severe enough that signal integrity requires one of:

  1. Very short PCB traces (< 10 cm)
  2. Retimers (which consume power and add latency)
  3. Co-packaged optics (eliminating PCB entirely beyond the package)

Hyperscalers (Google, Microsoft, Meta) have all committed to CPO for >800G/port deployments by 2027.


16. Security

16.1 MACsec (802.1AE)

MACsec is L2 hop-by-hop encryption — every Ethernet frame is encrypted on the wire and decrypted at the next hop. Uses AES-128-GCM or AES-256-GCM. Now standard at line rate on most enterprise/DC NICs and switches.

  • Negotiated via MKA (MACsec Key Agreement, 802.1X) or static keys
  • Latency: < 100 ns added per hop
  • Throughput: line-rate on modern NICs (BlueField, ConnectX-7+)

Use cases: DCI links (inter-rack, inter-DC), regulated workloads, zero-trust networks. Microsoft requires MACsec on Azure DCI for FedRAMP.

16.2 IPsec

L3 encryption with ESP (Encapsulating Security Payload). Modern NICs (BlueField-3, AWS Nitro) offload IPsec in hardware at line rate. Used for cross-region VPCs, hybrid cloud, WAN.

16.3 InfiniBand P_Keys, Q_Keys

  • P_Key (16-bit Partition Key): IB equivalent of VLAN. Switch checks P_Key on every packet; mismatched packets dropped. Configured by OpenSM.
  • Q_Key (32-bit Queue Key): Per-QP access token for UD QPs. Sender includes Q_Key in WR; receiver QP verifies match. Used to gate UD multicast.

P_Keys provide coarse multi-tenant isolation but not cryptographic security (no encryption).

16.4 CXL IDE and TDISP

CXL IDE (Integrity and Data Encryption): Per-FLIT AES-GCM encryption on CXL links. Selectable per virtual channel. Adds ~3 ns latency. Mandatory for confidential CXL deployments.

TDISP (TEE Device Interface Security Protocol): PCIe spec adopted by CXL. Lets a confidential VM (Intel TDX, AMD SEV-SNP, ARM CCA Realm) cryptographically verify that a CXL device is:

  1. Genuine (DICE-attested)
  2. In a "trusted" state (firmware verified, no debug mode)
  3. Owned exclusively by this TEE (not shared)

After TDISP attestation, the device's MMIO and DMA regions are protected by the IOMMU + memory encryption — invisible to host kernel. Required for cloud-confidential AI workloads (host operator cannot snoop tenant GPU/CXL traffic).


17. Mental Models — Decision Framework

17.1 Workload-by-Scale Decision Table

| Workload | Scale | Best Fabric | Why |
|---|---|---|---|
| AI training (8 GPUs) | < 1 server | NVLink (intra-server) | NVSwitch BW |
| AI training (32-72 GPUs) | < 1 rack | NVLink + NVSwitch (NVL72) | Single coherent domain |
| AI training (100-10k GPUs) | < 1 cluster | InfiniBand NDR/XDR or RoCE+UEC | Bandwidth + low tail |
| LLM inference (single node) | 1 server | NVLink + GDR over PCIe | KV cache locality |
| Disaggregated inference | 10-100s nodes | RoCE + NIXL | KV cache transfer |
| HPC (CFD, weather, MD) | 1k-100k nodes | Slingshot / IB / TofuD | Low-latency, dragonfly/torus |
| OLTP database | 10-100 nodes | RoCE or TCP/IP | Standard DC fabric is fine |
| OLAP / lakehouse | 100-10k nodes | RoCE or TCP/IP + NVMe-oF | Disk-IO-bound; RDMA storage |
| Memory pool (CXL) | 1 rack | CXL 3.x fabric | Coherent shared memory |
| Storage (NVMe-oF) | 10-1000 servers | RoCE or TCP/IP | Mature NVMe-oF |
| Distributed KV (FoundationDB, Aurora) | 100-1000 nodes | RoCE or TCP/IP | LSN-ordered, latency-tolerant |
| Web tier | 100-100k servers | Standard Ethernet + TCP/QUIC | Mature, cheapest |

17.2 Latency / BW / Cost Tradeoff Matrix

| Fabric | Latency | BW | Cost ($/port) | Lossless? | Vendor lock |
|---|---|---|---|---|---|
| PCIe 5/6 | 100-200 ns | 64-128 GB/s | included on motherboard | n/a | none |
| NVLink5/NVSwitch | 100-500 ns | 1.8 TB/s | embedded in GPU | yes | NVIDIA |
| UCIe (chiplet) | 5-10 ns | 1-4 TB/s | bump area | yes | open (consortium) |
| CXL 3.x | 100-300 ns | 64-128 GB/s | $500-2000 (cable) | yes | open (consortium) |
| InfiniBand NDR | 1-3 µs | 400 Gb/s | $2000-5000 | yes | NVIDIA (Mellanox) |
| RoCEv2 | 2-5 µs | 100-400 Gb/s | $1000-3000 | yes (with PFC) | open |
| UEC (Ultra Ethernet) | 2-5 µs | 100-1600 Gb/s | $1000-3000 | no (lossy, OK) | open (consortium) |
| Slingshot 12 | 2-5 µs | 400 Gb/s | $5000-10000 | partial | HPE |
| Standard Ethernet + TCP | 10-30 µs | 25-800 Gb/s | $200-2000 | no | open |
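One way to use the matrix is as a first-pass filter: state a latency budget and a bandwidth floor, then see which rows survive. The sketch below encodes the upper end of each latency and bandwidth range from the table (GB/s converted to Gb/s). The `Fabric` type and `candidates` function are illustrative, and reach (on-package, board, rack) is not in the matrix, so pick the tier first.

```python
from typing import NamedTuple


class Fabric(NamedTuple):
    name: str
    worst_latency_ns: float  # upper end of the latency range in the table
    peak_gbps: float         # upper end of the bandwidth range, in Gb/s
    open_standard: bool      # False where the table lists a single vendor


# Values transcribed from the matrix; 1 GB/s = 8 Gb/s, 1 TB/s = 8000 Gb/s.
FABRICS = [
    Fabric("PCIe 5/6", 200, 128 * 8, True),
    Fabric("NVLink5/NVSwitch", 500, 1.8 * 8000, False),
    Fabric("UCIe (chiplet)", 10, 4 * 8000, True),
    Fabric("CXL 3.x", 300, 128 * 8, True),
    Fabric("InfiniBand NDR", 3000, 400, False),
    Fabric("RoCEv2", 5000, 400, True),
    Fabric("UEC (Ultra Ethernet)", 5000, 1600, True),
    Fabric("Slingshot 12", 5000, 400, False),
    Fabric("Standard Ethernet + TCP", 30000, 800, True),
]


def candidates(max_latency_ns: float, min_gbps: float,
               require_open: bool = False) -> list[str]:
    """Names of fabrics whose worst-case latency and peak bandwidth fit."""
    return [
        f.name
        for f in FABRICS
        if f.worst_latency_ns <= max_latency_ns
        and f.peak_gbps >= min_gbps
        and (f.open_standard or not require_open)
    ]
```

For example, `candidates(5000, 1000, require_open=True)` keeps only the open rows that can exceed 1 Tb/s within a 5 µs budget. Cost and losslessness are left out of the filter because the table gives them as ranges and qualifiers rather than comparable numbers.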

18. Practical Skills — Commands and Benchmarks

18.1 Topology Discovery

# PCIe tree
lspci -tvvv                 # Tree view of PCI bus
lspci -vv                   # Verbose per-device (BARs, capabilities, AER)
lspci -nn | grep -i mell    # Find Mellanox / NVIDIA NICs

# Hardware topology (NUMA + PCIe + cores)
lstopo                      # Graphical (PDF/PNG output)
lstopo --of console         # Text
hwloc-ls                    # Same as lstopo --of console
hwloc-distrib 8             # Suggest CPU set for 8-way parallelism

# NUMA placement
cat /sys/bus/pci/devices/0000:01:00.0/numa_node     # NIC's NUMA node
numactl -H                                           # NUMA topology
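The NUMA lookup above is usually the first step of pinning a benchmark or service next to its NIC. A minimal sketch that reads the same sysfs files and builds a `numactl` prefix is shown below. The helper names are illustrative, and `sysfs_root` is a parameter only so the functions can be exercised against a fake tree.

```python
from pathlib import Path


def pci_numa_node(bdf: str, sysfs_root: str = "/sys") -> int:
    """Return the NUMA node of a PCI device, or -1 if the kernel reports none."""
    path = Path(sysfs_root) / "bus" / "pci" / "devices" / bdf / "numa_node"
    return int(path.read_text().strip())


def node_cpulist(node: int, sysfs_root: str = "/sys") -> str:
    """Return the CPU list of a NUMA node exactly as the kernel prints it."""
    path = Path(sysfs_root) / "devices" / "system" / "node" / f"node{node}" / "cpulist"
    return path.read_text().strip()


def numactl_prefix(bdf: str, sysfs_root: str = "/sys") -> str:
    """Build a numactl prefix that keeps CPU and memory on the device's node."""
    node = pci_numa_node(bdf, sysfs_root)
    if node < 0:
        # No affinity reported (single-socket host or missing firmware tables).
        return ""
    return f"numactl --cpunodebind={node} --membind={node}"


if __name__ == "__main__":
    # The PCI address is the example one used in the commands above.
    print(numactl_prefix("0000:01:00.0"))
```

Prepending the returned string to a perftest or NCCL launch keeps the process on the socket that owns the NIC's PCIe root.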

18.2 InfiniBand Inspection

ibstat                      # Per-HCA status (LID, state, port speeds)
ibv_devinfo -v              # Verbose verbs device info
ibportstate 1 1             # Port state for LID 1, port 1 (first argument is a LID by default)
iblinkinfo                  # All links + remote endpoint
ibhosts                     # Discover all HCAs
ibroute                     # Per-switch routing table
saquery -N                  # Subnet Admin query: list all node records (-t is the timeout option)

18.3 RDMA Benchmarks (perftest)

# Server (start one test at a time; the client runs the same tool plus the server address)
ib_send_bw -d mlx5_0                    # Send bandwidth
ib_send_lat -d mlx5_0                   # Send latency
ib_write_bw -d mlx5_0 --report_gbits    # RDMA write bandwidth, Gbit/s
ib_write_lat -d mlx5_0                  # RDMA write latency
ib_read_lat -d mlx5_0                   # RDMA read latency
ib_atomic_lat -d mlx5_0                 # Atomic op latency

# Client (other side; pass the same -q/-x options on the server)
ib_write_bw -d mlx5_0 -q 4 -x 3 -F --report_gbits server_ip
#   -q 4 : 4 QPs (parallel)
#   -x 3 : GID index (RoCEv2)
#   -F   : skip CPU frequency check (recommended on shared nodes)

Expected on a tuned ConnectX-7 NDR (400 Gb/s):

  • ib_send_lat: ~1.0-1.2 µs
  • ib_write_bw: ~390-395 Gb/s
  • ib_read_lat: ~1.5-2 µs (round-trip cost)
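The gap between 400 Gb/s and the ~390-395 Gb/s that ib_write_bw reports is largely per-packet header overhead. The sketch below computes the payload ceiling for a given MTU. The header sizes are assumptions from the InfiniBand and RoCEv2 wire formats, not from this document, and the function name is illustrative.

```python
# Assumed per-packet overhead for an InfiniBand RC RDMA WRITE without GRH:
# LRH 8 + BTH 12 + ICRC 4 + VCRC 2. The 16-byte RETH appears only in the
# first packet of a message and is ignored here.
IB_OVERHEAD_BYTES = 8 + 12 + 4 + 2

# Assumed per-packet overhead for RoCEv2 over untagged IPv4 Ethernet:
# Ethernet 14 + IPv4 20 + UDP 8 + BTH 12 + ICRC 4 + FCS 4, plus 20 bytes of
# preamble and inter-frame gap on the wire.
ROCEV2_OVERHEAD_BYTES = 14 + 20 + 8 + 12 + 4 + 4 + 20


def payload_ceiling_gbps(link_gbps: float, mtu_bytes: int,
                         overhead_bytes: int = IB_OVERHEAD_BYTES) -> float:
    """Best-case payload rate when every packet carries a full MTU of data."""
    return link_gbps * mtu_bytes / (mtu_bytes + overhead_bytes)


if __name__ == "__main__":
    for mtu in (1024, 2048, 4096):
        ib = payload_ceiling_gbps(400, mtu)
        roce = payload_ceiling_gbps(400, mtu, ROCEV2_OVERHEAD_BYTES)
        print(f"MTU {mtu}: IB {ib:.1f} Gb/s, RoCEv2 {roce:.1f} Gb/s")
```

A measured result well below this ceiling usually points at PCIe width, NUMA placement, or too few QPs rather than at the fabric itself.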

18.4 NIC Tuning

# Driver info
ethtool -i eth0                         # Driver, version, firmware

# Ring buffer sizes
ethtool -g eth0                         # Current + max ring sizes
ethtool -G eth0 rx 8192 tx 8192         # Set ring sizes

# Queue count
ethtool -l eth0                         # Current + max queues
ethtool -L eth0 combined 32             # Set 32 combined queues

# Coalesce (interrupt moderation)
ethtool -c eth0                         # Current
ethtool -C eth0 rx-usecs 16 tx-usecs 16 # Delay interrupts up to 16 µs (rx-frames/tx-frames set the packet-count trigger)

# Offloads
ethtool -k eth0                         # Current offloads
ethtool -K eth0 tx-checksumming on rx-checksumming on tso on lro on

# Mellanox-specific: DCB / RoCE
mlnx_qos -i eth0                        # Show DCB config
mlnx_qos -i eth0 --pfc 0,0,0,1,0,0,0,0  # PFC on priority 3 only
mlnx_qos -i eth0 --trust dscp           # Use DSCP for priority (vs PCP)

# Mellanox firmware tools
mst start                               # Start the MFT service; creates /dev/mst/* device nodes
mlxconfig -d /dev/mst/mt4119_pciconf0 q # Query firmware config (device name varies by NIC; see mst status)
mlxlink -d mlx5_0 -p 1                  # Port-level link info: speed, FEC, errors
mlxlink -d mlx5_0 -p 1 --show_fec       # FEC capabilities and active FEC mode
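The coalescing setting trades interrupt rate against added latency. The arithmetic can be sketched as follows, assuming adaptive moderation is off and only the timer (not a frame-count threshold) fires the interrupt. The function names are illustrative.

```python
def max_interrupt_rate(rx_usecs: int) -> float:
    """Upper bound on interrupts per second per queue when only the timer fires."""
    if rx_usecs <= 0:
        return float("inf")  # no moderation: every packet may raise an interrupt
    return 1_000_000 / rx_usecs


def packets_per_interrupt(pps_per_queue: float, rx_usecs: int) -> float:
    """Average packets batched per interrupt at a steady per-queue packet rate."""
    return pps_per_queue * rx_usecs / 1_000_000


def worst_case_added_latency_us(rx_usecs: int) -> int:
    """A packet arriving just after the timer starts waits the full window."""
    return max(rx_usecs, 0)


if __name__ == "__main__":
    usecs = 16
    print(f"rx-usecs {usecs}: at most {max_interrupt_rate(usecs):,.0f} irq/s per queue")
    print(f"at 1 Mpps per queue: {packets_per_interrupt(1_000_000, usecs):.0f} pkts per irq")
```

Latency-sensitive RDMA control paths tend to want a small window, while bulk TCP traffic benefits from a larger one because each interrupt then amortizes over more packets.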

18.5 NCCL Diagnostics

# Run NCCL-tests
git clone https://github.com/NVIDIA/nccl-tests
cd nccl-tests && make MPI=1 CUDA_HOME=/usr/local/cuda
mpirun -np 8 ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1

# Debug logging
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL,P2P,NET mpirun ...

# Topology dump
NCCL_TOPO_DUMP_FILE=/tmp/topo.xml mpirun ...

# Force specific transport
NCCL_NET_GDR_LEVEL=PHB mpirun ...       # GPUDirect RDMA only when GPU and NIC are on the same NUMA node (PHB) or closer
NCCL_IB_HCA=mlx5_0,mlx5_1 mpirun ...    # Use only these HCAs (rail-aware)
NCCL_IB_GID_INDEX=3 mpirun ...          # RoCEv2 GID
NCCL_P2P_DISABLE=1 mpirun ...           # Disable GPU peer-to-peer transport (NVLink and PCIe P2P; debug)
NCCL_ALGO=Tree mpirun ...               # Force tree AllReduce
NCCL_COLLNET_ENABLE=1 mpirun ...        # Enable SHARP
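nccl-tests prints both an algorithm bandwidth (algbw) and a bus bandwidth (busbw), and only busbw is comparable to link speed. The sketch below reproduces the conversion using the correction factors described in the nccl-tests performance notes. The function and table names are illustrative.

```python
# Correction factor from algorithm bandwidth to bus bandwidth for n ranks,
# as described in the nccl-tests performance notes.
BUS_FACTOR = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "all_gather": lambda n: (n - 1) / n,
    "reduce_scatter": lambda n: (n - 1) / n,
    "broadcast": lambda n: 1.0,
    "reduce": lambda n: 1.0,
}


def algbw_gbs(size_bytes: float, time_us: float) -> float:
    """Algorithm bandwidth in GB/s: message size divided by elapsed time."""
    return size_bytes / time_us / 1e3  # bytes per µs = MB/s; /1e3 gives GB/s


def busbw_gbs(collective: str, n_ranks: int, algbw: float) -> float:
    """Bus bandwidth in GB/s: the per-link rate implied by algbw."""
    return algbw * BUS_FACTOR[collective](n_ranks)


if __name__ == "__main__":
    alg = algbw_gbs(size_bytes=8 * 2**30, time_us=50_000)
    print(f"algbw {alg:.1f} GB/s, busbw {busbw_gbs('all_reduce', 8, alg):.1f} GB/s")
```

Comparing busbw against the NVLink or NIC line rate shows how close a run is to the hardware limit regardless of the rank count.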

18.6 CXL

cxl list                                # List CXL devices
cxl list -v                             # Verbose: capacity, partitions
cxl create-region -d decoder0.0 -m mem0 # Create memory region
cxl list -R -D                          # Re-check regions and decoders after a topology change
daxctl list                             # DAX devices (CXL memory exposed as devdax)
daxctl reconfigure-device dax0.0 -m system-ram  # Make CXL mem into NUMA node
ndctl list                              # PMEM/NVDIMM (parallel structure)

18.7 PCIe Performance Counters

# Uncore PMU events (Intel; Skylake-SP and later expose uncore_iio_*; event encodings are model-specific)
perf stat -e uncore_iio_0/event=0x83,umask=0x04/ ...   # IIO inbound bytes
perf stat -e uncore_iio_*/event=0x83/ ...              # All IIO devices

# Intel PCM (Performance Counter Monitor)
pcm                                      # Live CPU/memory/PCIe view
pcm-pcie                                 # Per-socket PCIe traffic (pcm-iio breaks it down per IIO stack)

# AER errors
lspci -vv | grep -i -A 5 "Advanced Error"

# PCIe link speed/width (current vs max)
lspci -vv -s 0000:01:00.0 | grep -i "lnksta\|lnkcap"
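The LnkSta line gives speed in GT/s and width in lanes; turning that into usable bandwidth requires the line-encoding efficiency. A minimal sketch of that conversion follows. The sample line format is an assumption about typical lspci output, and Gen6 FLIT/FEC overhead is deliberately not modeled.

```python
import re

# Matches e.g. "LnkSta: Speed 16GT/s (ok), Width x16 (ok)".
_LNK_RE = re.compile(r"Speed\s+([\d.]+)GT/s.*?Width\s+x(\d+)")


def parse_link(line: str) -> tuple[float, int]:
    """Extract (GT/s per lane, lane count) from an lspci LnkSta/LnkCap line."""
    m = _LNK_RE.search(line)
    if m is None:
        raise ValueError(f"no link speed/width found in: {line!r}")
    return float(m.group(1)), int(m.group(2))


def encoding_efficiency(gt_per_s: float) -> float:
    """Fraction of raw bits that carry data for a given per-lane rate."""
    if gt_per_s <= 5.0:
        return 8 / 10      # Gen1/Gen2 use 8b/10b
    if gt_per_s <= 32.0:
        return 128 / 130   # Gen3 to Gen5 use 128b/130b
    return 1.0             # Gen6+ FLIT mode; FLIT/FEC overhead not modeled


def link_gbytes_per_s(gt_per_s: float, width: int) -> float:
    """Per-direction bandwidth in GB/s before TLP/DLLP protocol overhead."""
    return gt_per_s * width * encoding_efficiency(gt_per_s) / 8


if __name__ == "__main__":
    speed, width = parse_link("LnkSta: Speed 16GT/s (ok), Width x16 (ok)")
    print(f"{speed:g} GT/s x{width}: {link_gbytes_per_s(speed, width):.1f} GB/s per direction")
```

Running the same calculation on LnkCap and LnkSta makes a downgraded link obvious: a Gen5 x16 slot that trained at Gen4 x8 delivers roughly a quarter of its rated bandwidth.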

18.8 Standard Benchmarks

| Benchmark | What it measures | Command |
|---|---|---|
| OSU | MPI point-to-point + collectives | osu_latency, osu_bw, osu_allreduce |
| NCCL-tests | NCCL GPU collectives | all_reduce_perf, all_gather_perf |
| iperf3 | TCP/UDP bandwidth | iperf3 -s / iperf3 -c server -P 16 |
| netperf | Latency + throughput | netperf -H server -t TCP_RR |
| fio | Storage + NVMe-oF IOPS | fio --rw=randread --bs=4k --iodepth=64 ... |
| MLPerf Training/Inference | End-to-end AI workload (NCCL component) | as per MLPerf rules |
| HPCG | HPC sparse | xhpcg |
| HPL (LINPACK) | HPC dense matrix | xhpl |

19. Further Reading

19.1 Datacenter Networking and RDMA

Citations grouped by topic. Conference codes: SIGCOMM = ACM SIGCOMM, NSDI = USENIX Networked Systems Design and Implementation, SOSP = ACM Symposium on Operating Systems Principles, OSDI = USENIX Operating Systems Design and Implementation, SC = International Conference for High Performance Computing, Networking, Storage and Analysis.

  • Alizadeh, Greenberg, Maltz, Padhye, Patel, Prabhakar, Sengupta, Sridharan. "Data Center TCP (DCTCP)." SIGCOMM 2010.
  • Zhu, Eran, Firestone, Guo, Lipshteyn, Liron, Padhye, Raindel, Yahia, Zhang. "Congestion Control for Large-Scale RDMA Deployments" (DCQCN). SIGCOMM 2015.
  • Mittal, Lam, Dukkipati, Blem, Wassel, Ghobadi, Vahdat, Wang, Wetherall, Zats. "TIMELY: RTT-based Congestion Control for the Datacenter." SIGCOMM 2015.
  • Li, Miao, Liu, Zhuang, Feng, Tang, Cao, Zhang, Kelly, Alizadeh, Yu. "HPCC: High Precision Congestion Control." SIGCOMM 2019.
  • Kumar, Dukkipati, Jang, Wassel, Wu, Montazeri, Wang, Springborn, Alfeld, Ryan, Wetherall, Vahdat. "Swift: Delay is Simple and Effective for Congestion Control in the Datacenter." SIGCOMM 2020.
  • Addanki, Michel, Schmid. "PowerTCP: Pushing the Performance Limits of Datacenter Networks." NSDI 2022.
  • Olteanu, Agache, Voinescu, Raiciu. "An Edge-Queued Datagram Service for All Datacenter Traffic" (EQDS). NSDI 2022.
  • Mittal, Shpiner, Panda, Zahavi, Krishnamurthy, Ratnasamy, Shenker. "Revisiting Network Support for RDMA" (IRN). SIGCOMM 2018.
  • Stephens, Akella, Swift. "Loom: Flexible and Efficient NIC Packet Scheduling." NSDI 2019.
  • Saeed et al. "Annulus: A Dual Congestion Control Loop for Datacenter and WAN Traffic Aggregates." SIGCOMM 2020.
  • Dragojević, Narayanan, Hodson, Castro. "FaRM: Fast Remote Memory." NSDI 2014.
  • Guo, Wu, Deng, Soni, Ye, Padhye, Lipshteyn. "RDMA over Commodity Ethernet at Scale." SIGCOMM 2016.
  • Singh, Ong, Agarwal, Anderson, Armistead, Bannon, Boving, Desai, Felderman, Germano, Kanagala, Provost, Simmons, Tanda, Wanderer, Hölzle, Stuart, Vahdat. "Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network." SIGCOMM 2015.
  • Poutievski, Mashayekhi, Ong, Singh, Tariq, et al. "Jupiter Evolving: Transforming Google's Datacenter Network via Optical Circuit Switches and Software-Defined Networking." SIGCOMM 2022.
  • Gibson, Hariharan, Lance, McLaren, Montazeri, Singh, Wang, Wassel, Wu, Yoo, et al. "Aquila: A Unified, Low-Latency Fabric for Datacenter Networks." NSDI 2022.
  • Gangidi, Miao, Zheng, Bondu, et al. "RDMA over Ethernet for Distributed Training at Meta Scale." SIGCOMM 2024.
  • Mellette, McGuinness, Roy, Forencich, Papen, Snoeren, Porter. "RotorNet: A Scalable, Low-Complexity, Optical Datacenter Network." SIGCOMM 2017.

19.2 HPC Fabrics and Optical Networks

  • De Sensi, Di Girolamo, McMahon, Roweth, Hoefler. "An In-Depth Analysis of the Slingshot Interconnect." SC 2020.
  • Ajima, Kawashima, Okamoto, Shida, Hirai, Shimizu, Hiramoto, Ikeda, Yoshikawa, Uchida, Inoue. "The Tofu Interconnect D." IEEE CLUSTER 2018 (Fugaku).
  • Alverson, Roweth, Kaplan. "The Gemini System Interconnect." Hot Interconnects 2010.
  • Faanes, Bataineh, Roweth, Court, Froese, Alverson, Johnson, Kopnick, Higgins, Reinhard. "Cray Cascade: A Scalable HPC System Based on a Dragonfly Network" (Aries). SC 2012.
  • Shaw, Grossman, Bank, Batson, Butts, et al. "Anton 2: Raising the Bar for Performance and Programmability in a Special-Purpose Molecular Dynamics Supercomputer." SC 2014.
  • Urata, Liu, Yasumura, et al. "Mission Apollo: Landing Optical Circuit Switching at Datacenter Scale." arXiv preprint, 2022.
  • Ballani, Costa, Behrendt, Cletheroe, Haller, Jozwik, Karinou, Lange, Shi, Thomsen, Williams. "Sirius: A Flat Datacenter Network with Nanosecond Optical Switching." SIGCOMM 2020.
  • Mellette, Das, Guo, McGuinness, Snoeren, Porter. "Expanding Across Time to Deliver Bandwidth Efficiency and Low Latency" (Opera). NSDI 2020.
  • Khani, Ghobadi, Alizadeh, Zhu, Glick, Bergman, Vahdat, Klenk, Ebrahimi. "SiP-ML: High-Bandwidth Optical Network Interconnects for Machine Learning Training." SIGCOMM 2021.

19.3 Standards and Specifications

  • PCI-SIG. "PCI Express Base Specification Revision 7.0." 2025.
  • Compute Express Link Consortium. "CXL 3.2 Specification." Dec 2024.
  • UCIe Consortium. "UCIe 3.0 Specification." Aug 2025.
  • Ultra Ethernet Consortium. "Ultra Ethernet Specification 1.0." Jun 2025.
  • InfiniBand Trade Association. "InfiniBand Architecture Specification 1.7 (Volume 1)." 2023.
  • IEEE 802.3df-2024. "Standard for Ethernet: 400 Gb/s and 800 Gb/s Operation." 2024.
  • IEEE P802.3dj (draft). "200 Gb/s, 400 Gb/s, 800 Gb/s, and 1.6 Tb/s Operation." In progress; ratification expected 2026.
  • IEEE 802.1Qbb. "Priority-based Flow Control." 2011.
  • IEEE 802.1Qaz. "Enhanced Transmission Selection." 2011.
  • ARM. "AMBA AXI and ACE Protocol Specification." Issue G, 2021.
  • ARM. "AMBA CHI Architecture Specification." Issue F, 2023.
  • NVM Express, Inc. "NVMe over Fabrics Specification 1.1a." 2021.

19.4 Books

  • Dally, Towles. "Principles and Practices of Interconnection Networks." Morgan Kaufmann, 2003.
  • Hennessy, Patterson. "Computer Architecture: A Quantitative Approach" (6th ed.). Morgan Kaufmann, 2017.
  • Duato, Yalamanchili, Ni. "Interconnection Networks: An Engineering Approach." Morgan Kaufmann, 2003.

19.5 Talks, Blog Posts, Vendor Materials

  • Microsoft Azure RDMA team (Bansal et al.) blog series 2023-2024 on RoCE at scale: deployment lessons.
  • NVIDIA GTC keynotes (2022-2025) for NVLink, NVSwitch, NVL72, Quantum-X Photonics architecture announcements.
  • Google Cloud research blogs on Apollo, Jupiter, Aquila, Lightning. (Sirius, cited above, is a Microsoft Research project.)
  • Meta Engineering blog on AI cluster networking (Llama 2/3 training infra), RoCE deployment.
  • HPE Cray Slingshot Architecture Whitepaper, Cassini NIC datasheet.
  • OpenFabrics Alliance workshops (annual): UCX, libfabric, OFI provider updates.
  • SNIA tutorials on NVMe-oF, persistent memory, CXL.

Cross-references: pcie_internals.md, superscalar_ooo_cpu.md, gpu_tpu_accelerator_design.md, disaggregated_storage.md, vfio_internals.md, io_uring_internals.md, isa_critical_instructions.md.