PCI Express Internals: Architecture, Protocols, and Implementation
A comprehensive implementation-level reference for PCI Express — covering architecture fundamentals, packet formats with bit-level detail, configuration space registers, capability structures, interrupt mechanisms, Linux kernel driver APIs, device emulation for VMMs, SR-IOV, passthrough, error handling, power management, and advanced topics (CXL, NVMe, P2P DMA, FLIT mode). Written for engineers implementing PCIe device emulation, writing PCIe drivers, or working with PCIe passthrough in virtualization.
Table of Contents
- Evolution: ISA to PCI to PCIe
- PCIe Architecture Fundamentals
- Transaction Layer — TLPs
- Data Link Layer — DLLPs
- Physical Layer
- Configuration Space
- Base Address Registers (BARs)
- PCIe Capabilities
- Interrupt Mechanisms
- Enumeration and Resource Assignment
- PCIe in the Linux Kernel
- PCIe Device Emulation for VMMs
- PCIe Passthrough and SR-IOV
- Power Management
- Error Handling
- Advanced Topics
- Key References
1. Evolution: ISA to PCI to PCIe
The Journey
             ISA (1981)      PCI (1992)       PCI-X (1998)    PCIe 1.0 (2003)
Width        8/16-bit        32/64-bit        64-bit          Serial
Topology     parallel bus    parallel bus     parallel bus    point-to-point
Clock/rate   4.77-8.33 MHz   33/66 MHz        133 MHz         2.5 GT/s/lane
Bandwidth    ~8 MB/s         ~133/~533 MB/s   ~1064 MB/s      ~250 MB/s/lane (x16 = 4 GB/s)
ISA (Industry Standard Architecture), 1981: Intel 8088/8086 bus. 8-bit data (later 16-bit with the IBM PC/AT; the 32-bit EISA extension came afterwards). Shared parallel bus with no auto-configuration: IRQs and I/O ports were selected with jumpers. No bus mastering on the original 8-bit bus. Maximum ~8 MB/s.
PCI (Peripheral Component Interconnect), 1992: Intel-designed replacement. 32-bit shared parallel bus at 33 MHz (132 MB/s), later 64-bit at 66 MHz (528 MB/s in PCI 2.1). Key innovation: auto-configuration via configuration space — BIOS/OS reads Vendor/Device IDs, allocates resources (BARs, IRQs) automatically. Supports bus mastering. Multiplexed address/data lines (AD[31:0] carry both address and data in different clock phases). Tree topology via PCI-to-PCI bridges.
PCI-X, 1998: Extended PCI to 133 MHz (1066 MB/s, 64-bit). Split transactions (separate request and response phases) free the bus while the target prepares its data. Still a shared parallel bus, so the bandwidth is shared among all devices.
PCI Express 1.0, 2003: The paradigm shift. Replaced the shared parallel bus with serial point-to-point links using differential signaling. Each direction has dedicated lanes — no arbitration, no bus contention. Packet-based protocol (transaction layer packets over a serial link). Software-compatible with PCI (same configuration space model), but electrically completely different.
Why Serial Beats Parallel
The parallel bus problem at high frequencies:
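- Clock skew: All 32/64 data lines must be valid at the receiver within the same clock cycle. At 133 MHz (7.5 ns period), trace-length mismatch, clock distribution skew, and setup/hold requirements together consume most of the timing budget, and the margin shrinks further at higher clock rates.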
- Crosstalk: Adjacent parallel traces couple electromagnetically, worsening at higher frequencies.
- Stub loading: Each device on a shared bus adds electrical load (stub), degrading signal quality.
- Bus arbitration: Only one master at a time — wasted bandwidth when devices wait.
PCIe solves all of these:
- Serial per-lane: only one differential pair per direction, no skew across data bits.
- Point-to-point: no bus sharing, no arbitration, full bandwidth per link.
- Embedded clock: 8b/10b or 128b/130b encoding carries clock in the data stream — no separate clock signal.
- Scalable: add lanes (x1, x2, x4, x8, x16, x32) for more bandwidth.
PCIe Generations
Gen   Year   Data Rate     Encoding        BW/Lane        x16 BW       Signaling
──────────────────────────────────────────────────────────────────────────────────
1.0   2003   2.5 GT/s      8b/10b          250 MB/s       4 GB/s       NRZ
2.0   2007   5.0 GT/s      8b/10b          500 MB/s       8 GB/s       NRZ
3.0   2010   8.0 GT/s      128b/130b       ~985 MB/s      ~16 GB/s     NRZ
4.0   2017   16.0 GT/s     128b/130b       ~1969 MB/s     ~32 GB/s     NRZ
5.0   2019   32.0 GT/s     128b/130b       ~3938 MB/s     ~63 GB/s     NRZ
6.0   2022   64.0 GT/s     1b/1b (FLIT)    ~7563 MB/s     ~121 GB/s    PAM4
7.0   2025   128.0 GT/s    1b/1b (FLIT)    ~15125 MB/s    ~242 GB/s    PAM4
Gen 6-7 figures account for FLIT overhead: 242 of every 256 bytes carry
TLP and Data Link Layer payload; the rest is CRC and FEC.
8b/10b encoding (Gen 1-2): Each 8-bit byte encoded as 10-bit symbol. Guarantees DC balance and sufficient transitions for clock recovery. 20% overhead (250 MB/s from 2.5 Gbit/s raw).
128b/130b encoding (Gen 3-5): 128 data bits + 2-bit sync header. Only ~1.5% overhead. Requires scrambling (LFSR-based) for DC balance instead of encoding overhead.
PAM4 signaling (Gen 6+): 4-level Pulse Amplitude Modulation. Each symbol carries 2 bits (vs NRZ's 1 bit). Doubles data rate at same baud rate, but requires Forward Error Correction (FEC) due to reduced noise margins between the 4 voltage levels. Gen 6 pairs a lightweight, low-latency FEC with a strong CRC and link-level retry instead of relying on a heavy Reed-Solomon code.
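FLIT mode (Gen 6+): FLIT stands for Flow Control Unit. TLPs are packed into fixed 256-byte FLITs: 236 bytes of TLP data, 6 bytes of Data Link Layer payload, 8 bytes of CRC, and 6 bytes of FEC. The per-FLIT CRC replaces the per-TLP sequence number and LCRC. A FLIT can contain multiple small TLPs or fragments of large ones, so framing overhead is a fixed 20 bytes per FLIT (about 8%) regardless of TLP size, which mainly benefits small TLPs.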
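The per-lane figures for the 8b/10b and 128b/130b generations follow directly from the raw transfer rate and the encoding efficiency. A minimal sketch of that arithmetic (the function names are illustrative, and FLIT-mode generations are left out because their overhead is not a simple encoding ratio):

```python
# Usable bandwidth = raw rate (GT/s) x encoding efficiency / 8 bits per byte.
ENCODING_EFFICIENCY = {"8b/10b": 8 / 10, "128b/130b": 128 / 130}


def lane_bandwidth_mb_s(gt_per_s, encoding):
    """Usable bandwidth of one lane in one direction, in MB/s."""
    bits_per_s = gt_per_s * 1e9 * ENCODING_EFFICIENCY[encoding]
    return bits_per_s / 8 / 1e6


def link_bandwidth_gb_s(gt_per_s, encoding, lanes):
    """Usable bandwidth of a multi-lane link in one direction, in GB/s."""
    return lane_bandwidth_mb_s(gt_per_s, encoding) * lanes / 1000


if __name__ == "__main__":
    generations = [("1.0", 2.5, "8b/10b"), ("2.0", 5.0, "8b/10b"),
                   ("3.0", 8.0, "128b/130b"), ("4.0", 16.0, "128b/130b"),
                   ("5.0", 32.0, "128b/130b")]
    for gen, rate, enc in generations:
        print(gen, round(lane_bandwidth_mb_s(rate, enc)), "MB/s per lane,",
              round(link_bandwidth_gb_s(rate, enc, 16), 1), "GB/s at x16")
```

These numbers are per direction; a full-duplex link carries the same amount in the opposite direction at the same time.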
2. PCIe Architecture Fundamentals
Topology
PCIe uses a tree topology rooted at the Root Complex:
                    ┌─────────────┐
                    │     CPU     │
                    └──────┬──────┘
                           │
                    ┌──────┴──────┐
                    │ Root Complex│  ← Bridges CPU to PCIe
                    │    (RC)     │    fabric; may have
                    └──┬──────┬───┘    integrated endpoints
                       │      │
               ┌───────┘      └───────┐
               │                      │
         ┌─────┴─────┐          ┌─────┴─────┐
         │ Root Port │          │ Root Port │  ← Each RP is a
         │   (RP)    │          │   (RP)    │    virtual PCI-to-PCI
         └─────┬─────┘          └─────┬─────┘    bridge
               │                      │
         ┌─────┴─────┐          ┌─────┴──────┐
         │  Switch   │          │  Endpoint  │  ← NVMe SSD,
         │           │          │    (EP)    │    GPU, NIC, etc.
         └──┬────┬───┘          └────────────┘
            │    │
      ┌─────┘    └─────┐
┌─────┴─────┐    ┌─────┴─────┐
│ Endpoint  │    │ Endpoint  │
│   (EP)    │    │   (EP)    │
└───────────┘    └───────────┘
Root Complex (RC): Connects the processor and memory subsystem to the PCIe fabric. Generates configuration transactions. Contains one or more Root Ports. In x86 systems, typically part of the CPU die or chipset (PCH). The RC terminates the PCIe hierarchy — all TLPs addressed to system memory are consumed here.
Root Port (RP): A virtual PCI-to-PCI bridge within the RC. Each RP represents one downstream PCIe link. Has a Type 1 (bridge) configuration header with primary/secondary/subordinate bus numbers and memory/I/O windows.
Switch: A PCIe switching fabric. Contains one upstream port (connects toward RC) and multiple downstream ports (connect toward endpoints). Internally, the switch has a virtual PCI bus connecting all ports. Each port appears as a virtual PCI-to-PCI bridge. Switches perform address-based routing of TLPs between ports using the address windows configured in their bridge registers.
Endpoint (EP): A PCIe function that originates or terminates transactions. Has a Type 0 configuration header. Examples: NVMe controllers, NICs, GPUs, FPGA accelerators. Can have up to 8 functions per device (multi-function).
Bridge (to PCI/PCI-X): Converts between PCIe and legacy PCI/PCI-X protocols. Increasingly rare.
BDF Addressing
Every PCIe function is uniquely identified by a Bus:Device:Function (BDF) address:
┌─────────────────┬──────────────┬─────────────┐
│ Bus Number │ Device Number│ Function │
│ (8 bits) │ (5 bits) │ (3 bits) │
│ 0-255 │ 0-31 │ 0-7 │
└─────────────────┴──────────────┴─────────────┘
Total: 16 bits → Requester ID / Completer ID in TLPs
Example: 03:1f.2 = Bus 3, Device 31, Function 2
Requester ID = (3 << 8) | (31 << 3) | 2 = 0x03FA
PCI Segment Group (16 bits) extends this further:
Segment:Bus:Device:Function → 0000:03:1f.2
Allows up to 65536 segments × 256 buses = 16M buses
The BDF is assigned during enumeration. In PCIe, "Device" within a bus segment is typically 0 because each link connects exactly one device (unlike PCI where multiple devices share a bus). Multi-function devices use functions 0-7 on the same device number.
ARI (Alternative Routing-ID Interpretation): For SR-IOV and multi-function devices that need more than 8 functions. ARI removes the device number field, giving all 8 bits (device + function) to the function number, allowing up to 256 functions per bus number. Requires ARI support in the device and ARI Forwarding enabled in the downstream port (Root Port or switch downstream port) directly above it.
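The BDF packing above is easy to get wrong by one shift, so here is a small sketch of the conversions. The helper names are made up for illustration; the text format parsed is the `ssss:bb:dd.f` notation shown earlier.

```python
def bdf_to_rid(bus, dev, fn):
    """Pack Bus:Device:Function into the 16-bit Requester/Completer ID."""
    if not (0 <= bus <= 255 and 0 <= dev <= 31 and 0 <= fn <= 7):
        raise ValueError("BDF field out of range")
    return (bus << 8) | (dev << 3) | fn


def rid_to_bdf(rid):
    """Unpack a 16-bit ID into (bus, device, function)."""
    return (rid >> 8) & 0xFF, (rid >> 3) & 0x1F, rid & 0x7


def rid_to_ari(rid):
    """With ARI, the low 8 bits form a single function number (0-255)."""
    return (rid >> 8) & 0xFF, rid & 0xFF


def parse_bdf(text):
    """Parse '[ssss:]bb:dd.f' with hex fields into (segment, bus, dev, fn)."""
    parts = text.split(":")
    segment = int(parts[0], 16) if len(parts) == 3 else 0
    bus = int(parts[-2], 16)
    dev_text, fn_text = parts[-1].split(".")
    return segment, bus, int(dev_text, 16), int(fn_text, 16)


if __name__ == "__main__":
    _, bus, dev, fn = parse_bdf("0000:03:1f.2")
    print(hex(bdf_to_rid(bus, dev, fn)))
```

The same 16-bit value decodes as device 31, function 2 without ARI and as function 0xFA with ARI, which is why both ends of the link must agree on the interpretation.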
Link Architecture
A PCIe link connects two ports and consists of 1 to 32 lanes:
Port A Port B
┌──────────────┐ ┌──────────────┐
│ Lane 0 TX ───┼──── D+ D- ───────►│ Lane 0 RX │
│ Lane 0 RX ◄──┼──── D+ D- ────────┤ Lane 0 TX │
│ │ │ │
│ Lane 1 TX ───┼──── D+ D- ───────►│ Lane 1 RX │
│ Lane 1 RX ◄──┼──── D+ D- ────────┤ Lane 1 TX │
│ │ │ │
│ ... │ │ ... │
│ │ │ │
│ Lane N TX ───┼──── D+ D- ───────►│ Lane N RX │
│ Lane N RX ◄──┼──── D+ D- ────────┤ Lane N TX │
└──────────────┘ └──────────────┘
Each lane = 1 differential pair TX + 1 differential pair RX
Full duplex — simultaneous transmit and receive
Common widths: x1, x2, x4, x8, x16 (x32 rare, only in some server interconnects).
Data is striped across lanes at the byte level. For a x4 link, byte 0 goes on lane 0, byte 1 on lane 1, byte 2 on lane 2, byte 3 on lane 3, byte 4 on lane 0, etc. This distributes the payload across all lanes for maximum bandwidth.
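As a rough illustration of byte striping, the sketch below distributes a byte stream round-robin over the lanes and reassembles it. It ignores framing symbols and ordered sets, and the function names are illustrative only.

```python
def stripe(data, lanes):
    """Byte i of the stream goes to lane i % lanes."""
    return [data[lane::lanes] for lane in range(lanes)]


def unstripe(per_lane):
    """Inverse of stripe(): interleave the per-lane streams again."""
    lanes = len(per_lane)
    total = sum(len(stream) for stream in per_lane)
    return bytes(per_lane[i % lanes][i // lanes] for i in range(total))


if __name__ == "__main__":
    payload = bytes(range(8))
    for lane, stream in enumerate(stripe(payload, 4)):
        print("lane", lane, list(stream))
```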
Lane reversal: During link training, the two ports may discover that lane 0 on one side connects to lane N on the other (PCB routing convenience). The hardware can reverse the lane numbering to compensate.
Polarity inversion: If D+ and D- of a differential pair are swapped, the receiver can invert the polarity. Detected and corrected during link training.
PCIe Protocol Stack
┌─────────────────────────────────────────────┐
│ Software Layer │
│ (Device driver, OS, application) │
├─────────────────────────────────────────────┤
│ Transaction Layer │
│ TLPs: MRd, MWr, CfgRd, CfgWr, │
│ Cpl, CplD, Msg │
│ Flow control, ordering, virtual channels │
├─────────────────────────────────────────────┤
│ Data Link Layer │
│ DLLPs: Ack/Nak, FC update, PM │
│ Sequence numbers, CRC, retry │
├─────────────────────────────────────────────┤
│ Physical Layer │
│ Electrical signaling, 8b/10b or │
│ 128b/130b encoding, LTSSM, │
│ lane training, equalization │
├─────────────────────────────────────────────┤
│ Electrical/Mechanical │
│ Differential pairs, connectors, │
│ impedance, voltage levels │
└─────────────────────────────────────────────┘
3. Transaction Layer — TLPs
The Transaction Layer is the heart of PCIe. All communication between devices occurs via Transaction Layer Packets (TLPs).
TLP Categories
TLPs are classified by their semantics:
Category     TLP Types                    Requires Completion?   Ordering
───────────────────────────────────────────────────────────────────────────
Posted       Memory Write (MWr)           No                     Strong
             Message (Msg/MsgD)           No                     Strong
Non-Posted   Memory Read (MRd)            Yes (CplD)             Strong
             I/O Read (IORd)              Yes (CplD)             Strong
             I/O Write (IOWr)             Yes (Cpl)              Strong
             Config Read (CfgRd)          Yes (CplD)             Strong
             Config Write (CfgWr)         Yes (Cpl)              Strong
Completion   Completion (Cpl)             N/A                    Relaxed*
             Completion w/ Data (CplD)    N/A                    Relaxed*
Posted: Fire-and-forget. The sender does not wait for acknowledgment from the receiver. Memory writes are posted for performance — the sender can continue immediately. Ordering guarantee: posted requests are delivered in order.
Non-Posted: Request-response pair. The sender must wait for a completion. Memory reads, I/O operations, and configuration operations are all non-posted. Each non-posted request is tracked by a Tag, and the requester maintains a pending request table.
Completion: Response to a non-posted request. Carries data (CplD) for reads, or just status (Cpl) for writes.
TLP Header Format
Every TLP starts with a 3-DW (12-byte) or 4-DW (16-byte) header:
TLP Packet Structure:
┌─────────────────────────────────────────────────────────────┐
│ (Optional) TLP Prefix(es), 4 bytes each, PCIe 2.1 and later │
├─────────────────────────────────────────────────────────────┤
│ Header: 3 DW (12 bytes) or 4 DW (16 bytes) │
├─────────────────────────────────────────────────────────────┤
│ Data Payload: 0 to 1024 DW (0 to 4096 bytes) │
├─────────────────────────────────────────────────────────────┤
│ (Optional) ECRC: 4 bytes (end-to-end CRC) │
└─────────────────────────────────────────────────────────────┘
Note: The Data Link Layer adds a 2-byte sequence number prefix
and a 4-byte LCRC suffix (not part of the TLP itself).
Full packet on wire:
┌──────┬───────────────────────────────────┬──────┐
│ Seq# │ TLP (header + data + ECRC) │ LCRC │
│ 2B │ │ 4B │
└──────┴───────────────────────────────────┴──────┘
DW0 — Common Header (First Doubleword)
Bits    Field     Width   Meaning
─────   ───────   ─────   ─────────────────────────────────────────
31:29   Fmt       3       Header size and whether data is present
28:24   Type      5       TLP type (interpreted together with Fmt)
23      R         1       Reserved (T9 when 10-bit tags are enabled)
22:20   TC        3       Traffic Class
19      R         1       Reserved (T8 when 10-bit tags are enabled)
18      Attr[2]   1       ID-based Ordering (IDO)
17      R         1       Reserved
16      TH        1       TLP Processing Hints present
15      TD        1       TLP Digest (ECRC) present
14      EP        1       Poisoned data
13:12   Attr      2       Relaxed Ordering (bit 13), No Snoop (bit 12)
11:10   AT        2       Address Type (used by ATS)
9:0     Length    10      Payload length in DWs
Fmt[2:0]:
000 = 3 DW header, no data
001 = 4 DW header, no data
010 = 3 DW header, with data
011 = 4 DW header, with data
100 = TLP Prefix
(Bit 2 is set only for TLP Prefixes, so Fmt is often abbreviated
to its low two bits.)
Type[4:0] combined with Fmt:
Fmt   Type    TLP Type
───   ─────   ──────────────────────────────────────────────
000   00000   Memory Read Request (MRd), 32-bit address
001   00000   Memory Read Request (MRd), 64-bit address
000   00001   Memory Read Lock Request (MRdLk), 32-bit (deprecated)
001   00001   Memory Read Lock Request (MRdLk), 64-bit (deprecated)
010   00000   Memory Write Request (MWr), 32-bit address
011   00000   Memory Write Request (MWr), 64-bit address
000   00010   I/O Read Request (IORd)
010   00010   I/O Write Request (IOWr)
000   00100   Config Read Type 0 (CfgRd0)
010   00100   Config Write Type 0 (CfgWr0)
000   00101   Config Read Type 1 (CfgRd1)
010   00101   Config Write Type 1 (CfgWr1)
001   10000   Message Request (Msg), routed to Root Complex
001   10001   Message Request (Msg), routed by address
001   10010   Message Request (Msg), routed by ID
001   10011   Message Request (Msg), broadcast from Root Complex
001   10100   Message Request (Msg), local, terminate at receiver
001   10101   Message Request (Msg), gathered and routed to Root Complex
011   10rrr   Message Request with Data (MsgD), same routing codes
000   01010   Completion (Cpl), no data
010   01010   Completion with Data (CplD)
000   01011   Completion for Locked Memory Read (CplLk)
010   01011   Completion for Locked Memory Read with Data (CplDLk)
TC[2:0]: Traffic Class (0-7). TC0 is default. Higher TCs map to
higher-priority Virtual Channels if configured.
Attr[2]: IDO (ID-based Ordering), PCIe 2.1 and later. Allows completions to
pass posted writes if they have different requester IDs.
Attr[1:0]:
Bit 1 = Relaxed Ordering (RO). If set, this TLP may pass
previously queued TLPs (relaxes strict ordering).
Bit 0 = No Snoop (NS). If set, system hardware is not required
to snoop CPU caches for this transaction.
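TH: TLP Processing Hints present (PCIe 2.1 and later). If set, the
request carries a Processing Hint (PH[1:0]) in the two low bits of
the address DW and a Steering Tag in the header (the Tag field for
memory writes, the byte-enable fields for memory reads). The hints
guide cache allocation at the completer.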
TD: TLP Digest (ECRC) present. If set, 4-byte ECRC appended.
EP: Poisoned. If set, the data payload is known to be corrupted.
Receiver should accept the TLP but flag an error.
Length[9:0]: Data payload length in DWs (doublewords = 4 bytes).
The encoding 0 means 1024 DWs.
Valid range: 1-1024 DWs (4-4096 bytes).
For requests without data (reads), this is the requested length.
Must not exceed Max Payload Size (MPS) for writes or
Max Read Request Size (MRRS) for reads.
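The following sketch packs and unpacks DW0. It assumes the PCIe 2.1+ bit positions from the base specification: Fmt in bits 31:29, Type in 28:24, TC in 22:20, Attr[2] in bit 18, TH in 16, TD in 15, EP in 14, Attr[1:0] in 13:12, AT in 11:10, and Length in 9:0. The function names and the dictionary keys are illustrative, not part of any real API.

```python
def pack_dw0(fmt, tlp_type, length_dw, tc=0, attr=0, th=0, td=0, ep=0, at=0):
    """Build TLP header DW0. attr is the 3-bit Attr[2:0] value."""
    if not 1 <= length_dw <= 1024:
        raise ValueError("length must be 1..1024 DW")
    return (((fmt & 0x7) << 29)
            | ((tlp_type & 0x1F) << 24)
            | ((tc & 0x7) << 20)
            | (((attr >> 2) & 0x1) << 18)   # Attr[2] = IDO
            | ((th & 0x1) << 16)
            | ((td & 0x1) << 15)
            | ((ep & 0x1) << 14)
            | ((attr & 0x3) << 12)          # Attr[1:0] = RO, NS
            | ((at & 0x3) << 10)
            | (length_dw & 0x3FF))          # 1024 wraps to the encoding 0


def unpack_dw0(dw0):
    """Split DW0 back into named fields."""
    length = dw0 & 0x3FF
    return {
        "fmt": (dw0 >> 29) & 0x7,
        "type": (dw0 >> 24) & 0x1F,
        "tc": (dw0 >> 20) & 0x7,
        "attr": (((dw0 >> 18) & 0x1) << 2) | ((dw0 >> 12) & 0x3),
        "th": (dw0 >> 16) & 0x1,
        "td": (dw0 >> 15) & 0x1,
        "ep": (dw0 >> 14) & 0x1,
        "at": (dw0 >> 10) & 0x3,
        "length_dw": length or 1024,        # encoding 0 means 1024 DW
    }


if __name__ == "__main__":
    # 3DW Memory Write carrying one DW of data.
    print(hex(pack_dw0(fmt=0b010, tlp_type=0b00000, length_dw=1)))
```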
Memory Request Header (3DW — 32-bit address)
DW0: [Fmt|Type|R|TC|R|Attr|R|TH|TD|EP|Attr|Length] (as above)
DW1:
Bit 31 16 15 8 7 0
┌─────────────────────────┬──────────────┬────────────────┐
│ Requester ID │ Tag │ Last BE|1st BE │
│ (Bus:Dev:Func) │ (8 bits) │ [3:0] [3:0] │
│ 16 bits │ │ │
└─────────────────────────┴──────────────┴────────────────┘
DW2:
Bit 31 2 1 0
┌────────────────────────────────────────────────┬───┐
│ Address [31:2] │ R │
│ (30 bits, DW-aligned) │ │
└────────────────────────────────────────────────┴───┘
Memory Request Header (4DW — 64-bit address)
DW0: [Fmt|Type|R|TC|R|Attr|R|TH|TD|EP|Attr|Length]
DW1:
┌─────────────────────────┬──────────────┬────────────────┐
│ Requester ID │ Tag │ Last BE|1st BE │
└─────────────────────────┴──────────────┴────────────────┘
DW2:
┌────────────────────────────────────────────────────────┐
│ Address [63:32] (upper 32 bits) │
└────────────────────────────────────────────────────────┘
DW3:
┌────────────────────────────────────────────────┬───┐
│ Address [31:2] │ R │
└────────────────────────────────────────────────┴───┘
Byte Enables: 4-bit masks specifying which bytes within the first and last DW are valid.
1st DW BE[3:0]: Byte enables for the first DW of data. Bit 0 = byte at lowest address.
Last DW BE[3:0]: Byte enables for the last DW. All zeros if Length == 1 DW.
Example: Writing bytes at offset 0x1001-0x1002 (2 bytes starting at offset 1 within a DW):
- Address = 0x1000 (DW-aligned), Length = 1
- 1st DW BE = 0b0110 (bytes 1 and 2 valid, bytes 0 and 3 not)
- Last DW BE = 0b0000 (unused, Length=1)
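The example generalizes to any contiguous byte range. The sketch below derives the DW-aligned address, the Length field, and both byte-enable masks; the function name is hypothetical, and zero-length requests are not handled.

```python
def byte_enables(addr, nbytes):
    """Return (aligned_addr, length_dw, first_be, last_be) for a byte range."""
    if nbytes < 1:
        raise ValueError("nbytes must be at least 1")
    aligned = addr & ~0x3
    end = addr + nbytes                       # one past the last byte
    length_dw = (end - aligned + 3) // 4
    first_be = (0xF << (addr & 0x3)) & 0xF    # enable bytes from the start offset up
    last_mask = 0xF >> (3 - ((end - 1) & 0x3))  # enable bytes up to the end offset
    if length_dw == 1:
        # Single DW: both limits apply to the first BE; the last BE is all zeros.
        return aligned, 1, first_be & last_mask, 0
    return aligned, length_dw, first_be, last_mask


if __name__ == "__main__":
    addr, length, first, last = byte_enables(0x1001, 2)
    print(hex(addr), length, format(first, "04b"), format(last, "04b"))
```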
Tag field: Uniquely identifies outstanding non-posted requests from a given requester. Baseline: 5-bit tag (32 outstanding requests). With the Extended Tag Field enabled in Device Control: 8-bit tag (256 outstanding). PCIe 4.0 added 10-bit tags, which require 10-Bit Tag support in both requester and completer. Phantom Functions can further expand the tag space by using unused function bits.
Requester ID: The BDF of the function that originated the request. Used to route completions back.
Completion Header
DW0: [Fmt=00/10|Type=01010|R|TC|R|Attr|R|TH|TD|EP|Attr|Length]
DW1:
Bit 31                    16 15     13  12   11              0
┌─────────────────────────┬─────────┬─────┬──────────────────┐
│ Completer ID            │ Cpl     │ BCM │ Byte Count       │
│ (Bus:Dev:Func)          │ Status  │     │ [11:0]           │
│ 16 bits                 │ [2:0]   │     │                  │
└─────────────────────────┴─────────┴─────┴──────────────────┘
Completion Status:
000 = Successful Completion (SC)
001 = Unsupported Request (UR)
010 = Config Request Retry Status (CRS) — device not ready
100 = Completer Abort (CA)
Byte Count[11:0]: Remaining bytes to be transferred (for
split completions, multiple CplD packets may be needed).
BCM: Byte Count Modified — set by PCI-X bridges.
DW2:
Bit 31 16 15 8 7 0
┌─────────────────────────┬──────────────┬────────────────┐
│ Requester ID │ Tag │ R |Lower Addr │
│ (original requester) │ (8 bits) │[7] [6:0] │
└─────────────────────────┴──────────────┴────────────────┘
Lower Address[6:0]: Lower bits of the byte address for
the first enabled byte of data in this completion.
Used by the requester to place data correctly.
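To see how Byte Count and Lower Address evolve across a split completion, the sketch below breaks one read into several CplD packets. It assumes the completer ends each packet at the next MPS-aligned address, which is one legal strategy when MPS is a multiple of the Read Completion Boundary; real completers may split differently. Sizes are tracked in bytes for simplicity and the function name is illustrative.

```python
def split_completions(addr, nbytes, mps=128):
    """Return a list of (lower_address, byte_count, payload_bytes) per CplD."""
    completions = []
    remaining = nbytes
    while remaining > 0:
        room = mps - (addr % mps)          # bytes up to the next MPS-aligned boundary
        chunk = min(room, remaining)
        # Byte Count is what is still owed, including this completion.
        completions.append((addr & 0x7F, remaining, chunk))
        addr += chunk
        remaining -= chunk
    return completions


if __name__ == "__main__":
    for lower, count, size in split_completions(0x1010, 256):
        print("lower_addr=0x%02x byte_count=%d payload=%d" % (lower, count, size))
```

The requester can detect the final completion because its Byte Count equals its own payload size.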
Configuration Request Header
DW0: [Fmt=00/10|Type=00100(T0)/00101(T1)|...|Length=1]
DW1:
┌─────────────────────────┬──────────────┬────────────────┐
│ Requester ID │ Tag │ Last BE|1st BE │
└─────────────────────────┴──────────────┴────────────────┘
DW2:
Bit 31    24 23   19 18   16 15    12 11      8 7        2 1 0
┌──────────┬───────┬───────┬────────┬─────────┬──────────┬───┐
│ Bus      │ Dev   │ Func  │ Rsvd   │ Ext Reg │ Reg      │ R │
│ [7:0]    │ [4:0] │ [2:0] │        │ [3:0]   │ [5:0]    │   │
└──────────┴───────┴───────┴────────┴─────────┴──────────┴───┘
Ext Reg#[3:0] and Reg#[5:0] together form a 10-bit DW register
number, addressing the full 4096-byte configuration space
(4096/4 = 1024 DW offsets). The byte offset is
(Ext Reg# << 8) | (Reg# << 2).
Standard config space: Ext Reg# = 0, Reg# addresses
offsets 0x00-0xFC (256 bytes).
Extended config space: offsets 0x100-0xFFC.
Type 0 vs Type 1 configuration requests:
- Type 0: Targets the device on the current bus segment. Consumed by the endpoint.
- Type 1: Forwarded by bridges. When a bridge receives a Type 1 CfgRd/CfgWr, it checks if the target bus number matches its secondary bus. If so, it converts to Type 0 and forwards. If the target bus is behind a subordinate bridge, it forwards as Type 1.
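The bridge decision for a Type 1 configuration request can be sketched as a comparison against the secondary and subordinate bus numbers. The function name and the returned strings are illustrative.

```python
def route_type1_config(target_bus, secondary, subordinate):
    """Decide what a bridge does with a Type 1 config request."""
    if target_bus == secondary:
        # Target sits directly below this bridge.
        return "convert to Type 0"
    if secondary < target_bus <= subordinate:
        # Target is behind another bridge further downstream.
        return "forward as Type 1"
    return "not claimed"


if __name__ == "__main__":
    # Bridge with secondary bus 2 and subordinate bus 5.
    for bus in (1, 2, 4, 6):
        print(bus, route_type1_config(bus, secondary=2, subordinate=5))
```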
Message TLP Header
DW0: [Fmt=01/11|Type=10rrr|...|Length]
rrr = routing subfield (by address, by ID, broadcast, etc.)
DW1:
┌─────────────────────────┬──────────────┬────────────────┐
│ Requester ID │ Tag │ Message Code │
│ │ │ [7:0] │
└─────────────────────────┴──────────────┴────────────────┘
DW2-DW3: Depends on message code and routing type.
For INTx: message code = 0x20 (Assert_INTA) through 0x23 (Assert_INTD)
or 0x24 (Deassert_INTA) through 0x27 (Deassert_INTD)
For PME: message code = 0x18 (PM_PME), 0x19 (PME_Turn_Off), 0x1B (PME_TO_Ack)
For Error: 0x30 (ERR_COR), 0x31 (ERR_NONFATAL), 0x33 (ERR_FATAL)
For vendor-defined: 0x7E (Vendor_Defined Type 0), 0x7F (Vendor_Defined Type 1)
TLP Processing Rules
When a PCIe port receives a TLP, it must decide whether to accept, forward, or reject it:
- Endpoint receives a Memory/IO TLP: Compare the address against all enabled BARs. If the address falls within a BAR range, accept. Otherwise, send Unsupported Request (UR) completion (for non-posted) or silently discard (for posted writes, with error logging).
- Bridge/Switch receives a TLP: Compare address against configured memory and I/O windows (Memory Base/Limit, I/O Base/Limit, Prefetchable Memory Base/Limit). If the address falls within a downstream window, forward downstream. If it falls outside all downstream windows, forward upstream (toward RC). Bridges also check secondary bus number range for config TLPs.
- Root Complex receives a TLP: Memory addresses targeting system DRAM are consumed by the RC (DMA reads/writes). Completions are routed to the originating function by Requester ID. Messages (INTx, PME, error) are handled by the RC's interrupt and error logic.
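Address routing in a switch amounts to a window lookup. The sketch below models each downstream port as a list of inclusive address windows; the class, the port names, and the treatment of a request that arrives from upstream and matches nothing (reported here as an Unsupported Request) are simplifying assumptions.

```python
from dataclasses import dataclass


@dataclass
class Window:
    base: int
    limit: int  # inclusive, as in the bridge Base/Limit registers

    def contains(self, addr):
        return self.base <= addr <= self.limit


def route_memory_tlp(addr, downstream, arrived_on):
    """Return the egress port name for a memory TLP.

    downstream maps a port name to its list of Window objects.
    arrived_on is the ingress port name, or "upstream".
    """
    for port, windows in downstream.items():
        if any(w.contains(addr) for w in windows):
            return port
    # No downstream window matched.
    if arrived_on == "upstream":
        return "unsupported request"
    return "upstream"


if __name__ == "__main__":
    ports = {
        "dp0": [Window(0xE000_0000, 0xE00F_FFFF)],
        "dp1": [Window(0xE010_0000, 0xE01F_FFFF)],
    }
    print(route_memory_tlp(0xE010_4000, ports, "upstream"))
    print(route_memory_tlp(0x8000_0000, ports, "dp0"))
```

The second call models a DMA write from an endpoint to system memory: it matches no downstream window and therefore heads toward the Root Complex.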
Poisoned TLPs
A poisoned TLP (EP bit set in DW0) indicates the data payload is corrupted. This can occur due to:
- Uncorrectable memory ECC error during DMA read
- Data integrity error detected at an intermediate point
The receiver must:
- Accept the TLP (don't reject it)
- Log an error (poisoned TLP received = uncorrectable non-fatal error)
- Use the corrupted data or discard it (implementation-defined)
- NOT generate a completion abort — the TLP was delivered
Ordering Rules
PCIe ordering rules ensure producer-consumer correctness:
Can the row TLP pass a column TLP that was issued earlier?
Row \ Column   Posted   Non-Posted   Completion
────────────────────────────────────────────────
Posted         No(1)    Yes          Yes
Non-Posted     No(2)    Y/N(3)       Y/N(3)
Completion     No(5)    Yes          No(4)
(1) Posted requests are strongly ordered: P2 cannot pass P1.
This ensures writes arrive in program order.
(2) Non-posted requests cannot pass posted requests.
A read issued after a write must see the write's effect.
(3) Y/N = no requirement. A non-posted request may or may not pass
earlier non-posted requests or completions; software must not
rely on either behavior.
(4) Completions for the same request are ordered (multi-part
completion for a large read). Completions for different
requests may be reordered.
(5) A completion cannot pass an earlier posted write, so a read
completion pushes preceding writes ahead of it. Posted requests
and completions must be able to pass stalled non-posted requests
(the "Yes" entries) to avoid deadlock.
With Relaxed Ordering (RO) attribute set:
- Memory writes with RO set can pass other memory writes
- Read completions with RO set (copied from the read request) can pass posted writes
- Useful for bulk DMA where ordering doesn't matter
With ID-based Ordering (IDO):
- Transactions with different Requester IDs can pass each other
- Useful for SR-IOV where VF traffic shouldn't block PF traffic
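A table-driven check can make the passing rules concrete. The sketch below encodes a simplified reading of the ordering table in the PCIe Base Specification, where a row TLP is asked whether it may pass an earlier column TLP: "yes" means passing must be allowed, "no" means it is forbidden, and "optional" means the spec imposes no requirement. IDO is not modeled, and all names are illustrative.

```python
P, NP, CPL = "posted", "non-posted", "completion"

# (later TLP, earlier TLP) -> answer without Relaxed Ordering
BASE_RULES = {
    (P, P): "no",    (P, NP): "yes",         (P, CPL): "yes",
    (NP, P): "no",   (NP, NP): "optional",   (NP, CPL): "optional",
    (CPL, P): "no",  (CPL, NP): "yes",       (CPL, CPL): "optional",
}


def may_pass(later, earlier, relaxed_ordering=False, same_request=False):
    """Can `later` be delivered ahead of `earlier`?"""
    if later == CPL and earlier == CPL and same_request:
        return "no"  # parts of one split completion stay in order
    if relaxed_ordering and earlier == P and later in (P, CPL):
        return "optional"  # RO lifts the write-ordering restriction
    return BASE_RULES[(later, earlier)]


if __name__ == "__main__":
    print(may_pass(NP, P))                          # read behind a write
    print(may_pass(CPL, P, relaxed_ordering=True))  # RO completion
```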
Credit-Based Flow Control
PCIe uses a credit-based flow control mechanism to prevent buffer overflow. Each receiver advertises how many TLPs it can accept, measured in credit units.
Six credit types (tracked independently):
1. Posted Header (PH) — headers for memory writes, messages
2. Posted Data (PD) — data payloads for posted TLPs
3. Non-Posted Header (NPH) — headers for reads, config, I/O ops
4. Non-Posted Data (NPD) — data payloads for non-posted TLPs
5. Completion Header (CplH) — headers for completions
6. Completion Data (CplD) — data payloads for completions
Credit units:
Header credit = 1 per TLP header (regardless of 3DW or 4DW)
Data credit = 1 per 4 DW (16 bytes) of payload, rounded up
Example: 64-byte payload = 64/16 = 4 data credits
Example: 20-byte payload = ceil(20/16) = 2 data credits
Transmitter maintains:
credits_consumed[type] — running counter of credits used
Receiver advertises:
credits_limit[type] — maximum credits available
Transmission rule:
Can transmit only if: credits_consumed + cost <= credits_limit
(using modular arithmetic with the counter width)
Flow:
1. During link initialization, receiver sends InitFC1/InitFC2
DLLPs advertising initial credit limits for all 6 types.
2. Transmitter starts with credits_consumed = 0.
3. Each TLP sent increments credits_consumed.
4. Receiver processes TLPs, frees buffer space, sends
UpdateFC DLLPs with new credits_limit values.
5. Transmitter updates its credits_limit and can send more.
IMPORTANT: Endpoints MUST advertise infinite credits for
completion headers and completion data. This prevents deadlock
— a device that issued a read must always be able to accept
the completion, even if its buffers are full of other traffic.
Infinite credit = a credits_limit value that never stops the
transmitter.
Switches/bridges can have finite completion credits but must
have enough to handle all outstanding non-posted requests
that might generate completions.
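The transmit gate described above can be sketched with modular counters. The check used here follows the form given in the PCIe Base Specification, (limit - (consumed + cost)) mod 2^width <= 2^width / 2, which stays correct when the counters wrap. The class name and the default 12-bit width (the data credit field size) are assumptions for this example.

```python
def data_credits(payload_bytes):
    """One data credit covers 16 bytes (4 DW), rounded up."""
    return -(-payload_bytes // 16)


class CreditGate:
    """Tracks one credit type (for example Posted Data) on the transmit side."""

    def __init__(self, limit, width=12, infinite=False):
        self.modulus = 1 << width
        self.limit = limit % self.modulus
        self.consumed = 0
        self.infinite = infinite

    def can_send(self, cost):
        if self.infinite:
            return True
        headroom = (self.limit - (self.consumed + cost)) % self.modulus
        return headroom <= self.modulus // 2

    def send(self, cost):
        if not self.can_send(cost):
            raise RuntimeError("blocked: not enough credits")
        self.consumed = (self.consumed + cost) % self.modulus

    def update(self, new_limit):
        """Apply the limit carried by an UpdateFC DLLP."""
        self.limit = new_limit % self.modulus


if __name__ == "__main__":
    gate = CreditGate(limit=8)
    gate.send(data_credits(64))
    gate.send(data_credits(64))
    print(gate.can_send(1))   # blocked until an UpdateFC arrives
    gate.update(12)
    print(gate.can_send(4))
```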
Max Payload Size (MPS) and Max Read Request Size (MRRS)
MPS (Max Payload Size):
Configured in PCIe Device Control register (bits 7:5).
Values: 128, 256, 512, 1024, 2048, 4096 bytes.
Applies to the DATA PAYLOAD of any TLP (read completion data
or write data). A TLP's data payload must not exceed MPS.
Negotiation: Each device advertises its Max_Payload_Size_Supported
in Device Capabilities. System software sets MPS to the minimum
of all devices in the hierarchy (to ensure all switches and
bridges can handle the payload).
Common default: 128 bytes (conservative).
NVMe SSDs: often support 512 or 256.
GPUs: often support 256.
MRRS (Max Read Request Size):
Configured in PCIe Device Control register (bits 14:12).
Values: 128, 256, 512, 1024, 2048, 4096 bytes.
Limits the size of a single memory read request.
Large reads may be split into multiple completions.
MRRS can be set independently per device (unlike MPS).
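A short sketch of how these two settings are decoded and applied. The field positions are the Device Control bits named above; the rule that a single request must not cross a 4 KB boundary comes from the PCIe Base Specification rather than from this section, and the function names are illustrative.

```python
def decode_device_control(devctl):
    """Extract MPS and MRRS in bytes from the 16-bit Device Control value."""
    return {
        "mps": 128 << ((devctl >> 5) & 0x7),     # bits 7:5
        "mrrs": 128 << ((devctl >> 12) & 0x7),   # bits 14:12
    }


def hierarchy_mps(supported_sizes):
    """MPS for a hierarchy is the smallest value any device supports."""
    return min(supported_sizes)


def split_read(addr, nbytes, mrrs):
    """Split a DMA read into requests of at most MRRS bytes each."""
    requests = []
    while nbytes > 0:
        to_4k = 4096 - (addr % 4096)   # a request must not cross a 4 KB boundary
        chunk = min(mrrs, nbytes, to_4k)
        requests.append((addr, chunk))
        addr += chunk
        nbytes -= chunk
    return requests


if __name__ == "__main__":
    print(decode_device_control(0x2810))
    print(hierarchy_mps([512, 256, 128]))
    for req_addr, req_len in split_read(0x0F80, 1024, 512):
        print(hex(req_addr), req_len)
```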
4. Data Link Layer — DLLPs
The Data Link Layer provides reliable delivery of TLPs between adjacent PCIe ports using sequence numbers, CRC, and a retry mechanism.
DLLP Format
DLLP packet (6 bytes; 8 symbols on the wire at Gen 1-2 with SDP/END framing):
┌──────────────┬────────────────────────────┬──────────┐
│ DLLP Type │ Type-specific fields │ CRC-16 │
│ (8 bits) │ (24 bits) │ (16 bits)│
└──────────────┴────────────────────────────┴──────────┘
▲
CRC covers DLLP Type
+ type-specific fields
DLLP Types:
Type byte Name Purpose
───────── ──── ───────
0000_0000 Ack Acknowledge TLPs up to seq#
0001_0000 Nak Request retransmission from seq#
0010_0000 PM_Enter_L1 Power management
0010_0001 PM_Enter_L23 Power management
0010_0011 PM_Active_State_Req_L1 Power management
0010_0100 PM_Request_Ack Power management
0011_0000 Vendor-specific Implementation-defined
0100_xxxx InitFC1-P Flow control init (Posted)
0101_xxxx InitFC1-NP Flow control init (Non-Posted)
0110_xxxx InitFC1-Cpl Flow control init (Completion)
1100_xxxx InitFC2-P Flow control init phase 2
1101_xxxx InitFC2-NP Flow control init phase 2
1110_xxxx InitFC2-Cpl Flow control init phase 2
1000_xxxx UpdateFC-P Flow control update (Posted)
1001_xxxx UpdateFC-NP Flow control update (Non-Posted)
1010_xxxx UpdateFC-Cpl Flow control update (Completion)
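The type byte in the table above can be decoded with a lookup on the full byte for the fixed codes and on the upper nibble for the flow control codes. The sketch assumes the `xxxx` bits are `0vvv`, with the Virtual Channel ID in the low three bits, as in the PCIe Base Specification; the function name is illustrative.

```python
FIXED_TYPES = {
    0x00: "Ack",
    0x10: "Nak",
    0x20: "PM_Enter_L1",
    0x21: "PM_Enter_L23",
    0x23: "PM_Active_State_Req_L1",
    0x24: "PM_Request_Ack",
    0x30: "Vendor-specific",
}

FLOW_CONTROL_TYPES = {
    0x4: "InitFC1-P", 0x5: "InitFC1-NP", 0x6: "InitFC1-Cpl",
    0xC: "InitFC2-P", 0xD: "InitFC2-NP", 0xE: "InitFC2-Cpl",
    0x8: "UpdateFC-P", 0x9: "UpdateFC-NP", 0xA: "UpdateFC-Cpl",
}


def decode_dllp_type(type_byte):
    """Return (name, vc_id); vc_id is None for non flow control DLLPs."""
    if type_byte in FIXED_TYPES:
        return FIXED_TYPES[type_byte], None
    upper = type_byte >> 4
    if upper in FLOW_CONTROL_TYPES:
        return FLOW_CONTROL_TYPES[upper], type_byte & 0x7  # assumed 0vvv layout
    return "Reserved", None


if __name__ == "__main__":
    for value in (0x00, 0x10, 0x40, 0x91, 0x7F):
        print(hex(value), decode_dllp_type(value))
```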
Ack/Nak DLLP Format
Byte 0          Bytes 1-3
┌─────────────┬────────────────────────┬─────────────────┐
│ Ack (0x00)  │ Reserved               │ AckNak_Seq#     │
│ Nak (0x10)  │ (12 bits)              │ [11:0]          │
└─────────────┴────────────────────────┴─────────────────┘
AckNak_Seq#: The sequence number of the last TLP received correctly.
Ack: All TLPs with seq# <= AckNak_Seq# (modulo 4096) are acknowledged.
Nak: TLPs up to AckNak_Seq# are acknowledged; the transmitter
replays everything after it, starting at AckNak_Seq# + 1.
Flow Control DLLP Format
┌─────────────────────┬──────┬───────────┬──────┬─────────────┐
│ FC Type (InitFC1/   │ R    │ HdrFC     │ R    │ DataFC      │
│ InitFC2/UpdateFC    │      │ [7:0]     │      │ [11:0]      │
│ + P/NP/Cpl + VC ID) │ (2)  │ (8 bits)  │ (2)  │ (12 bits)   │
└─────────────────────┴──────┴───────────┴──────┴─────────────┘
HdrFC[7:0]: Header credits (number of TLP headers receiver can buffer)
DataFC[11:0]: Data credits (in units of 4 DW = 16 bytes)
The two reserved fields carry HdrScale and DataScale when Scaled
Flow Control is in use (PCIe 4.0 and later).
Infinite credits: an InitFC value of 0 in HdrFC or DataFC advertises
infinite credits for that type. UpdateFC DLLPs for an infinite
credit type are optional and also carry 0.
Ack/Nak Protocol and Retry
Transmitter                                    Receiver
│                                              │
│──── TLP (seq=0) ──────────────────────────►  │
│──── TLP (seq=1) ──────────────────────────►  │
│──── TLP (seq=2) ──────────────────────────►  │ CRC check
│                                              │ passes
│ ◄──────────────────────── Ack (seq=2) ────   │ (acks 0,1,2)
│                                              │
│──── TLP (seq=3) ──────────────────────────►  │ CRC FAILS
│──── TLP (seq=4) ──────────────────────────►  │ (discarded, out of sequence)
│                                              │
│ ◄──────────────────────── Nak (seq=2) ────   │ (last good = 2, replay from 3)
│                                              │
│ [replay buffer: resend seq=3,4]              │
│──── TLP (seq=3) ──────────────────────────►  │
│──── TLP (seq=4) ──────────────────────────►  │
│                                              │
│ ◄──────────────────────── Ack (seq=4) ────   │
Replay buffer: Transmitter keeps all sent TLPs until acknowledged.
Buffer depth determines max outstanding TLPs.
Replay timer: If no Ack or Nak arrives before the timer expires,
retransmit everything in the replay buffer.
Replay number rollover: REPLAY_NUM is a 2-bit counter. After four
consecutive replays without progress it rolls over, the link is
considered failed, and the Physical Layer retrains it.
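The transmitter side of this protocol can be sketched as an ordered buffer keyed by sequence number. The sketch assumes, per the PCIe Base Specification, that the sequence number in both Ack and Nak names the last TLP received correctly, so a Nak purges up to that number and replays the rest. Replay timer and REPLAY_NUM handling are left out, and the class name is illustrative.

```python
from collections import OrderedDict

SEQ_MODULUS = 4096  # 12-bit sequence numbers


class ReplayBuffer:
    def __init__(self):
        self.next_seq = 0
        self.pending = OrderedDict()  # seq -> TLP bytes, oldest first

    def send(self, tlp):
        """Assign the next sequence number and keep a copy for replay."""
        seq = self.next_seq
        self.pending[seq] = tlp
        self.next_seq = (seq + 1) % SEQ_MODULUS
        return seq

    def _purge_through(self, acked_seq):
        # Ignore sequence numbers that are not outstanding (already purged).
        if acked_seq not in self.pending:
            return
        while self.pending:
            seq, _ = self.pending.popitem(last=False)
            if seq == acked_seq:
                break

    def on_ack(self, seq):
        self._purge_through(seq)

    def on_nak(self, seq):
        """Purge what was received, return the TLPs to retransmit in order."""
        self._purge_through(seq)
        return list(self.pending.items())


if __name__ == "__main__":
    buf = ReplayBuffer()
    for i in range(5):
        buf.send(b"tlp%d" % i)
    buf.on_ack(2)
    print([seq for seq, _ in buf.on_nak(2)])
```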
LCRC (Link CRC): CRC-32 computed over the sequence number and the
entire TLP (header + data + ECRC if present). Appended by the
transmitter Data Link Layer, checked by the receiver Data Link Layer.
ECRC (End-to-end CRC): CRC-32 computed by the originator over
the TLP header + data. Survives across switches. Optional.
Checked by the final destination, not intermediate switches.
Enabled via AER capability.
Flow Control Initialization
Flow control initialization runs in the Data Link Layer once the
Physical Layer reports link-up (after link training, before normal
TLP traffic):
Phase 1 (InitFC1):
Both ports exchange InitFC1 DLLPs for all three types
(Posted, Non-Posted, Completion). Each InitFC1 carries the
receiver's initial credit limits.
A port sends InitFC1 continuously until it receives InitFC1
from the other side for all three types.
Phase 2 (InitFC2):
After receiving all InitFC1 DLLPs, switch to InitFC2.
InitFC2 carries the same credit values (confirmation).
After both sides complete InitFC2 exchange, the Data Link
Layer enters DL_Active state and TLP traffic can begin.
After initialization:
Credits are consumed by sending TLPs and replenished by
receiving UpdateFC DLLPs from the receiver.
UpdateFC DLLPs must be sent periodically (even if no credits
freed) to prevent the transmitter from stalling due to
counter rollover ambiguity.
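The two-phase handshake can be modeled as a three-state machine for one port. This is a simplified sketch of the description above: it leaves FC_INIT2 only on an InitFC2 DLLP, whereas real hardware may accept other evidence that the peer has advanced. The state and method names are hypothetical.

```python
class FlowControlInit:
    FC_TYPES = frozenset({"P", "NP", "Cpl"})

    def __init__(self):
        self.state = "FC_INIT1"
        self.limits = {}  # fc_type -> (header credits, data credits)

    def dllp_to_send(self):
        """Which init DLLP kind this port currently transmits, if any."""
        return {"FC_INIT1": "InitFC1", "FC_INIT2": "InitFC2"}.get(self.state)

    def receive(self, kind, fc_type, hdr_credits, data_credits):
        if self.state == "FC_INIT1" and kind in ("InitFC1", "InitFC2"):
            # Record the peer's advertised limits for this credit type.
            self.limits[fc_type] = (hdr_credits, data_credits)
            if set(self.limits) == self.FC_TYPES:
                self.state = "FC_INIT2"
        elif self.state == "FC_INIT2" and kind == "InitFC2":
            # The peer has also seen all three types; TLP traffic may start.
            self.state = "DL_ACTIVE"


if __name__ == "__main__":
    port = FlowControlInit()
    for fc_type in ("P", "NP", "Cpl"):
        port.receive("InitFC1", fc_type, 32, 256)
    print(port.state, port.dllp_to_send())
    port.receive("InitFC2", "P", 32, 256)
    print(port.state)
```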
5. Physical Layer
Electrical Signaling
Each PCIe lane is a pair of differential signals (D+ and D-):
TX RX
┌─────────┐ ┌─────────┐
│ D+ ──┼──── AC coupling ────►│ D+ │
│ D- ──┼──── capacitor ──────►│ D- │
└─────────┘ └─────────┘
Differential voltage swing (Gen 1-5):
800-1200 mV peak-to-peak differential
Common-mode voltage: 0 V (AC-coupled)
Impedance: 85 ohm differential, 42.5 ohm single-ended
AC coupling: Required on all lanes. Blocks DC component.
Allows each side to set its own common-mode voltage.
Capacitor value: typically 75-200 nF.
LTSSM (Link Training and Status State Machine)
The LTSSM controls link initialization, speed/width negotiation, and power management:
From            To              Trigger
─────────────   ─────────────   ─────────────────────────────────────────────
Reset           Detect          Power-on or fundamental reset
Detect          Polling         Receiver detected
Polling         Configuration   Bit lock and symbol lock (TS1/TS2 exchange)
Configuration   L0              Link# and Lane# agreed (lane/width negotiation)
L0              Recovery        Speed change, equalization, or link error
Recovery        L0              Bit/symbol lock re-established
L0              L0s             Transmitter idle (autonomous, no handshake)
L0s             L0              Traffic resumes
L0              L1              DLLP handshake completed
L1              Recovery        Exit from L1; the link returns to L0 via Recovery
L0              L2              Handshake before main power is removed
L2              Detect          Wake event
L2              L3              Power off (L3 is the unpowered condition)
Polling, Configuration, Recovery → Detect on link error or timeout
Key LTSSM states:
Detect: Port powers on, checks for receiver presence by
measuring impedance on each lane. If a receiver termination
(50 ohm to ground) is detected, proceed to Polling.
Polling: Exchange Training Sequences (TS1, TS2) to achieve:
- Bit lock (PLL locks to incoming data frequency)
- Symbol lock (8b/10b comma detection or 128b/130b sync header)
- Lane polarity detection and correction
Configuration: Negotiate link width and assign lane numbers.
Exchange TS1/TS2 with Link# and Lane# fields set.
Both sides agree on the active link width.
After agreement, transition to L0.
L0 (Active): Normal operating state. TLPs and DLLPs flow.
This is where useful work happens.
Recovery: Entered when:
- Speed change is needed (e.g., Gen 1 → Gen 3)
- Equalization required (Gen 3+)
- Link error recovery
Exchanges TS1/TS2 to re-establish bit/symbol lock.
After recovery, return to L0.
L0s: Low-power standby. Electrical idle on TX lanes.
Fast exit (~1 us). Entered autonomously by hardware.
No software involvement. Small power savings.
L1: Deeper low-power state. Both TX and RX quiesced.
Longer exit latency (~2-32 us). Requires DLLP handshake
to enter. Significant power savings.
L1.1/L1.2 (substates): Even deeper L1 power savings.
L1.1: Reference clock can be turned off.
L1.2: Common-mode voltage removed. Longest exit latency
but maximum power savings.
L2: Very low power. Only auxiliary power maintained.
Used for wake capability (PME#). Most of the link
circuitry is powered down.
L3: Link completely off. No power. Full power-on reset
required to return to Detect.
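For emulation work it can help to see the top-level states as a plain transition function. The sketch below is a deliberately simplified model: it ignores all LTSSM substates, and the event names (EV_LOCKED, EV_WAKE, and so on) are hypothetical labels invented for this example, not spec terms.

```c
#include <stdint.h>

/* Top-level LTSSM states only; substates are omitted in this sketch. */
enum ltssm_state {
    LT_DETECT, LT_POLLING, LT_CONFIG, LT_L0,
    LT_RECOVERY, LT_L0S, LT_L1, LT_L2
};

/* Hypothetical event names chosen for illustration. */
enum ltssm_event {
    EV_RECEIVER_DETECTED, EV_LOCKED, EV_LINK_CONFIGURED, EV_RETRAIN,
    EV_RECOVERY_DONE, EV_ENTER_L0S, EV_ENTER_L1, EV_ENTER_L2,
    EV_WAKE, EV_TIMEOUT
};

/* Returns the next state; unknown events leave the state unchanged. */
enum ltssm_state ltssm_next(enum ltssm_state s, enum ltssm_event e)
{
    switch (s) {
    case LT_DETECT:
        return e == EV_RECEIVER_DETECTED ? LT_POLLING : s;
    case LT_POLLING:
        if (e == EV_LOCKED)  return LT_CONFIG;
        if (e == EV_TIMEOUT) return LT_DETECT;
        return s;
    case LT_CONFIG:
        if (e == EV_LINK_CONFIGURED) return LT_L0;
        if (e == EV_TIMEOUT)         return LT_DETECT;
        return s;
    case LT_L0:
        if (e == EV_RETRAIN)   return LT_RECOVERY;
        if (e == EV_ENTER_L0S) return LT_L0S;
        if (e == EV_ENTER_L1)  return LT_L1;
        if (e == EV_ENTER_L2)  return LT_L2;
        return s;
    case LT_RECOVERY:
        if (e == EV_RECOVERY_DONE) return LT_L0;
        if (e == EV_TIMEOUT)       return LT_DETECT;
        return s;
    case LT_L0S:
        return e == EV_WAKE ? LT_L0 : s;        /* fast exit straight to L0 */
    case LT_L1:
        return e == EV_WAKE ? LT_RECOVERY : s;  /* L1 exits via Recovery */
    case LT_L2:
        return e == EV_WAKE ? LT_DETECT : s;    /* full retrain after L2 */
    }
    return s;
}
```

A VMM that only needs to report link state to the guest can often collapse this further (Detect, L0, and a "down" state), but keeping Recovery distinct makes the Link Training bit in Link Status easy to model.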
Ordered Sets
Ordered sets are special symbols transmitted during link training and maintenance:
TS1 (Training Sequence 1):
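16 symbols in both 8b/10b (Gen 1-2) and 128b/130b (Gen 3-5)
encoding; with 128b/130b they fill exactly one 130-bit block
(2-bit sync header plus 16 symbols).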
Contains: COM, Link#, Lane#, N_FTS (number of Fast
Training Sequences), Data Rate supported, Training
Control bits (hot reset, disable, loopback, etc.)
TS2 (Training Sequence 2):
Same format as TS1 but indicates the receiver has
successfully locked to TS1. Both sides sending TS2
means agreement.
SKIP ordered set:
Compensates for clock frequency differences between
TX and RX (clock tolerance compensation).
Sent periodically during L0 (every 1180-1538 symbol times with
8b/10b; every 370-375 blocks with 128b/130b).
EIEOS (Electrical Idle Exit Ordered Set):
Signals transition from electrical idle back to active.
Used when exiting L0s or L1.
SDS (Start of Data Stream):
Marks the beginning of TLP/DLLP data after training
(Gen 3+ only, replaces COM in data stream).
EDS (End of Data Stream):
Marks transition from data stream to ordered set
(Gen 3+ only).
Equalization (Gen 3+)
At 8 GT/s and above, signal quality degrades due to channel loss (frequency-dependent attenuation). Equalization compensates:
Equalization phases (Gen 3-5):
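Phase 0: Downstream port (e.g., Root Port) sends the initial TX
presets to the upstream port in EQ TS2 ordered sets before the
speed change. The upstream port then transmits at the new rate
using those presets (from a table of 11 presets, P0-P10, each
specifying pre-cursor, cursor, and post-cursor coefficients).
Phase 1: Both ports exchange TS1s at the new rate with the initial
presets until the link is reliable enough to finish tuning.
Phase 2: Upstream port requests specific TX presets/coefficients
from the downstream port via the TS1 equalization fields.
Downstream port adjusts its transmitter.
Phase 3: Downstream port requests specific TX presets/coefficients
from the upstream port (reverse direction equalization).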
TX equalization uses a 3-tap FIR (Finite Impulse Response) filter:
Output(n) = C(-1)*D(n+1) + C(0)*D(n) + C(+1)*D(n-1)
C(-1) = pre-cursor coefficient (weights the following symbol; compensates pre-cursor ISI)
C(0) = cursor coefficient (main signal amplitude)
C(+1) = post-cursor coefficient (weights the preceding symbol; compensates post-cursor ISI)
The 11 presets define specific {C(-1), C(0), C(+1)} combinations
optimized for different channel characteristics.
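A small numeric sketch makes the filter's effect visible. It assumes symbols are modeled as +1.0/-1.0 levels, that the pre-cursor tap weights the following symbol and the post-cursor tap weights the preceding one, and that symbols outside the buffer count as 0. The function name tx_fir and the coefficient values in the usage note are illustrative, not taken from the preset table.

```c
#include <stddef.h>

/*
 * 3-tap TX FIR applied to a buffer of symbol levels (+1.0 or -1.0).
 * c_pre and c_post are normally zero or negative; c_main is positive.
 * Symbols beyond either end of the buffer are treated as 0 (assumption).
 */
void tx_fir(double c_pre, double c_main, double c_post,
            const double *d, size_t n, double *out)
{
    for (size_t i = 0; i < n; i++) {
        double next = (i + 1 < n) ? d[i + 1] : 0.0; /* seen by pre-cursor tap  */
        double prev = (i > 0)     ? d[i - 1] : 0.0; /* seen by post-cursor tap */
        out[i] = c_pre * next + c_main * d[i] + c_post * prev;
    }
}
```

With c_pre = 0, c_main = 0.75, c_post = -0.25 and the input -1, +1, +1, +1, the first +1 (a transition bit) is driven at full swing 1.0 while the repeated +1 bits drop to 0.5. That is de-emphasis: high-frequency content is boosted relative to runs of identical bits, which counteracts the channel's low-pass behavior.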
Clock Architecture
Common clock (default for add-in cards):
Both ports use a common 100 MHz reference clock
distributed from the platform. ±300 ppm tolerance.
Simplifies CDR (Clock Data Recovery).
SRIS (Separate Reference clock with Independent SSC):
Each port has its own reference clock. SSC (Spread
Spectrum Clocking, 0 to -0.5% down-spread) applied independently.
Requires wider CDR bandwidth. Used in some embedded designs.
SRNS (Separate Reference clock with No SSC):
Separate clocks, no spread spectrum. Tighter frequency
tolerance required.
Data clock:
PCIe embeds the clock in the data stream via encoding
(8b/10b guarantees transitions, 128b/130b uses scrambling).
The receiver's CDR PLL recovers the clock from the data.
6. Configuration Space
Every PCIe function has a 4096-byte configuration space. The first 256 bytes are the standard PCI configuration space. Bytes 256-4095 are the extended configuration space (PCIe-only, accessed via ECAM).
Access Mechanisms
Legacy PCI mechanism (I/O ports, first 256 bytes only):
CONFIG_ADDRESS (0xCF8): Write the target BDF + register offset
CONFIG_DATA (0xCFC): Read/write the register value
CONFIG_ADDRESS format:
Bit 31 30:24 23:16 15:11 10:8 7:2 1:0
┌────┬──────────┬────────┬────────┬───────┬─────────┬─────┐
│ En │ Reserved │ Bus │ Dev │ Func │ Reg Ofs │ 0 │
│ [1]│ [7] │ [8] │ [5] │ [3] │ [6] │ [2] │
└────┴──────────┴────────┴────────┴───────┴─────────┴─────┘
Example (C):
uint32_t config_read32(uint8_t bus, uint8_t dev, uint8_t func,
uint8_t offset) {
uint32_t addr = (1u << 31) // enable
| ((uint32_t)bus << 16)
| ((uint32_t)dev << 11)
| ((uint32_t)func << 8)
| (offset & 0xFC); // DW-aligned
outl(addr, 0xCF8); // Linux <sys/io.h> argument order: outl(value, port)
return inl(0xCFC);
}
ECAM (Enhanced Configuration Access Mechanism):
Memory-mapped access to full 4096-byte space.
Base address provided by ACPI MCFG table.
Physical address = ECAM_Base
+ (Bus << 20)
+ (Device << 15)
+ (Function << 12)
+ Register_Offset
Each function gets a 4 KB page (4096 bytes = 2^12).
Each bus gets 1 MB (32 devices * 8 functions * 4 KB).
Max ECAM region = 256 MB (256 buses * 1 MB/bus).
Example:
// Map ECAM region
volatile uint8_t *ecam = mmap_ecam_base();
// Read 32 bits from Bus=3, Dev=0, Func=0, offset=0x10 (BAR0)
// Keep the volatile qualifier so the compiler performs the MMIO load
volatile uint32_t *reg = (volatile uint32_t *)(ecam + (3u << 20) + (0u << 15)
+ (0u << 12) + 0x10);
uint32_t bar0 = *reg;
Type 0 Header (Endpoint)
Offset Bits 31:24 Bits 23:16 Bits 15:8 Bits 7:0
────── ─────────── ─────────── ────────── ─────────
0x00 Device ID [31:16] Vendor ID [15:0]
0x04 Status [31:16] Command [15:0]
0x08 Class Code Subclass Prog IF Revision ID
0x0C BIST Header Type Lat Timer(0) Cache Line Sz
0x10 BAR0 [31:0]
0x14 BAR1 [31:0] (or upper 32 bits of BAR0 if 64-bit)
0x18 BAR2 [31:0]
0x1C BAR3 [31:0] (or upper 32 bits of BAR2 if 64-bit)
0x20 BAR4 [31:0]
0x24 BAR5 [31:0] (or upper 32 bits of BAR4 if 64-bit)
0x28 CardBus CIS Pointer [31:0] (legacy, usually 0)
0x2C Subsystem Device ID [31:16] Subsystem Vendor ID [15:0]
0x30 Expansion ROM Base Address [31:0]
0x34 Reserved [31:8] Capabilities Pointer [7:0]
0x38 Reserved [31:0]
0x3C Max Lat Min Gnt Interrupt Pin Interrupt Line
Command Register (0x04, 16 bits):
Bit Name Description
─── ──── ───────────
0 I/O Space Enable Allow I/O BAR access
1 Memory Space Enable Allow Memory BAR access
2 Bus Master Enable Allow device to initiate DMA
3 Special Cycles (hardwired 0 in PCIe)
4 MW Invalidate En (hardwired 0 in PCIe)
5 VGA Palette Snoop (hardwired 0 in PCIe)
6 Parity Error Resp Enable parity error signaling
7 IDSEL Stepping (hardwired 0 in PCIe)
8 SERR# Enable Enable SERR# driver
9 Fast B2B Enable (hardwired 0 in PCIe)
10 Interrupt Disable Disable INTx assertion
15:11 Reserved
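The bits a driver or emulator touches most often are Memory Space Enable, Bus Master Enable, and Interrupt Disable. The following sketch names them and shows the typical read-modify-write value; the macro and function names are made up for this example (Linux uses PCI_COMMAND_* in pci_regs.h for the same bits).

```c
#include <stdint.h>

#define PCI_CMD_IO_ENABLE    (1u << 0)
#define PCI_CMD_MEM_ENABLE   (1u << 1)
#define PCI_CMD_BUS_MASTER   (1u << 2)
#define PCI_CMD_PARITY_RESP  (1u << 6)
#define PCI_CMD_SERR_ENABLE  (1u << 8)
#define PCI_CMD_INTX_DISABLE (1u << 10)

/* Value to write back so the device decodes its memory BARs, may issue
 * DMA, and keeps INTx quiet because MSI/MSI-X will be used instead. */
uint16_t cmd_enable_for_msi(uint16_t cmd)
{
    return (uint16_t)(cmd | PCI_CMD_MEM_ENABLE | PCI_CMD_BUS_MASTER |
                      PCI_CMD_INTX_DISABLE);
}

/* Emulator-side checks: ignore MMIO while Memory Space Enable is clear,
 * and refuse to generate DMA while Bus Master Enable is clear. */
int cmd_mmio_decoded(uint16_t cmd) { return (cmd & PCI_CMD_MEM_ENABLE) != 0; }
int cmd_dma_allowed(uint16_t cmd)  { return (cmd & PCI_CMD_BUS_MASTER) != 0; }
```

Starting from a reset value of 0x0000, cmd_enable_for_msi returns 0x0406. Note that MSI/MSI-X writes are themselves memory writes issued by the device, so they are also gated by Bus Master Enable.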
Status Register (0x06, 16 bits):
Bit Name Description
─── ──── ───────────
0 Immediate Readiness (PCIe) Device ready immediately
3 Interrupt Status INTx asserted (read-only)
4 Capabilities List Always 1 for PCIe (has capabilities)
5 66 MHz Capable (hardwired 0 in PCIe)
7 Fast B2B Capable (hardwired 0 in PCIe)
8 Master Data Parity Err Set when Bus Master detects parity err
10:9 DEVSEL Timing (hardwired 0 in PCIe)
11 Signaled Target Abort Set when device sends a Completer Abort (CA) completion
12 Received Target Abort Set when requester receives a CA completion
13 Received Master Abort Set when requester receives a UR completion
14 Signaled System Error Set when device sends ERR_FATAL/NONFATAL (SERR# Enable set)
15 Detected Parity Error Set when device receives a poisoned TLP
Header Type (0x0E):
Bit 7: Multi-function device (if 1, scan all 8 functions)
Bits 6:0: Header layout type
0x00 = Type 0 (Endpoint)
0x01 = Type 1 (PCI-to-PCI Bridge)
0x02 = Type 2 (CardBus Bridge) — obsolete
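Enumeration code reads this byte from function 0 to decide both how to parse the rest of the header and whether to probe functions 1-7. A minimal decoding sketch (the struct and function names are illustrative):

```c
#include <stdint.h>

struct header_type {
    unsigned layout;      /* 0 = endpoint, 1 = bridge, 2 = CardBus */
    int multi_function;   /* nonzero if bit 7 is set */
};

struct header_type decode_header_type(uint8_t raw)
{
    struct header_type h;
    h.layout = raw & 0x7Fu;
    h.multi_function = (raw & 0x80u) != 0;
    return h;
}

/* How many functions an enumerator should probe on a device, given the
 * Header Type byte read from function 0. */
unsigned functions_to_scan(uint8_t fn0_header_type)
{
    return (fn0_header_type & 0x80u) ? 8u : 1u;
}
```

For example, a raw value of 0x81 decodes as a multi-function Type 1 (bridge) header, so all eight function numbers are probed.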
Type 1 Header (Bridge / Root Port / Switch Port)
Offset Bits 31:24 Bits 23:16 Bits 15:8 Bits 7:0
────── ─────────── ─────────── ────────── ─────────
0x00 Device ID [31:16] Vendor ID [15:0]
0x04 Status [31:16] Command [15:0]
0x08 Class Code Subclass Prog IF Revision ID
0x0C BIST Header Type Lat Timer(0) Cache Line Sz
0x10 BAR0 [31:0]
0x14 BAR1 [31:0]
0x18 Sec Lat Timer Subordinate Secondary Primary
Bus Number Bus Number Bus Number
0x1C Secondary Status [31:16] I/O Limit I/O Base
0x20 Memory Limit [31:16] Memory Base [15:0]
0x24 Pref Mem Limit [31:16] Pref Mem Base [15:0]
0x28 Prefetchable Base Upper 32 bits
0x2C Prefetchable Limit Upper 32 bits
0x30 I/O Limit Upper 16 I/O Base Upper 16
0x34 Reserved [31:8] Capabilities Pointer [7:0]
0x38 Expansion ROM Base Address [31:0]
0x3C Bridge Ctrl [31:16] Interrupt Pin Interrupt Line
Bus Number Registers (0x18):
Primary Bus Number [7:0]:
Bus number of the bus on the upstream (CPU-facing) side.
Secondary Bus Number [15:8]:
Bus number of the bus immediately downstream of this bridge.
The bridge decodes Type 1 config transactions targeting this
bus number, converts them to Type 0, and forwards downstream.
Subordinate Bus Number [23:16]:
Highest bus number of any bus downstream of this bridge.
The bridge forwards Type 1 config transactions for bus numbers
in [Secondary, Subordinate] range downstream.
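The routing rule above reduces to two comparisons. This sketch, with hypothetical enum and function names, is the decision an emulated bridge makes for each configuration request arriving on its primary side:

```c
#include <stdint.h>

enum cfg_route {
    CFG_NOT_FORWARDED,     /* target bus is not behind this bridge */
    CFG_CONVERT_TO_TYPE0,  /* target is the secondary bus itself */
    CFG_FORWARD_TYPE1      /* target is deeper in the hierarchy */
};

enum cfg_route bridge_route_config(uint8_t secondary, uint8_t subordinate,
                                   uint8_t target_bus)
{
    if (target_bus == secondary)
        return CFG_CONVERT_TO_TYPE0;
    if (target_bus > secondary && target_bus <= subordinate)
        return CFG_FORWARD_TYPE1;
    return CFG_NOT_FORWARDED;
}
```

With Secondary = 1 and Subordinate = 3, a request for bus 1 is converted to Type 0, requests for buses 2 and 3 are passed down unchanged as Type 1, and anything else is not claimed by this bridge.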
Memory Base/Limit (0x20):
Bits 15:4 of Memory Base → bits [31:20] of start address
(bits [19:0] are 0 → 1 MB alignment, 1 MB granularity)
Bits 15:4 of Memory Limit → bits [31:20] of end address
(bits [19:0] are 0xFFFFF → inclusive upper bound)
The bridge forwards memory TLPs with addresses in
[Memory Base, Memory Limit + 0xFFFFF] downstream.
TLPs outside this range are forwarded upstream.
Example: Memory Base = 0xF000, Memory Limit = 0xF7F0
Window = 0xF000_0000 to 0xF7FF_FFFF (128 MB window)
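Decoding the two 16-bit registers into an address range is a frequent source of off-by-one mistakes, so here is a small sketch. The struct and function names are illustrative; the rule that a base above the limit means "no window" is standard bridge behavior.

```c
#include <stdint.h>

struct mem_window {
    uint32_t start;   /* first address forwarded downstream */
    uint32_t end;     /* last address forwarded downstream (inclusive) */
    int enabled;      /* 0 when base > limit, i.e. no window */
};

struct mem_window decode_mem_window(uint16_t base_reg, uint16_t limit_reg)
{
    struct mem_window w;
    /* Register bits 15:4 become address bits 31:20. */
    w.start = (uint32_t)(base_reg & 0xFFF0u) << 16;
    w.end   = ((uint32_t)(limit_reg & 0xFFF0u) << 16) | 0x000FFFFFu;
    w.enabled = w.start <= w.end;
    return w;
}

int mem_window_contains(const struct mem_window *w, uint32_t addr)
{
    return w->enabled && addr >= w->start && addr <= w->end;
}
```

For instance, base 0xE000 with limit 0xE0F0 decodes to 0xE000_0000 through 0xE0FF_FFFF, a 16 MB window. Firmware commonly closes an unused window by programming base 0xFFF0 and limit 0x0000.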
Prefetchable Memory Base/Limit (0x24-0x2C):
Same concept but supports 64-bit addresses:
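Prefetchable Memory Base (16 bits) at offset 0x24 and Prefetchable
Memory Limit (16 bits) at offset 0x26 hold address bits [31:20],
encoded like Memory Base/Limit. A low nibble of 0x1 in these
registers indicates 64-bit support.
Upper 32 bits of the base are at offset 0x28; upper 32 bits of
the limit are at offset 0x2C.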
Prefetchable memory: Safe to prefetch (no side effects on read).
Typically used for framebuffers, large DMA buffers.
Non-prefetchable: Reads may have side effects (MMIO registers).
Use the standard Memory Base/Limit window.
I/O Base/Limit (0x1C, 0x30):
Bits 7:4 of I/O Base → bits [15:12] of start address
(4 KB alignment, 4 KB granularity)
If bits 3:0 = 0x1, indicates 32-bit I/O addressing
(upper 16 bits at offset 0x30).
If bits 3:0 = 0x0, indicates 16-bit I/O addressing.
PCIe endpoints rarely use I/O BARs (discouraged).
Bridge Control Register (0x3E, 16 bits):
Bit Name Description
─── ──── ───────────
0 Parity Error Response Forward parity errors
1 SERR# Enable Enable SERR# for secondary bus
2 ISA Enable Enable ISA I/O filtering
3 VGA Enable Forward VGA I/O and memory
4 VGA 16-bit Decode VGA alias decoding
5 Master Abort Mode (hardwired 0 in PCIe)
6 Secondary Bus Reset Assert reset on secondary bus
8 Primary Discard Timer (hardwired 0 in PCIe)
9 Secondary Discard Timer (hardwired 0 in PCIe)
10 Discard Timer Status (hardwired 0 in PCIe)
11 Discard Timer SERR# (hardwired 0 in PCIe)
Extended Configuration Space (0x100 - 0xFFF)
Accessible only via ECAM (not via legacy I/O port mechanism). Contains Extended Capabilities in a linked list similar to standard capabilities but with a different header format:
Offset 0x100+:
Extended Capability Header (4 bytes):
Bit 31:20 19:16 15:0
┌──────────┬───────────┬──────────────────┐
│Next Cap │ Cap Ver │ Cap ID │
│Offset │ [3:0] │ [15:0] │
│[11:0] │ │ │
└──────────┴───────────┴──────────────────┘
Next Cap Offset: Pointer to next extended capability (0 = end).
Cap Version: Version of this capability.
Extended Capability ID: 16-bit ID (vs 8-bit for standard caps).
Common Extended Capability IDs:
0x0001 = AER (Advanced Error Reporting)
0x0002 = Virtual Channel (VC)
0x0003 = Device Serial Number
0x0004 = Power Budgeting
0x0005 = Root Complex Link Declaration
0x000D = ACS (Access Control Services)
0x000E = ARI (Alternative Routing-ID Interpretation)
0x0010 = SR-IOV (Single Root I/O Virtualization)
0x0015 = Resizable BAR
0x0017 = TPH (TLP Processing Hints)
0x0018 = LTR (Latency Tolerance Reporting)
0x0019 = Secondary PCI Express
0x001D = DPC (Downstream Port Containment)
0x001E = L1 PM Substates
0x001F = PTM (Precision Time Measurement)
0x0023 = Designated Vendor-Specific
0x0025 = Data Link Feature
0x0026 = Physical Layer 16.0 GT/s
0x002A = Physical Layer 32.0 GT/s
0x002E = DOE (Data Object Exchange)
0x0030 = IDE (Integrity and Data Encryption)
0x0031 = Physical Layer 64.0 GT/s
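Walking the extended capability list follows directly from the header layout. The sketch below operates on a 4096-byte snapshot of a function's configuration space (for example, one read through ECAM or from sysfs) rather than on live hardware; find_ext_cap and cfg_read32 are names invented for this example. The iteration budget guards against looped or corrupted lists.

```c
#include <stdint.h>

/* Read a little-endian 32-bit register from a config-space snapshot. */
static uint32_t cfg_read32(const uint8_t *cfg, uint16_t off)
{
    return (uint32_t)cfg[off] |
           ((uint32_t)cfg[off + 1] << 8) |
           ((uint32_t)cfg[off + 2] << 16) |
           ((uint32_t)cfg[off + 3] << 24);
}

/* Returns the offset of the extended capability with the given ID,
 * or 0 if it is not present. */
uint16_t find_ext_cap(const uint8_t cfg[4096], uint16_t cap_id)
{
    uint16_t off = 0x100;
    int budget = (4096 - 0x100) / 8;    /* each capability is >= 8 bytes */

    while (off >= 0x100 && budget-- > 0) {
        uint32_t hdr = cfg_read32(cfg, off);
        if (hdr == 0 || hdr == 0xFFFFFFFFu)
            return 0;                   /* empty list or absent device */
        if ((hdr & 0xFFFFu) == cap_id)
            return off;
        off = (uint16_t)((hdr >> 20) & 0xFFCu); /* next pointer, DW-aligned */
    }
    return 0;
}
```

Calling find_ext_cap(cfg, 0x0010) on a PF snapshot returns the SR-IOV capability offset; a return of 0 means the function does not advertise it.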
7. Base Address Registers (BARs)
BARs define the memory or I/O address ranges that a device's registers occupy. The host CPU accesses device registers by reading/writing to these addresses.
BAR Format
Memory BAR (bit 0 = 0):
Bit 31 4 3 2 1 0
┌────────────────────────────────────────┬──┬──┬──┐
│ Base Address [31:4] │P │Ty│ 0│
│ (16-byte aligned minimum) │ │pe│ │
└────────────────────────────────────────┴──┴──┴──┘
Bit 0: 0 = Memory Space
Bit 2:1 (Type):
00 = 32-bit address, locatable anywhere in 32-bit space
10 = 64-bit address, uses this BAR + next BAR for full address
(01 and 11 are reserved)
Bit 3 (Prefetchable):
0 = Non-prefetchable (MMIO with side effects, uncacheable)
1 = Prefetchable (safe to read ahead, write-combinable)
I/O BAR (bit 0 = 1):
Bit 31 2 1 0
┌──────────────────────────────────┬──┬──┐
│ Base Address [31:2] │ R│ 1│
│ (4-byte aligned) │ │ │
└──────────────────────────────────┴──┴──┘
Bit 0: 1 = I/O Space
Bit 1: Reserved
BAR Sizing Algorithm
The firmware/OS determines BAR size by:
Note: clear Memory/I/O Space Enable in the Command register around
steps 2-4 so the device does not decode the temporary all-ones address.
1. Save original BAR value:
    original = config_read32(bus, dev, func, BAR_OFFSET);
2. Write all 1s to the BAR:
    config_write32(bus, dev, func, BAR_OFFSET, 0xFFFFFFFF);
3. Read back the BAR:
    readback = config_read32(bus, dev, func, BAR_OFFSET);
4. Restore original value:
    config_write32(bus, dev, func, BAR_OFFSET, original);
5. Determine type and size:
    if (readback == 0) {
        // BAR is not implemented (device doesn't use this BAR)
        return;
    }
    if (readback & 1) {
        // I/O BAR
        mask = readback & 0xFFFFFFFC;      // clear type bits
        if ((mask & 0xFFFF0000) == 0)
            mask |= 0xFFFF0000;            // device decodes only 16 I/O address bits
        size = ~mask + 1;                  // invert and add 1
    } else if (((readback >> 1) & 3) == 2) {
        // 64-bit memory BAR. The parentheses matter: == binds tighter than &.
        // Size the upper half in BAR[n+1] the same way, saving it first.
        original_upper = config_read32(bus, dev, func, BAR_OFFSET+4);
        config_write32(bus, dev, func, BAR_OFFSET+4, 0xFFFFFFFF);
        upper = config_read32(bus, dev, func, BAR_OFFSET+4);
        config_write32(bus, dev, func, BAR_OFFSET+4, original_upper);
        // Combine both halves before inverting
        uint64_t mask64 = ((uint64_t)upper << 32) | (readback & 0xFFFFFFF0);
        uint64_t size64 = ~mask64 + 1;
    } else {
        // 32-bit memory BAR
        mask = readback & 0xFFFFFFF0;      // clear type/prefetch bits
        size = ~mask + 1;
    }
Example: NVMe controller BAR0
Write 0xFFFFFFFF → BAR0
Read back: 0xFFFFC004
Bit 0 = 0 → Memory BAR
Bits 2:1 = 10 → 64-bit
Bit 3 = 0 → Non-prefetchable
Mask = 0xFFFFC000
Size = ~0xFFFFC000 + 1 = 0x00004000 = 16 KB
This is a 64-bit, non-prefetchable, 16 KB memory BAR.
The device's NVMe registers (doorbell array, controller
registers) are mapped into this 16 KB window.
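The arithmetic in the example can be isolated from the config-space accesses and tested on its own. This sketch takes the all-ones readback values as inputs; mem_bar_size and io_bar_size are hypothetical helper names, and the I/O helper assumes a device that implements only 16 address bits reads back zero in the upper half.

```c
#include <stdint.h>

/* Size in bytes implied by the all-ones readback of a memory BAR.
 * lo: readback of BAR[n]; hi: readback of BAR[n+1] (used only when the
 * type field says 64-bit). Returns 0 for an unimplemented or I/O BAR. */
uint64_t mem_bar_size(uint32_t lo, uint32_t hi)
{
    if (lo == 0 || (lo & 1u))
        return 0;
    uint64_t mask = lo & 0xFFFFFFF0u;          /* drop type/prefetch bits */
    if (((lo >> 1) & 3u) == 2u)
        mask |= (uint64_t)hi << 32;            /* 64-bit: use upper half */
    else
        mask |= 0xFFFFFFFF00000000ull;         /* 32-bit: upper bits fixed */
    return ~mask + 1;
}

/* Size in bytes implied by the all-ones readback of an I/O BAR. */
uint32_t io_bar_size(uint32_t readback)
{
    if (!(readback & 1u))
        return 0;
    uint32_t mask = readback & 0xFFFFFFFCu;
    if ((mask & 0xFFFF0000u) == 0)
        mask |= 0xFFFF0000u;                   /* only 16 bits decoded */
    return ~mask + 1;
}
```

Feeding in the NVMe readback above, mem_bar_size(0xFFFFC004, 0xFFFFFFFF) yields 0x4000 (16 KB). A large GPU BAR that reads back lo = 0x0000000C, hi = 0xFFFFFFFE yields 8 GB, which shows why the lower half alone is not enough for 64-bit BARs.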
BAR Assignment
After sizing, firmware/OS assigns addresses:
For each device, for each BAR:
1. Determine size via sizing algorithm
2. Allocate a naturally-aligned region from the MMIO pool
(address must be aligned to BAR size)
3. Write the allocated base address into the BAR
4. For bridges: ensure the bridge's memory window encompasses
all downstream BAR assignments
Example BAR assignment for a hierarchy:
Root Port (Bus 0, Memory Window: 0xF000_0000 - 0xF7FF_FFFF)
└── NVMe SSD (Bus 1, Dev 0)
BAR0 = 0xF000_0000 (16 KB, 64-bit, non-prefetchable)
└── NIC (Bus 1, Dev 1)
BAR0 = 0xF010_0000 (1 MB, 64-bit, non-prefetchable)
BAR2 = 0xF020_0000 (64 KB, 64-bit, prefetchable)
Root Port memory base/limit programmed to cover 0xF000_0000
to 0xF02F_FFFF (or wider, aligned to 1 MB granularity).
For 64-bit BARs:
BAR[n] holds lower 32 bits of address
BAR[n+1] holds upper 32 bits
This consumes two BAR slots (e.g., BAR0+BAR1, BAR2+BAR3)
So a Type 0 header with 6 BAR slots can have at most 3
64-bit BARs, or 6 32-bit BARs, or a mix.
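Natural alignment is the only subtle part of assignment. A minimal bump allocator over an MMIO pool illustrates it; this is a sketch with invented names (mmio_pool, pool_alloc_bar), it assumes power-of-two sizes, and it uses 0 as the failure value, which real firmware would replace with a proper error path.

```c
#include <stdint.h>

/* Simple bump allocator over an MMIO address pool. */
struct mmio_pool {
    uint64_t next;   /* next free address */
    uint64_t end;    /* end of pool (exclusive) */
};

/* Returns a base address aligned to `size`, or 0 on failure
 * (size not a power of two, or pool exhausted). */
uint64_t pool_alloc_bar(struct mmio_pool *p, uint64_t size)
{
    if (size == 0 || (size & (size - 1)) != 0)
        return 0;
    uint64_t base = (p->next + size - 1) & ~(size - 1);  /* align up */
    if (base < p->next || base > p->end || size > p->end - base)
        return 0;
    p->next = base + size;
    return base;
}
```

Allocating 16 KB, then 1 MB, then 64 KB from a pool starting at 0xF000_0000 returns 0xF000_0000, 0xF010_0000, and 0xF020_0000, matching the hierarchy example above. The gap between the 16 KB BAR and the 1 MB BAR is alignment padding; allocators that sort BARs largest-first waste less space.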
Prefetchable vs Non-Prefetchable
Non-prefetchable (MMIO registers):
- Read side effects possible (reading a status register may
clear it, reading a FIFO pops an entry)
- Must be mapped uncacheable (UC) by the CPU
- Writes are not combinable
- Bridge forwards only to non-prefetchable memory window
Prefetchable (device memory, framebuffers):
- Reads are idempotent (no side effects)
- CPU can use Write-Combining (WC) or even cacheable mapping
- Multiple writes to adjacent addresses can be merged
- Bridge can use the prefetchable memory window
- Improves CPU access throughput for bulk data
Gotcha: If a device marks a BAR as prefetchable but the registers
actually have side effects, prefetched or merged accesses can trigger
those side effects unexpectedly (for example, a speculative read
popping a FIFO entry). Conversely, marking a framebuffer as
non-prefetchable wastes performance (no write combining).
8. PCIe Capabilities
PCIe capabilities extend the configuration space with optional features. They form a linked list starting from the Capabilities Pointer register (offset 0x34).
Capability List Structure
Standard Capability Header (2 bytes):
Offset+0x00:
Bit 15:8 7:0
┌────────────┬───────────┐
│ Next Cap │ Cap ID │
│ Pointer │ │
│ (offset) │ │
└────────────┴───────────┘
Next Pointer: Offset of next capability in config space.
0x00 = end of list.
Must be DWORD-aligned (bits 1:0 are reserved/zero).
Cap ID: Identifies the capability type.
Standard Capability IDs:
0x01 = PCI Power Management Interface (PMI)
0x05 = MSI (Message Signaled Interrupts)
0x10 = PCI Express Capability
0x11 = MSI-X
0x12 = SATA Data/Index Configuration
0x13 = Advanced Features (AF)
0x14 = Enhanced Allocation (EA)
Walking the capability list:
uint8_t cap_ptr = config_read8(dev, 0x34) & 0xFC;
while (cap_ptr != 0) {
uint8_t cap_id = config_read8(dev, cap_ptr);
uint8_t next = config_read8(dev, cap_ptr + 1) & 0xFC;
printf("Capability 0x%02x at offset 0x%02x\n", cap_id, cap_ptr);
if (cap_id == target_cap) {
// Found it — read capability-specific registers
break;
}
cap_ptr = next;
}
PCI Express Capability (ID 0x10)
The most important capability for any PCIe device. Always present.
Offset Register Size
────── ──────── ────
+0x00 Cap ID (0x10) | Next Ptr 2B
+0x02 PCIe Capabilities Reg 2B
+0x04 Device Capabilities 4B
+0x08 Device Control 2B
+0x0A Device Status 2B
+0x0C Link Capabilities 4B
+0x10 Link Control 2B
+0x12 Link Status 2B
+0x14 Slot Capabilities 4B (Root/Switch Downstream ports only)
+0x18 Slot Control 2B
+0x1A Slot Status 2B
+0x1C Root Control 2B (Root Ports only)
+0x1E Root Capabilities 2B
+0x20 Root Status 4B
+0x24 Device Capabilities 2 4B
+0x28 Device Control 2 2B
+0x2A Device Status 2 2B
+0x2C Link Capabilities 2 4B
+0x30 Link Control 2 2B
+0x32 Link Status 2 2B
+0x34 Slot Capabilities 2 4B
+0x38 Slot Control 2 2B
+0x3A Slot Status 2 2B
PCIe Capabilities Register (+0x02):
Bits 3:0: Capability Version (current = 2)
Bits 7:4: Device/Port Type:
0000 = PCI Express Endpoint
0001 = Legacy PCI Express Endpoint
0100 = Root Port
0101 = Upstream Port of Switch
0110 = Downstream Port of Switch
0111 = PCI Express to PCI/PCI-X Bridge
1000 = PCI/PCI-X to PCI Express Bridge
1001 = Root Complex Integrated Endpoint
1010 = Root Complex Event Collector
Bit 8: Slot Implemented
Bits 13:9: Interrupt Message Number (for MSI/MSI-X)
Device Capabilities (+0x04):
Bits 2:0: Max_Payload_Size Supported (encoded: 0=128B, 1=256B,
2=512B, 3=1024B, 4=2048B, 5=4096B)
Bits 4:3: Phantom Functions Supported
Bit 5: Extended Tag Field Supported (8-bit tags; 10-bit tag
support is reported in Device Capabilities 2)
Bits 8:6: Endpoint L0s Acceptable Latency
Bits 11:9: Endpoint L1 Acceptable Latency
Bit 15: Role-Based Error Reporting
Bit 28: FLR (Function Level Reset) Capable
Device Control (+0x08):
Bit 0: Correctable Error Reporting Enable
Bit 1: Non-Fatal Error Reporting Enable
Bit 2: Fatal Error Reporting Enable
Bit 3: Unsupported Request Reporting Enable
Bit 4: Enable Relaxed Ordering
Bits 7:5: Max_Payload_Size (must not exceed Device Cap)
Bit 8: Extended Tag Field Enable
Bit 9: Phantom Functions Enable
Bit 10: AUX Power PM Enable
Bit 11: Enable No Snoop
Bits 14:12: Max_Read_Request_Size
Bit 15: Initiate FLR (Function Level Reset) — write-only
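The payload and read-request size fields use the same power-of-two encoding, which is easy to get wrong by one step. A short sketch of the encode/decode helpers (function names are illustrative):

```c
#include <stdint.h>

/* Encoded size field (0..5) to bytes: 0 = 128 B ... 5 = 4096 B. */
unsigned mps_bytes(unsigned code)
{
    return 128u << code;
}

/* Returns a Device Control value with Max_Payload_Size (bits 7:5) set to
 * the requested code, clamped to what Device Capabilities (bits 2:0)
 * says the function supports. Other bits are preserved. */
uint16_t devctl_set_mps(uint16_t devctl, uint32_t devcap, unsigned want_code)
{
    unsigned supported = devcap & 0x7u;
    if (want_code > supported)
        want_code = supported;
    return (uint16_t)((devctl & ~(0x7u << 5)) | (want_code << 5));
}

/* Max_Read_Request_Size (bits 14:12) in bytes. */
unsigned devctl_mrrs_bytes(uint16_t devctl)
{
    return 128u << ((devctl >> 12) & 0x7u);
}
```

In practice the OS also has to keep Max_Payload_Size consistent along the whole path: a sender must not emit a TLP larger than any receiver on the route accepts, so the effective value is usually the minimum supported across the hierarchy.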
Device Status (+0x0A):
Bit 0: Correctable Error Detected
Bit 1: Non-Fatal Error Detected
Bit 2: Fatal Error Detected
Bit 3: Unsupported Request Detected
Bit 4: AUX Power Detected
Bit 5: Transactions Pending (device has outstanding non-posted)
Link Capabilities (+0x0C):
Bits 3:0: Max Link Speed (1=2.5GT/s, 2=5GT/s, 3=8GT/s,
4=16GT/s, 5=32GT/s, 6=64GT/s)
Bits 9:4: Maximum Link Width (1,2,4,8,12,16,32)
Bits 11:10: ASPM Support (bit0=L0s, bit1=L1)
Bits 14:12: L0s Exit Latency (encoded, 64ns to >4us)
Bits 17:15: L1 Exit Latency (encoded, 1us to >64us)
Bit 18: Clock Power Management
Bit 19: Surprise Down Error Reporting Capable
Bit 20: Data Link Layer Link Active Reporting Capable
Bit 21: Link Bandwidth Notification Capable
Bit 22: ASPM Optionality Compliance
Bits 31:24: Port Number
Link Control (+0x10):
Bits 1:0: ASPM Control (00=disabled, 01=L0s, 10=L1, 11=L0s+L1)
Bit 3: Read Completion Boundary (0=64B, 1=128B)
Bit 4: Link Disable
Bit 5: Retrain Link (write 1 to initiate retraining)
Bit 6: Common Clock Configuration
Bit 7: Extended Synch
Bit 8: Enable Clock Power Management
Bit 9: Hardware Autonomous Width Disable
Bit 10: Link Bandwidth Management Interrupt Enable
Bit 11: Link Autonomous Bandwidth Interrupt Enable
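Bits 15:12: Reserved in the base definition (equalization is
controlled through Link Control 3 in the Secondary PCI Express
extended capability, not here)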
Link Status (+0x12):
Bits 3:0: Current Link Speed (same encoding as Cap)
Bits 9:4: Negotiated Link Width
Bit 11: Link Training (1 = training in progress)
Bit 12: Slot Clock Configuration
Bit 13: Data Link Layer Link Active (DL_Active)
Bit 14: Link Bandwidth Management Status
Bit 15: Link Autonomous Bandwidth Status
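Decoding Link Status is the usual first step when diagnosing a link that trained at a lower speed or width than expected. A sketch, with an invented struct and helper names:

```c
#include <stdint.h>

struct link_status {
    unsigned speed_code;  /* 1 = 2.5 GT/s ... 6 = 64 GT/s */
    unsigned width;       /* negotiated lanes */
    int training;         /* bit 11 */
    int dl_active;        /* bit 13 */
};

struct link_status decode_link_status(uint16_t lnksta)
{
    struct link_status s;
    s.speed_code = lnksta & 0xFu;
    s.width      = (lnksta >> 4) & 0x3Fu;
    s.training   = (lnksta >> 11) & 1u;
    s.dl_active  = (lnksta >> 13) & 1u;
    return s;
}

/* Per-lane rate in tenths of GT/s (25 = 2.5 GT/s); 0 if the code is
 * not one of the values listed for Link Capabilities. */
unsigned speed_code_to_tenths_gts(unsigned code)
{
    static const unsigned tbl[] = { 0, 25, 50, 80, 160, 320, 640 };
    return code < sizeof tbl / sizeof tbl[0] ? tbl[code] : 0;
}
```

A Link Status of 0x2043 decodes as 8 GT/s, x4, not training, Data Link Layer active. Comparing these fields against Max Link Speed and Maximum Link Width in Link Capabilities reveals a degraded link.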
Power Management Capability (ID 0x01)
Offset Register Size
────── ──────── ────
+0x00 Cap ID (0x01) | Next 2B
+0x02 PMC (PM Capabilities) 2B
+0x04 PMCSR (PM Ctrl/Status) 2B
+0x06 PMCSR_BSE Bridge Ext 1B
+0x07 Data 1B
PMC (+0x02):
Bits 2:0: Version (3 for PCIe)
Bit 3: PME Clock (0 for PCIe)
Bit 5: DSI (Device Specific Init)
Bits 8:6: AUX Current (for D3cold PME)
Bit 9: D1 Support
Bit 10: D2 Support
Bits 15:11: PME Support (which D-states can generate PME)
PMCSR (+0x04):
Bits 1:0: PowerState (00=D0, 01=D1, 10=D2, 11=D3hot)
Bit 3: No Soft Reset
Bit 8: PME Enable
Bits 12:9: Data Select
Bits 14:13: Data Scale
Bit 15: PME Status (write 1 to clear)
D-state summary:
D0: Fully operational. All functions available.
D1: Light sleep. Context preserved. Quick wake.
D2: Deeper sleep. More context may be lost.
D3hot: Deep sleep. Only config space accessible.
Software can transition to D0 via PMCSR.
D3cold: Power removed. Only aux power (for wake).
Requires full power-on reset to return to D0.
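Changing D-state is a read-modify-write of PMCSR, with one trap: PME Status is write-1-to-clear, so writing back a value that has bit 15 set silently discards a pending wake event. A sketch of the safe update follows; the enum and function names are made up for illustration.

```c
#include <stdint.h>

enum pci_dstate { PCI_D0 = 0, PCI_D1 = 1, PCI_D2 = 2, PCI_D3HOT = 3 };

/* Returns the PMCSR value to write in order to request state `s`.
 * PME Status (bit 15, RW1C) is forced to 0 so that the write does not
 * clear a pending PME by accident. All other bits are preserved. */
uint16_t pmcsr_set_state(uint16_t pmcsr, enum pci_dstate s)
{
    pmcsr &= (uint16_t)~(0x3u | (1u << 15));
    return (uint16_t)(pmcsr | (unsigned)s);
}

/* D1 and D2 are optional and advertised in PMC bits 9 and 10;
 * D0 and D3hot are always supported. */
int pmc_supports(uint16_t pmc, enum pci_dstate s)
{
    if (s == PCI_D1) return (pmc >> 9) & 1;
    if (s == PCI_D2) return (pmc >> 10) & 1;
    return 1;
}
```

After writing D3hot or returning to D0, software must also wait the spec-mandated settle time before touching the function again (Linux uses a 10 ms delay for D3hot transitions); that delay is outside the scope of this sketch.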
MSI Capability (ID 0x05)
See Section 9: Interrupt Mechanisms for detailed MSI/MSI-X coverage.
AER — Advanced Error Reporting (Extended Cap ID 0x0001)
Offset Register Size
────── ──────── ────
+0x00 Extended Cap Header 4B
+0x04 Uncorrectable Error Status 4B
+0x08 Uncorrectable Error Mask 4B
+0x0C Uncorrectable Error Severity 4B
+0x10 Correctable Error Status 4B
+0x14 Correctable Error Mask 4B
+0x18 Advanced Error Capabilities/Ctrl 4B
+0x1C Header Log (4 DW) 16B
+0x2C Root Error Command 4B (Root Ports only)
+0x30 Root Error Status 4B (Root Ports only)
+0x34 Error Source ID 4B (Root Ports only)
+0x38 TLP Prefix Log (4 DW) 16B (if supported)
Uncorrectable Error Status/Mask/Severity (+0x04/+0x08/+0x0C):
Bit Error
─── ─────
4 Data Link Protocol Error
12 Poisoned TLP Received
13 Flow Control Protocol Error
14 Completion Timeout
15 Completer Abort
16 Unexpected Completion
17 Receiver Overflow
18 Malformed TLP
19 ECRC Error
20 Unsupported Request Error
21 ACS Violation
22 Uncorrectable Internal Error
24 AtomicOp Egress Blocked
25 TLP Prefix Blocked Error
26 Poisoned TLP Egress Blocked
Severity: 0 = non-fatal, 1 = fatal
Default fatal: Data Link Protocol Error, Surprise Down Error
(bit 5), Flow Control Protocol Error, Receiver Overflow,
Malformed TLP, Uncorrectable Internal Error.
Default non-fatal: all others.
Correctable Error Status/Mask (+0x10/+0x14):
Bit Error
─── ─────
0 Receiver Error
6 Bad TLP
7 Bad DLLP
8 Replay Number Rollover
12 Replay Timer Timeout
13 Advisory Non-Fatal Error
14 Corrected Internal Error
15 Header Log Overflow
Header Log (+0x1C):
Captures the first 4 DW of the TLP that caused the error.
Invaluable for debugging — you can see the Requester ID,
address, type, and other fields of the offending TLP.
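When reading AER registers by hand (for example from lspci -vvv output or a config-space dump), a small lookup saves repeated trips to the bit table. The sketch below covers a subset of the uncorrectable bits listed above; the table and function names are invented here. It also shows how Status, Mask, and Severity combine to decide whether an error is reported as fatal.

```c
#include <stdint.h>
#include <stddef.h>

struct aer_bit { unsigned bit; const char *name; };

static const struct aer_bit aer_uncor_bits[] = {
    {  4, "Data Link Protocol Error" },
    { 12, "Poisoned TLP Received" },
    { 13, "Flow Control Protocol Error" },
    { 14, "Completion Timeout" },
    { 15, "Completer Abort" },
    { 16, "Unexpected Completion" },
    { 17, "Receiver Overflow" },
    { 18, "Malformed TLP" },
    { 19, "ECRC Error" },
    { 20, "Unsupported Request Error" },
    { 21, "ACS Violation" },
    { 22, "Uncorrectable Internal Error" },
};

/* Name of the lowest-numbered known bit set in Uncorrectable Error
 * Status, or NULL if none of the known bits is set. */
const char *aer_lowest_uncor_name(uint32_t status)
{
    size_t n = sizeof aer_uncor_bits / sizeof aer_uncor_bits[0];
    for (size_t i = 0; i < n; i++)
        if (status & (1u << aer_uncor_bits[i].bit))
            return aer_uncor_bits[i].name;
    return NULL;
}

/* Nonzero if any unmasked error currently logged is configured as fatal. */
int aer_reports_fatal(uint32_t status, uint32_t mask, uint32_t severity)
{
    return (status & ~mask & severity) != 0;
}
```

Note that "lowest-numbered" is not the same as "first to occur": when several bits are set, the First Error Pointer in the Advanced Error Capabilities and Control register identifies which error the Header Log belongs to.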
SR-IOV Capability (Extended Cap ID 0x0010)
Offset Register Size
────── ──────── ────
+0x00 Extended Cap Header 4B
+0x04 SR-IOV Capabilities 4B
+0x08 SR-IOV Control 2B
+0x0A SR-IOV Status 2B
+0x0C InitialVFs 2B
+0x0E TotalVFs 2B
+0x10 NumVFs 2B
+0x12 Function Dependency Link 1B
+0x14 First VF Offset 2B
+0x16 VF Stride 2B
+0x18 Reserved 2B
+0x1A VF Device ID 2B
+0x1C Supported Page Sizes 4B
+0x20 System Page Size 4B
+0x24 VF BAR0 4B
+0x28 VF BAR1 4B
+0x2C VF BAR2 4B
+0x30 VF BAR3 4B
+0x34 VF BAR4 4B
+0x38 VF BAR5 4B
+0x3C VF Migration State Array Ofs 4B
SR-IOV Control (+0x08):
Bit 0: VF Enable (set to create VFs)
Bit 1: VF Migration Enable
Bit 3: VF MSE (Memory Space Enable for all VFs)
Bit 4: ARI Capable Hierarchy
NumVFs (+0x10):
Number of VFs to create. Must be <= TotalVFs.
Write this before setting VF Enable.
First VF Offset (+0x14):
Routing ID offset from PF to first VF.
VF0 RID = PF RID + First VF Offset.
VF Stride (+0x16):
Routing ID stride between consecutive VFs.
VF[n] RID = PF RID + First VF Offset + n * VF Stride.
VF BARs (+0x24-0x38):
Define the BAR apertures for each VF.
All VFs share the same BAR sizes (decoded from VF BARs).
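Consecutive VFs are spaced one VF BAR size apart:
VF[n] BARx address = VF BARx base + n * VF_BARx_size.
The total aperture behind each VF BAR is VF_BARx_size * NumVFs.
System software programs the VF BARs before setting VF Enable;
each VF gets its own slice of the total VF BAR space.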
Example: Intel E810 NIC
TotalVFs = 256 per PF
VF BAR0 = 16 KB per VF (NIC queue doorbell registers)
VF BAR3 = 16 KB per VF (NIC queue memory)
With 64 VFs enabled: 64 * 16 KB = 1 MB per BAR consumed.
Each VF appears as a separate PCIe function with its own
config space, BARs, and MSI-X capability. The VF has no
SR-IOV capability itself — only the PF does.
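The Routing ID and BAR formulas above are simple enough to check numerically. In this sketch a Routing ID is the 16-bit value (bus << 8) | (device << 3) | function; the function names and the sample offset/stride values in the usage note are illustrative, since real values come from the device's SR-IOV capability.

```c
#include <stdint.h>

/* Routing ID of VF n (n counts from 0). The addition may carry into
 * the bus number, which is why VFs can land on a higher bus than the PF. */
uint16_t vf_routing_id(uint16_t pf_rid, uint16_t first_vf_offset,
                       uint16_t vf_stride, unsigned n)
{
    return (uint16_t)(pf_rid + first_vf_offset + n * vf_stride);
}

/* Address of VF n's slice of a VF BAR aperture. */
uint64_t vf_bar_addr(uint64_t vf_bar_base, uint64_t vf_bar_size, unsigned n)
{
    return vf_bar_base + (uint64_t)n * vf_bar_size;
}

/* Total MMIO consumed behind one VF BAR for num_vfs enabled VFs. */
uint64_t vf_bar_total(uint64_t vf_bar_size, unsigned num_vfs)
{
    return vf_bar_size * num_vfs;
}
```

For a PF at 03:00.0 (Routing ID 0x0300) with a hypothetical First VF Offset of 0x10 and VF Stride of 1, VF0 is 0x0310 (03:02.0) and VF5 is 0x0315 (03:02.5). With 16 KB VF BARs and 64 VFs, vf_bar_total gives 1 MB, matching the E810 arithmetic above.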
ACS — Access Control Services (Extended Cap ID 0x000D)
ACS is critical for secure SR-IOV and VFIO passthrough.
ACS Control bits:
Bit 0: ACS Source Validation — reject TLPs from wrong source
Bit 1: ACS Translation Blocking — prevent ATS translation bypass
Bit 2: ACS P2P Request Redirect — redirect peer-to-peer requests
through the IOMMU instead of direct switch forwarding
Bit 3: ACS P2P Completion Redirect — same for completions
Bit 4: ACS Upstream Forwarding — force upstream forwarding
Bit 5: ACS P2P Egress Control — enable egress control vector
Bit 6: ACS Direct Translated P2P
Why ACS matters for passthrough:
Without ACS, two devices behind the same switch could DMA
directly to each other via peer-to-peer, bypassing the IOMMU.
A malicious VF assigned to one VM could DMA to a VF assigned
to another VM through the switch, violating isolation.
ACS forces all transactions upstream through the IOMMU,
ensuring DMA isolation even for devices behind switches.
VFIO checks for ACS support and configuration when determining
IOMMU groups. Devices behind switches without ACS end up in
the same IOMMU group, preventing independent assignment.
Other Important Capabilities
ATS — Address Translation Services (Extended Cap ID 0x000F):
Device-side TLB for IOMMU translations.
Device caches IOVA→PA translations from the IOMMU.
ATS-capable devices send Translation Requests upstream.
Reduces IOMMU lookup overhead for frequently accessed pages.
The IOMMU can invalidate device TLB entries.
PASID — Process Address Space ID (Extended Cap ID 0x001B):
Allows a device to tag DMA requests with a PASID,
identifying which process's address space to use.
Enables Shared Virtual Addressing (SVA) — device DMAs
using the same virtual addresses as the CPU process.
Requires IOMMU support (Intel PASID, ARM Substream ID).
Resizable BAR (Extended Cap ID 0x0015):
Allows dynamic BAR size changes. Used by GPUs to expose
their full VRAM (e.g., 16 GB) instead of the default
256 MB window. Requires OS and firmware support.
NVIDIA/AMD GPUs use this for "ReBAR" / "Smart Access Memory".
LTR — Latency Tolerance Reporting (Extended Cap ID 0x0018):
Device reports its latency tolerance to the platform.
Enables aggressive power management — if all devices
tolerate high latency, deeper sleep states can be used.
DPC — Downstream Port Containment (Extended Cap ID 0x001D):
Automatically disables a PCIe link when a fatal error
is detected on the downstream port. Prevents error
propagation to the rest of the hierarchy. The OS can
then attempt recovery or hot-removal.
PTM — Precision Time Measurement (Extended Cap ID 0x001F):
Distributes precise time from the Root Complex to endpoints.
Enables timestamp synchronization between devices.
Used for audio/video synchronization, IEEE 1588 PTP.
9. Interrupt Mechanisms
Legacy INTx Interrupts
PCIe preserves PCI's INTx interrupt model but replaces physical
interrupt wires with in-band messages.
Physical PCI: Four shared interrupt lines (INTA#-INTD#).
Level-triggered, active-low. Multiple devices share lines.
PCIe: INTx messages carried as TLPs:
Assert_INTA (message code 0x20)
Assert_INTB (message code 0x21)
Assert_INTC (message code 0x22)
Assert_INTD (message code 0x23)
Deassert_INTA (message code 0x24)
Deassert_INTB (message code 0x25)
Deassert_INTC (message code 0x26)
Deassert_INTD (message code 0x27)
Behavior:
1. Device asserts virtual interrupt by sending Assert_INTx
2. Root Complex converts this to physical IOAPIC input or
internal interrupt routing
3. CPU receives interrupt, handler reads device status
4. Handler clears interrupt source in device
5. Device sends Deassert_INTx
Problems with INTx:
- Shared interrupts: handler must check if this device
caused the interrupt → overhead
- Two messages per interrupt (assert + deassert) → overhead
- Level-triggered semantics: the handler must read and clear the
device's interrupt status (an MMIO round trip) before the
interrupt deasserts
- No per-device targeting — all go through IOAPIC routing
INTx Disable: Bit 10 of Command register (0x04).
When set, device must not send INTx messages.
Typically set by the OS when MSI or MSI-X is enabled (a function
must not use INTx while either is enabled).
MSI (Message Signaled Interrupts)
MSI replaces INTx with memory writes. The device writes a
specific data value to a specific memory address, which the
interrupt controller (LAPIC) interprets as an interrupt.
MSI Capability (ID 0x05):
Without per-vector masking, 32-bit:
+0x00: [Next Ptr | Cap ID=0x05] 2B
+0x02: Message Control 2B
+0x04: Message Address (lower 32) 4B
+0x08: Message Data 2B
With 64-bit addressing:
+0x00: [Next Ptr | Cap ID=0x05] 2B
+0x02: Message Control 2B
+0x04: Message Address (lower 32) 4B
+0x08: Message Address (upper 32) 4B
+0x0C: Message Data 2B
With per-vector masking and 64-bit:
+0x00: [Next Ptr | Cap ID=0x05] 2B
+0x02: Message Control 2B
+0x04: Message Address (lower 32) 4B
+0x08: Message Address (upper 32) 4B
+0x0C: Message Data 2B
+0x10: Mask Bits 4B
+0x14: Pending Bits 4B
Message Control (+0x02):
Bit 0: MSI Enable
Bits 3:1: Multiple Message Capable (log2: 000=1, 001=2,
010=4, 011=8, 100=16, 101=32 vectors)
Bits 6:4: Multiple Message Enable (same encoding, <=Capable)
Bit 7: 64-bit Address Capable
Bit 8: Per-Vector Masking Capable
Message Address (x86 LAPIC format):
Bits 31:20 = 0xFEE (fixed prefix for LAPIC)
Bits 19:12 = Destination APIC ID
Bit 3 = Redirection Hint (RH)
Bit 2 = Destination Mode (DM: 0=physical, 1=logical)
Example: Target APIC ID 0x02, physical mode:
Address = 0xFEE02000
Message Data (x86 format):
Bits 7:0 = Vector number (0-255)
Bits 10:8 = Delivery Mode (000=Fixed, 001=LowPri, etc.)
Bit 14 = Level (ignored for edge-triggered; 1=Assert for level-triggered)
Bit 15 = Trigger Mode (0=Edge)
Example: Vector 0x41, fixed delivery, edge:
Data = 0x0041
Multiple MSI vectors:
When Multiple Message Enable > 0, the device uses the low
bits of Message Data as a sub-vector selector.
E.g., with 4 vectors enabled, vectors are (Data & ~0x3) | {0,1,2,3}.
All vectors share the same address (same target CPU).
MSI operation:
1. OS writes Message Address and Message Data to capability
2. OS sets MSI Enable in Message Control
3. OS disables INTx (sets bit 10 in Command register)
4. When device needs to signal interrupt:
- Device issues a Memory Write TLP:
Address = Message Address
Data = Message Data (with vector bits set)
- This write targets the LAPIC MMIO region
- LAPIC delivers the interrupt to the CPU
5. No ack needed — MSI is edge-triggered by write
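The address and data layouts above can be composed with a few shifts. This sketch follows the x86 xAPIC format described in this section (8-bit destination ID, no interrupt remapping); the function names are invented for the example.

```c
#include <stdint.h>

/* Message Address for a given destination APIC ID. */
uint32_t msi_addr_x86(uint8_t dest_apic_id, int redir_hint, int logical_mode)
{
    return 0xFEE00000u |
           ((uint32_t)dest_apic_id << 12) |
           (redir_hint   ? (1u << 3) : 0u) |
           (logical_mode ? (1u << 2) : 0u);
}

/* Message Data for an edge-triggered interrupt (bits 15:14 left at 0). */
uint16_t msi_data_x86(uint8_t vector, unsigned delivery_mode)
{
    return (uint16_t)(vector | ((delivery_mode & 0x7u) << 8));
}

/* Data value the device actually writes for sub-vector i when
 * 2^mme_code vectors are enabled: the low mme_code bits are replaced. */
uint16_t msi_data_for(uint16_t base_data, unsigned mme_code, unsigned i)
{
    uint16_t low_mask = (uint16_t)((1u << mme_code) - 1u);
    return (uint16_t)((base_data & ~low_mask) | (i & low_mask));
}
```

msi_addr_x86(0x02, 0, 0) reproduces the 0xFEE02000 example and msi_data_x86(0x41, 0) reproduces 0x0041. Because the device overwrites the low bits, the OS must allocate a block of vectors aligned to the enabled count; that constraint is one reason multi-vector MSI is rarely used on x86.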
MSI-X (Extended Message Signaled Interrupts)
MSI-X provides per-vector configuration, more vectors, and
per-vector masking. Preferred over MSI for modern devices.
MSI-X Capability (ID 0x11):
+0x00: [Next Ptr | Cap ID=0x11] 2B
+0x02: Message Control 2B
+0x04: Table Offset / Table BIR 4B
+0x08: PBA Offset / PBA BIR 4B
Message Control (+0x02):
Bits 10:0: Table Size (N-1 encoded; max 2047 → 2048 vectors)
Bit 14: Function Mask (mask all vectors at once)
Bit 15: MSI-X Enable
Table Offset/BIR (+0x04):
Bits 2:0: Table BIR (BAR Indicator Register — which BAR
contains the MSI-X table)
Bits 31:3: Table Offset (byte offset within the BAR,
8-byte aligned)
PBA Offset/BIR (+0x08):
Bits 2:0: PBA BIR (which BAR contains the Pending Bit Array)
Bits 31:3: PBA Offset (byte offset within the BAR,
8-byte aligned)
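The three MSI-X capability registers decode with a few masks and shifts. The following is an illustrative sketch; `struct msix_cap_info` and `msix_decode` are names invented for this example.

```c
#include <assert.h>
#include <stdint.h>

struct msix_cap_info {
    unsigned table_size;    /* number of vectors (encoded field + 1) */
    int      enabled;       /* Message Control bit 15 */
    int      function_mask; /* Message Control bit 14 */
    unsigned table_bir;     /* BAR index holding the table */
    uint32_t table_offset;  /* byte offset within that BAR */
    unsigned pba_bir;       /* BAR index holding the PBA */
    uint32_t pba_offset;    /* byte offset within that BAR */
};

static inline struct msix_cap_info
msix_decode(uint16_t msg_ctrl, uint32_t table_reg, uint32_t pba_reg)
{
    struct msix_cap_info info;

    info.table_size    = (msg_ctrl & 0x07FFu) + 1u;
    info.function_mask = (msg_ctrl >> 14) & 1;
    info.enabled       = (msg_ctrl >> 15) & 1;
    info.table_bir     = table_reg & 0x7u;
    info.table_offset  = table_reg & ~0x7u;
    info.pba_bir       = pba_reg & 0x7u;
    info.pba_offset    = pba_reg & ~0x7u;
    /* BIR values 6 and 7 are reserved. */
    assert(info.table_bir <= 5 && info.pba_bir <= 5);
    return info;
}
```

For example, a Message Control value of 0x8080 decodes to 129 vectors with MSI-X enabled and no function mask.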
MSI-X Table Entry (16 bytes per vector, in BAR space):
Byte Offset Field
─────────── ─────
+0x00 Message Address Low (32 bits)
+0x04 Message Address High (32 bits)
+0x08 Message Data (32 bits)
+0x0C Vector Control (32 bits)
Bit 0: Mask Bit (1=masked, 0=unmasked)
Bits 31:1: Reserved
Each entry is independently programmable:
- Different address (different target CPU)
- Different data (different vector number)
- Independent mask
PBA (Pending Bit Array):
One bit per vector. If the vector is masked and the device
wants to signal it, the corresponding PBA bit is set.
When the vector is unmasked, if PBA bit is set, the
interrupt is delivered immediately.
PBA is read-only from software's perspective.
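Locating a table field or a pending bit is simple arithmetic on the vector number. The sketch below assumes the offsets were already extracted from the capability; all helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MSIX_ENTRY_SIZE 16u

/* Byte offset, within the table BAR, of one field of entry `vector`.
 * field_off is 0x0 (addr low), 0x4 (addr high), 0x8 (data) or 0xC (control). */
static inline uint64_t msix_entry_offset(uint32_t table_offset,
                                         unsigned vector, unsigned field_off)
{
    assert(field_off < MSIX_ENTRY_SIZE && (field_off & 3u) == 0);
    return (uint64_t)table_offset + (uint64_t)vector * MSIX_ENTRY_SIZE + field_off;
}

/* Byte offset, within the PBA BAR, of the 64-bit word holding the pending bit. */
static inline uint64_t msix_pba_qword_offset(uint32_t pba_offset, unsigned vector)
{
    return (uint64_t)pba_offset + (uint64_t)(vector / 64u) * 8u;
}

/* Mask selecting the vector's bit inside that 64-bit word. */
static inline uint64_t msix_pba_bit_mask(unsigned vector)
{
    return 1ULL << (vector % 64u);
}

/* Test a vector against a software snapshot of the PBA. */
static inline int msix_is_pending(const uint64_t *pba, unsigned vector)
{
    return (pba[vector / 64u] & msix_pba_bit_mask(vector)) != 0;
}
```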
MSI-X operation:
1. OS reads Table Size from Message Control
2. OS maps the BAR containing the MSI-X table
3. For each vector:
- Write Message Address (LAPIC target)
- Write Message Data (vector number)
- Clear Mask Bit in Vector Control
4. Set MSI-X Enable in Message Control
5. Disable INTx in Command register
6. When device signals vector N:
- Device reads table entry N
- Issues Memory Write TLP with address/data from entry
- If entry is masked, sets PBA bit N instead
Example MSI-X table for NVMe (4 queues + admin):
Vector 0: Admin completion queue → CPU 0, vector 0x30
Vector 1: I/O queue 1 completion → CPU 1, vector 0x31
Vector 2: I/O queue 2 completion → CPU 2, vector 0x32
Vector 3: I/O queue 3 completion → CPU 3, vector 0x33
Vector 4: I/O queue 4 completion → CPU 4, vector 0x34
Each I/O queue's completion interrupt goes directly to
the CPU processing that queue — no interrupt routing overhead.
Interrupt Delivery Path
Device → Root Complex → Interrupt Controller → CPU
For MSI/MSI-X:
Device writes to LAPIC address → RC decodes the write
as a local interrupt delivery (address in 0xFEExxxxx range)
→ LAPIC on target CPU receives the interrupt → CPU
vectors to ISR.
For INTx:
Device sends Assert_INTx message → RC routes to IOAPIC
input → IOAPIC sends interrupt to target LAPIC →
CPU vectors to ISR → ISR services device and clears
interrupt → Device sends Deassert_INTx.
In virtualization (with interrupt remapping):
VT-d Interrupt Remapping:
Device write to LAPIC address intercepted by IOMMU.
IOMMU looks up the Interrupt Remapping Table Entry (IRTE)
using the interrupt index (from address/data fields).
IRTE specifies destination VCPU and vector.
IOMMU translates and delivers to the correct VCPU.
Posted Interrupts (VT-d PI):
IOMMU can post the interrupt directly to the VCPU's
Posted Interrupt Descriptor (PID) in memory.
If VCPU is running, interrupt delivered without VM exit.
If VCPU is not running, notification event triggers
the VMM to schedule the VCPU.
Eliminates VM exit overhead for interrupt delivery.
10. Enumeration and Resource Assignment
Bus Enumeration Algorithm
Firmware (BIOS/UEFI) or the OS enumerates PCIe devices using a depth-first search (DFS):
Algorithm: enumerate(bus_number)
bus_available = bus_number + 1
for dev = 0 to 31:
for func = 0 to 7:
vendor_id = config_read16(bus_number, dev, func, 0x00)
if vendor_id == 0xFFFF:
if func == 0: break // no device at this dev#
continue // no function at this func#
header_type = config_read8(bus_number, dev, func, 0x0E)
if (header_type & 0x7F) == 0x01:
// This is a bridge — recurse
// Set primary bus = bus_number
config_write8(bus_number, dev, func, 0x18, bus_number)
// Set secondary bus = bus_available
config_write8(bus_number, dev, func, 0x19, bus_available)
// Temporarily set subordinate to max
config_write8(bus_number, dev, func, 0x1A, 0xFF)
// Recurse into the secondary bus
bus_available = enumerate(bus_available)
// Now set subordinate to the actual max bus found
config_write8(bus_number, dev, func, 0x1A, bus_available - 1)
else:
// Endpoint — size BARs, assign resources
assign_bars(bus_number, dev, func)
assign_interrupts(bus_number, dev, func)
if func == 0 && !(header_type & 0x80):
break // Single-function device, skip func 1-7
return bus_available
// Start enumeration from bus 0
enumerate(0)
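To see the bus-number bookkeeping in action, the pseudocode can be run against a small in-memory topology. This is a sketch under simplifying assumptions: the `topo` table and `find_fn` lookup stand in for real config space accesses, every mock device is single-function, and BAR and interrupt assignment are omitted.

```c
#include <assert.h>
#include <stdint.h>

#define NO_PARENT (-1)

struct mock_fn {
    int     parent;       /* index of the upstream bridge, NO_PARENT = bus 0 */
    uint8_t dev, func;
    int     is_bridge;
    uint8_t primary, secondary, subordinate;   /* bridge bus registers */
};

/* Hypothetical topology: two bridges on bus 0, one with a nested bridge. */
static struct mock_fn topo[] = {
    { NO_PARENT, 0, 0, 0, 0, 0, 0 },   /* [0] endpoint on bus 0 */
    { NO_PARENT, 1, 0, 1, 0, 0, 0 },   /* [1] bridge on bus 0 */
    { 1,         0, 0, 1, 0, 0, 0 },   /* [2] bridge behind [1] */
    { 2,         0, 0, 0, 0, 0, 0 },   /* [3] endpoint behind [2] */
    { NO_PARENT, 2, 0, 1, 0, 0, 0 },   /* [4] bridge on bus 0 */
    { 4,         0, 0, 0, 0, 0, 0 },   /* [5] endpoint behind [4] */
};
#define TOPO_LEN ((int)(sizeof(topo) / sizeof(topo[0])))

/* Stand-in for a config read: find the function visible at bus:dev.func. */
static struct mock_fn *find_fn(uint8_t bus, uint8_t dev, uint8_t func)
{
    for (int i = 0; i < TOPO_LEN; i++) {
        struct mock_fn *f = &topo[i];
        uint8_t on_bus = 0;
        if (f->parent != NO_PARENT) {
            on_bus = topo[f->parent].secondary;
            if (on_bus == 0)
                continue;   /* upstream bridge not programmed yet */
        }
        if (on_bus == bus && f->dev == dev && f->func == func)
            return f;
    }
    return 0;
}

/* Returns the next unused bus number after scanning `bus` and below. */
static int enumerate(int bus)
{
    int next_bus = bus + 1;
    for (int dev = 0; dev < 32; dev++) {
        struct mock_fn *f = find_fn((uint8_t)bus, (uint8_t)dev, 0);
        if (!f)
            continue;       /* no device in this slot */
        if (f->is_bridge) {
            assert(next_bus <= 0xFF);           /* bus numbers are 8-bit */
            f->primary     = (uint8_t)bus;
            f->secondary   = (uint8_t)next_bus;
            f->subordinate = 0xFF;              /* temporary: open the window */
            next_bus       = enumerate(next_bus);
            f->subordinate = (uint8_t)(next_bus - 1);
        }
    }
    return next_bus;
}
```

Calling `enumerate(0)` on this topology assigns secondary buses 1, 2 and 3 in depth-first order, and each bridge's subordinate number ends up covering exactly the buses found beneath it.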
BAR Assignment Strategy
Firmware/OS assigns BARs after sizing:
1. Size all BARs for all devices
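2. Sort BARs by size (largest first), so natural alignment wastes the least address space
3. Allocate from the MMIO address pool:
- 32-bit BARs: allocate from below 4 GB
- 64-bit BARs: can allocate from above 4 GB (behind a bridge, only if prefetchable, because the non-prefetchable bridge window is 32-bit)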
- I/O BARs: allocate from the I/O port range (limited!)
4. Program bridge windows to encompass all downstream BARs
MMIO pool typically:
Below 4 GB: varies by platform, often 0xC000_0000-0xFEFF_FFFF
Above 4 GB: large pool, platform-specific
Address alignment requirement:
BAR address must be naturally aligned to its size.
A 1 MB BAR must start at a 1 MB boundary.
This is enforced by the BAR's hardwired low bits.
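The largest-first strategy and the natural-alignment rule fit in a short allocator. This is a simplified sketch that ignores bridge windows and the 32-bit/64-bit split; `struct bar_req` and `assign_bars` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct bar_req {
    uint64_t size;   /* power of two */
    uint64_t addr;   /* filled in by assign_bars() */
};

/* qsort comparator: larger BARs first. */
static int cmp_size_desc(const void *a, const void *b)
{
    uint64_t sa = ((const struct bar_req *)a)->size;
    uint64_t sb = ((const struct bar_req *)b)->size;
    return (sa < sb) - (sa > sb);
}

static inline uint64_t align_up(uint64_t addr, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr + align - 1) & ~(align - 1);
}

/* Assign addresses from [pool_start, pool_end). Returns the first free
 * address after the last BAR, or 0 if the pool is too small. */
static uint64_t assign_bars(struct bar_req *bars, size_t n,
                            uint64_t pool_start, uint64_t pool_end)
{
    uint64_t cursor = pool_start;

    qsort(bars, n, sizeof(bars[0]), cmp_size_desc);
    for (size_t i = 0; i < n; i++) {
        uint64_t addr = align_up(cursor, bars[i].size);
        if (addr + bars[i].size > pool_end)
            return 0;
        bars[i].addr = addr;
        cursor = addr + bars[i].size;
    }
    return cursor;
}
```

Because sizes are powers of two and are placed in descending order, each BAR lands on an address that is already aligned for it, so no padding is needed between allocations when the pool start is itself aligned to the largest BAR.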
Linux kernel functions:
pci_scan_bus() — trigger enumeration
pci_read_config_*() — read config space
pci_assign_resource() — assign BAR addresses
pci_enable_device() — enable memory/IO access in Command reg
pci_set_master() — enable Bus Master in Command reg
ACPI Tables for PCIe
MCFG (Memory-mapped Configuration Table):
Provides ECAM base addresses for each PCI segment group.
struct acpi_mcfg_allocation {
uint64_t base_address; // ECAM base physical address
uint16_t segment_group; // PCI segment group number
uint8_t start_bus; // first bus number
uint8_t end_bus; // last bus number
uint32_t reserved;
};
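Given one MCFG allocation entry, the ECAM address of a register follows from the bus, device, function and offset. The sketch below assumes the PCI Firmware Specification convention that the base address corresponds to bus 0 of the segment even when start_bus is nonzero; `struct mcfg_alloc` and `ecam_address` are illustrative names.

```c
#include <assert.h>
#include <stdint.h>

struct mcfg_alloc {           /* mirrors the MCFG fields shown above */
    uint64_t base_address;    /* ECAM base, corresponding to bus 0 */
    uint16_t segment_group;
    uint8_t  start_bus;
    uint8_t  end_bus;
};

/* Returns the physical address of a config register, or 0 if the bus
 * lies outside the range decoded by this allocation. */
static inline uint64_t ecam_address(const struct mcfg_alloc *m, uint8_t bus,
                                    uint8_t dev, uint8_t func, uint16_t reg)
{
    assert(dev < 32 && func < 8 && reg < 4096);
    if (bus < m->start_bus || bus > m->end_bus)
        return 0;
    /* 1 MB per bus, 32 KB per device, 4 KB per function. */
    return m->base_address
         + ((uint64_t)bus  << 20)
         + ((uint64_t)dev  << 15)
         + ((uint64_t)func << 12)
         + reg;
}
```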
_DSM (Device Specific Method):
ACPI method for device-specific operations.
Used for PCIe features like hotplug, power management.
_OSC (Operating System Capabilities):
Firmware and OS negotiate control of PCIe features:
- Native PCIe hotplug control
- SHPC (Standard Hot-Plug Controller)
- PCIe native power management events
- PCIe AER control
- PCIe Capability Structure control
_HPP (Hot-Plug Parameters):
Provides default PCIe settings for hot-plugged devices.
11. PCIe in the Linux Kernel
Key Source Files
drivers/pci/
├── pci.c — Core PCI functions
├── pci-driver.c — Driver model (probe/remove)
├── probe.c — Device enumeration
├── setup-bus.c — Resource assignment
├── msi/ — MSI/MSI-X subsystem
│ ├── msi.c
│ └── irqdomain.c
├── pcie/ — PCIe-specific
│ ├── portdrv.c — Port service driver
│ ├── aer.c — Advanced Error Reporting
│ ├── dpc.c — Downstream Port Containment
│ ├── pme.c — Power Management Events
│ └── aspm.c — ASPM control
├── hotplug/ — Hotplug drivers
│ ├── pciehp_core.c — PCIe native hotplug
│ └── acpiphp_core.c — ACPI-based hotplug
├── iov.c — SR-IOV support
├── ecam.c — ECAM config space access
├── access.c — Config space read/write
└── controller/ — Host bridge drivers (per-platform; formerly host/)
include/linux/pci.h — struct pci_dev, pci_driver, APIs
include/uapi/linux/pci_regs.h — All PCI register offset #defines
Key Data Structures
struct pci_dev {
struct list_head bus_list; // linked into pci_bus->devices
struct pci_bus *bus; // bus this device is on
struct pci_bus *subordinate; // bus behind this bridge (if bridge)
unsigned int devfn; // encoded device + function
unsigned short vendor; // from config space
unsigned short device;
unsigned short subsystem_vendor;
unsigned short subsystem_device;
unsigned int class; // class code (24 bits)
u8 revision;
u8 hdr_type; // 0=endpoint, 1=bridge
u16 pcie_cap; // offset of PCIe capability
struct resource resource[PCI_NUM_RESOURCES]; // BAR resources
// resource[0..5] = standard BARs
// resource[6] = expansion ROM
// resource[7..N] = bridge windows, etc.
unsigned int irq; // IRQ number (legacy)
bool msi_enabled;
bool msix_enabled;
struct pci_driver *driver; // bound driver
struct device dev; // embedded generic device; pci_set_drvdata() stores the driver's private pointer here
u8 pm_cap; // offset of PM capability
unsigned int d3hot_delay; // D3hot->D0 transition delay
pci_power_t current_state; // D0, D1, D2, D3hot, D3cold
// SR-IOV
struct pci_sriov *sriov; // SR-IOV capability info (PF only; holds InitialVFs, TotalVFs, NumVFs)
// Error handling
struct aer_stats *aer_stats;
// DMA
u64 dma_mask; // DMA address mask
};
struct pci_driver {
const char *name;
const struct pci_device_id *id_table; // match table
int (*probe)(struct pci_dev *dev, const struct pci_device_id *id);
void (*remove)(struct pci_dev *dev);
int (*suspend)(struct pci_dev *dev, pm_message_t state);
int (*resume)(struct pci_dev *dev);
void (*shutdown)(struct pci_dev *dev);
int (*sriov_configure)(struct pci_dev *dev, int num_vfs);
const struct pci_error_handlers *err_handler;
struct device_driver driver; // embedded generic driver
};
struct pci_device_id {
__u32 vendor, device; // PCI_ANY_ID = match any
__u32 subvendor, subdevice;
__u32 class, class_mask;
kernel_ulong_t driver_data;
};
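The match rule behind id_table is small enough to model in userspace. The sketch below mirrors the kernel's wildcard and class-mask comparison as understood here; the types and names (`struct id_entry`, `id_matches`, `match_table`) are stand-ins, not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ANY_ID 0xFFFFFFFFu   /* stands in for the kernel's PCI_ANY_ID */

struct id_entry {
    uint32_t vendor, device, subvendor, subdevice;
    uint32_t class_code, class_mask;
};

struct dev_ids {
    uint16_t vendor, device, subvendor, subdevice;
    uint32_t class_code;     /* 24-bit class/subclass/prog-if */
};

/* An ID field matches if it is the wildcard or equal; the class matches
 * if all bits selected by class_mask are equal. */
static inline int id_matches(const struct id_entry *id, const struct dev_ids *d)
{
    return (id->vendor    == ANY_ID || id->vendor    == d->vendor)
        && (id->device    == ANY_ID || id->device    == d->device)
        && (id->subvendor == ANY_ID || id->subvendor == d->subvendor)
        && (id->subdevice == ANY_ID || id->subdevice == d->subdevice)
        && !((id->class_code ^ d->class_code) & id->class_mask);
}

/* Walk a table terminated by an all-zero entry; return the first match. */
static inline const struct id_entry *
match_table(const struct id_entry *tbl, const struct dev_ids *d)
{
    assert(tbl && d);
    for (; tbl->vendor || tbl->subvendor || tbl->class_mask; tbl++)
        if (id_matches(tbl, d))
            return tbl;
    return NULL;
}
```

A class-only entry (all IDs wildcarded, class_mask 0xFFFFFF) is how a generic driver can claim every device of one class, for example every NVMe controller with class code 0x010802.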
Writing a PCI Driver — Skeleton
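#include <linux/module.h>
#include <linux/pci.h>
#include <linux/interrupt.h>
#include <linux/dma-mapping.h> // dma_set_mask_and_coherent()
#include <linux/slab.h> // kzalloc(), kfree()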
#define MY_VENDOR_ID 0x1234
#define MY_DEVICE_ID 0x5678
struct my_device {
struct pci_dev *pdev;
void __iomem *bar0; // mapped BAR0
void __iomem *bar2; // mapped BAR2
int irq_count;
};
static const struct pci_device_id my_pci_ids[] = {
{ PCI_DEVICE(MY_VENDOR_ID, MY_DEVICE_ID) },
{ 0, } // terminator
};
MODULE_DEVICE_TABLE(pci, my_pci_ids);
static irqreturn_t my_irq_handler(int irq, void *data)
{
struct my_device *mydev = data;
u32 status = readl(mydev->bar0 + 0x08); // read ISR register
if (!(status & 0x1))
return IRQ_NONE; // not our interrupt
writel(status, mydev->bar0 + 0x08); // clear interrupt
mydev->irq_count++;
return IRQ_HANDLED;
}
static int my_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
struct my_device *mydev;
int err;
// 1. Allocate driver private data
mydev = kzalloc(sizeof(*mydev), GFP_KERNEL);
if (!mydev)
return -ENOMEM;
mydev->pdev = pdev;
pci_set_drvdata(pdev, mydev);
// 2. Enable the PCI device (wake it if suspended, enable I/O and memory decoding)
err = pci_enable_device(pdev);
if (err)
goto err_free;
// 3. Request MMIO regions (prevents conflicts)
err = pci_request_regions(pdev, "my_driver");
if (err)
goto err_disable;
// 4. Map BAR0 into kernel virtual address space
mydev->bar0 = pci_iomap(pdev, 0, 0); // BAR 0, map entire BAR
if (!mydev->bar0) {
err = -ENOMEM;
goto err_release;
}
// 5. Enable bus mastering (for DMA)
pci_set_master(pdev);
// 6. Set DMA mask
err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
if (err) {
err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
if (err)
goto err_unmap;
}
// 7. Allocate MSI-X vectors (or fall back to MSI)
err = pci_alloc_irq_vectors(pdev, 1, 16,
PCI_IRQ_MSIX | PCI_IRQ_MSI);
if (err < 0)
goto err_unmap;
// 8. Request IRQ for vector 0
err = request_irq(pci_irq_vector(pdev, 0), my_irq_handler,
0, "my_driver", mydev);
if (err)
goto err_free_irq_vectors;
// 9. Device-specific initialization
writel(0x01, mydev->bar0 + 0x00); // enable device
readl(mydev->bar0 + 0x00); // flush posted write
dev_info(&pdev->dev, "device initialized\n");
return 0;
err_free_irq_vectors:
pci_free_irq_vectors(pdev);
err_unmap:
pci_iounmap(pdev, mydev->bar0);
err_release:
pci_release_regions(pdev);
err_disable:
pci_disable_device(pdev);
err_free:
kfree(mydev);
return err;
}
static void my_remove(struct pci_dev *pdev)
{
struct my_device *mydev = pci_get_drvdata(pdev);
// 1. Disable device-level interrupts
writel(0x00, mydev->bar0 + 0x04); // disable interrupts
readl(mydev->bar0 + 0x04); // flush
// 2. Free IRQ
free_irq(pci_irq_vector(pdev, 0), mydev);
// 3. Free MSI/MSI-X vectors
pci_free_irq_vectors(pdev);
// 4. Unmap BARs
pci_iounmap(pdev, mydev->bar0);
// 5. Release regions
pci_release_regions(pdev);
// 6. Disable PCI device
pci_disable_device(pdev);
// 7. Free private data
kfree(mydev);
}
static struct pci_driver my_pci_driver = {
.name = "my_driver",
.id_table = my_pci_ids,
.probe = my_probe,
.remove = my_remove,
};
module_pci_driver(my_pci_driver);
MODULE_LICENSE("GPL");
DMA Mapping APIs
// Coherent (consistent) DMA mapping — kernel manages cache coherency
// Use for long-lived structures (descriptor rings, command queues)
void *virt = dma_alloc_coherent(&pdev->dev, size, &dma_handle, GFP_KERNEL);
// virt: kernel virtual address
// dma_handle: DMA address the device should use
// CPU writes become visible to the device without cache maintenance calls,
// but ordering still needs barriers (e.g. dma_wmb() before ringing a doorbell)
dma_free_coherent(&pdev->dev, size, virt, dma_handle);
// Streaming DMA mapping — for transient buffers (data payloads)
// map/unmap transfer buffer ownership; dma_sync_* calls are needed only
// when a buffer is reused while it stays mapped
dma_addr_t dma = dma_map_single(&pdev->dev, cpu_addr, size,
DMA_TO_DEVICE); // CPU → device
// DMA_FROM_DEVICE); // device → CPU
// DMA_BIDIRECTIONAL);
if (dma_mapping_error(&pdev->dev, dma)) { /* handle error */ }
// After device finishes DMA (before CPU reads the buffer):
dma_sync_single_for_cpu(&pdev->dev, dma, size, DMA_FROM_DEVICE);
// Before giving buffer back to device (after CPU writes):
dma_sync_single_for_device(&pdev->dev, dma, size, DMA_TO_DEVICE);
dma_unmap_single(&pdev->dev, dma, size, direction);
// Scatter-gather DMA (for multiple non-contiguous buffers)
int nents = dma_map_sg(&pdev->dev, sglist, nents_orig, direction);
// Returns 0 on failure; nents may be less than nents_orig (IOMMU coalescing)
if (nents == 0) { /* handle error */ }
for_each_sg(sglist, sg, nents, i) {
dma_addr = sg_dma_address(sg);
dma_len = sg_dma_len(sg);
// Program device DMA descriptor with dma_addr, dma_len
}
dma_unmap_sg(&pdev->dev, sglist, nents_orig, direction);
sysfs Interface
/sys/bus/pci/devices/0000:03:00.0/
├── vendor — 0x8086 (Intel)
├── device — 0x1572 (X710 10GbE NIC)
├── class — 0x020000 (Network controller)
├── subsystem_vendor
├── subsystem_device
├── revision
├── config — Binary access to 4096-byte config space
├── resource — Text: BAR start, end, flags for each BAR
├── resource0 — Binary mmap of BAR0
├── resource0_wc — BAR0 with write-combining (if prefetchable)
├── resource2 — Binary mmap of BAR2
├── rom — Expansion ROM (write 1 to enable, then read)
├── irq — IRQ number
├── local_cpus — CPU affinity mask
├── local_cpulist — CPU affinity list
├── numa_node — NUMA node
├── enable — Enable/disable device
├── remove — Hot-remove device
├── rescan — Rescan for new devices
├── driver/ — Symlink to bound driver
├── driver_override — Force specific driver binding
├── iommu_group/ — Symlink to IOMMU group
├── msi_irqs/ — Directory of assigned MSI/MSI-X IRQs
├── sriov_numvfs — Set number of SR-IOV VFs
├── sriov_totalvfs — Max VFs supported
├── sriov_vf_total_msix — Total MSI-X vectors for VFs
└── power/ — Power management attributes
Interpreting lspci output:
$ lspci -vvv -s 03:00.0
03:00.0 Network controller: Intel Corporation ...
Control: I/O- Mem+ BusMaster+ SpecCycle- ...
Status: Cap+ 66MHz- ... <MAbort- >SERR- <PERR- ...
Latency: 0
Interrupt: pin A routed to IRQ 37
Region 0: Memory at f0000000 (64-bit, non-prefetchable) [size=16M]
Region 3: Memory at f1000000 (64-bit, prefetchable) [size=32K]
Capabilities: [40] Power Management ...
Capabilities: [50] MSI-X: Enable+ Count=129 Masked-
Vector table: BAR=3 offset=00000000
PBA: BAR=3 offset=00001000
Capabilities: [60] Express Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFuncEn 0, ExtTag+ ...
DevCtl: MaxPayload 256 bytes, MaxReadReq 512 bytes
LnkCap: Speed 8GT/s, Width x8, ASPM L1
LnkSta: Speed 8GT/s, Width x8
Capabilities: [100] Advanced Error Reporting
Capabilities: [140] SR-IOV: ...
IOVCap: ...
IOVCtl: Enable- ...
NumVFs 0, TotalVFs 64, ...
VF BAR0: Memory at f2000000 (64-bit, non-pref) [size=16K]
$ setpci -s 03:00.0 COMMAND # Read Command register
0407
$ setpci -s 03:00.0 COMMAND=0007 # Write Command register
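The raw value printed by setpci maps onto the Command register bit positions. As an illustration, the sketch below renders a Command value in a flag style similar to lspci output; the macro and function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CMD_IO_SPACE     (1u << 0)
#define CMD_MEM_SPACE    (1u << 1)
#define CMD_BUS_MASTER   (1u << 2)
#define CMD_PARITY_RESP  (1u << 6)
#define CMD_SERR_ENABLE  (1u << 8)
#define CMD_INTX_DISABLE (1u << 10)

static inline char flag(uint16_t cmd, unsigned bit)
{
    return (cmd & bit) ? '+' : '-';
}

/* Write a flag string for `cmd` into buf; returns the snprintf() result. */
static inline int format_command(uint16_t cmd, char *buf, size_t len)
{
    return snprintf(buf, len,
                    "I/O%c Mem%c BusMaster%c ParErr%c SERR%c DisINTx%c",
                    flag(cmd, CMD_IO_SPACE), flag(cmd, CMD_MEM_SPACE),
                    flag(cmd, CMD_BUS_MASTER), flag(cmd, CMD_PARITY_RESP),
                    flag(cmd, CMD_SERR_ENABLE), flag(cmd, CMD_INTX_DISABLE));
}
```

The value 0x0407 read above therefore means I/O, memory and bus mastering are enabled and INTx is disabled; writing 0x0007 clears only the INTx Disable bit.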
PCIe Error Recovery in Linux
// Error recovery callbacks in struct pci_error_handlers:
static pci_ers_result_t my_error_detected(struct pci_dev *pdev,
pci_channel_state_t state)
{
struct my_device *mydev = pci_get_drvdata(pdev);
if (state == pci_channel_io_perm_failure)
return PCI_ERS_RESULT_DISCONNECT;
// Stop all DMA and I/O
my_stop_dma(mydev);
// If device memory is still accessible:
if (state == pci_channel_io_normal)
return PCI_ERS_RESULT_CAN_RECOVER;
// If device memory is frozen (reads return 0xFFFFFFFF):
return PCI_ERS_RESULT_NEED_RESET;
}
static pci_ers_result_t my_slot_reset(struct pci_dev *pdev)
{
struct my_device *mydev = pci_get_drvdata(pdev);
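// Re-enable the device after reset
if (pci_enable_device(pdev))
return PCI_ERS_RESULT_DISCONNECT;
pci_set_master(pdev);
// Restore config space saved earlier with pci_save_state() (e.g. in probe)
pci_restore_state(pdev);
// Re-initialize device hardware
my_hw_init(mydev);
return PCI_ERS_RESULT_RECOVERED;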
}
static void my_resume(struct pci_dev *pdev)
{
// Resume normal operations
struct my_device *mydev = pci_get_drvdata(pdev);
my_resume_io(mydev);
}
static const struct pci_error_handlers my_err_handlers = {
.error_detected = my_error_detected,
.slot_reset = my_slot_reset,
.resume = my_resume,
};
// In pci_driver:
static struct pci_driver my_driver = {
...
.err_handler = &my_err_handlers,
};
12. PCIe Device Emulation for VMMs
Architecture Overview
When building a VMM (like QEMU, Cloud Hypervisor, Firecracker, or crosvm), you typically emulate PCI/PCIe devices for the guest. The main pieces are:
┌──────────────────────────────────────────────────────────┐
│ Guest VM │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Guest OS │ │
│ │ ├── PCI driver (e.g., virtio-net) │ │
│ │ │ reads/writes config space and BARs │ │
│ │ │ programs MSI-X vectors │ │
│ │ │ submits DMA descriptors │ │
│ │ └── PCI subsystem (enumeration, resource assign) │ │
│ └──────────┬───────────┬───────────────────────────┘ │
│ │ │ │
│ Config access MMIO access │
│ (port I/O or (BAR reads/ │
│ ECAM trap) writes trap) │
│ │ │ │
├─────────────┼───────────┼───────────────────────────────┤
│ VMM (Host userspace) │ │
│ ┌──────────▼───────────▼───────────────────────────┐ │
│ │ PCI Device Model │ │
│ │ ┌──────────────┐ ┌──────────────────────────┐ │ │
│ │ │ Config Space │ │ BAR MMIO Handlers │ │ │
│ │ │ (4096 bytes) │ │ (device register logic) │ │ │
│ │ └──────────────┘ └──────────────────────────┘ │ │
│ │ ┌──────────────┐ ┌──────────────────────────┐ │ │
│ │ │ MSI-X Table │ │ Interrupt Injection │ │ │
│ │ │ (in BAR) │ │ (KVM irqfd / irqchip) │ │ │
│ │ └──────────────┘ └──────────────────────────┘ │ │
│ └──────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
Emulating Configuration Space
// Configuration space is a 4096-byte array per device.
// The VMM pre-populates standard fields and handles
// reads/writes with appropriate semantics.
struct emulated_pci_device {
uint8_t config_space[4096]; // full config space
// BAR state
struct {
uint64_t addr; // current mapped address
uint64_t size; // BAR size
uint8_t type; // memory/IO, 32/64-bit, prefetchable
bool is_upper; // true if this is upper half of 64-bit
} bars[6];
// MSI-X state
struct {
bool enabled;
uint16_t table_size; // number of vectors
struct msix_entry {
uint64_t addr;
uint32_t data;
bool masked;
} *table;
uint64_t *pba; // pending bit array
} msix;
// Capability list
// ...
};
// Initialize standard config space fields:
void init_config_space(struct emulated_pci_device *dev) {
// Vendor ID / Device ID
*(uint16_t *)(dev->config_space + 0x00) = htole16(VENDOR_ID);
*(uint16_t *)(dev->config_space + 0x02) = htole16(DEVICE_ID);
// Command: start with everything disabled
*(uint16_t *)(dev->config_space + 0x04) = 0x0000;
// Status: capabilities list present
*(uint16_t *)(dev->config_space + 0x06) = htole16(1 << 4);
// Class code
dev->config_space[0x09] = PROG_IF;
dev->config_space[0x0A] = SUBCLASS;
dev->config_space[0x0B] = CLASS_CODE;
// Header type (endpoint, single-function)
dev->config_space[0x0E] = 0x00;
// Capabilities pointer (offset of first capability)
dev->config_space[0x34] = FIRST_CAP_OFFSET;
// Initialize BARs (write size mask, firmware will size them)
// BAR0: 64-bit, non-prefetchable, 16 KB
dev->bars[0].size = 16384;
dev->bars[0].type = 0x04; // 64-bit memory
// BAR register initially shows the type bits, address = 0
*(uint32_t *)(dev->config_space + 0x10) = htole32(0x04);
*(uint32_t *)(dev->config_space + 0x14) = 0;
dev->bars[1].is_upper = true;
}
// Handle config space read:
uint32_t config_read(struct emulated_pci_device *dev,
uint16_t offset, uint8_t size) {
uint32_t val = 0;
// Some registers have special read behavior
switch (offset) {
case 0x10: case 0x14: case 0x18: case 0x1C:
case 0x20: case 0x24:
// BAR read — return current address | type bits
val = *(uint32_t *)(dev->config_space + offset);
break;
default:
memcpy(&val, dev->config_space + offset, size);
break;
}
return val;
}
// Handle config space write:
void config_write(struct emulated_pci_device *dev,
uint16_t offset, uint32_t val, uint8_t size) {
switch (offset) {
case 0x04: { // Command register
uint16_t cmd = (uint16_t)val;
uint16_t old = *(uint16_t *)(dev->config_space + 0x04);
// Only allow writable bits
uint16_t writable = 0x0447; // IO, Mem, BusMaster, INTx Disable, etc.
cmd = (old & ~writable) | (cmd & writable);
*(uint16_t *)(dev->config_space + 0x04) = cmd;
// React to changes:
if ((cmd & 0x02) && !(old & 0x02)) {
// Memory space just enabled — register MMIO regions
register_mmio_regions(dev);
}
if (!(cmd & 0x02) && (old & 0x02)) {
// Memory space disabled — unregister MMIO regions
unregister_mmio_regions(dev);
}
break;
}
case 0x10: case 0x14: case 0x18: case 0x1C:
case 0x20: case 0x24: {
// BAR write — handle sizing and address assignment
int bar_idx = (offset - 0x10) / 4;
handle_bar_write(dev, bar_idx, val);
break;
}
// ... handle capability writes (MSI-X enable, etc.)
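default:
// Simplification: this copies any write straight into config space.
// A real model applies a per-register write mask so that read-only
// fields (IDs, class code, capability headers) cannot be overwritten.
if ((uint32_t)offset + size <= sizeof(dev->config_space))
memcpy(dev->config_space + offset, &val, size);
break;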
}
}
// BAR write handling (critical for firmware interaction):
void handle_bar_write(struct emulated_pci_device *dev,
int idx, uint32_t val) {
if (dev->bars[idx].is_upper) {
// Upper 32 bits of 64-bit BAR
if (val == 0xFFFFFFFF) {
// Sizing: return upper size mask
uint64_t mask = ~(dev->bars[idx-1].size - 1);
*(uint32_t *)(dev->config_space + 0x10 + idx*4) =
htole32((uint32_t)(mask >> 32));
} else {
dev->bars[idx-1].addr = (dev->bars[idx-1].addr & 0xFFFFFFFF)
| ((uint64_t)val << 32);
*(uint32_t *)(dev->config_space + 0x10 + idx*4) = htole32(val);
// Re-register MMIO region at new address
update_mmio_mapping(dev, idx-1);
}
return;
}
// Address bits below the BAR size are hardwired to zero, so one mask
// covers both sizing (all-ones write) and address assignment.
uint32_t mask = (uint32_t)~(dev->bars[idx].size - 1) & 0xFFFFFFF0;
uint32_t type_bits = dev->bars[idx].type;
if (val == 0xFFFFFFFF) {
// Sizing: guest reads back the size mask with type bits
*(uint32_t *)(dev->config_space + 0x10 + idx*4) =
htole32(mask | type_bits);
} else {
// Address assignment (low bits forced to zero by the mask)
dev->bars[idx].addr = (dev->bars[idx].addr & 0xFFFFFFFF00000000ULL)
| (val & mask);
*(uint32_t *)(dev->config_space + 0x10 + idx*4) =
htole32((val & mask) | type_bits);
update_mmio_mapping(dev, idx);
}
}
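From the guest's side, the sizing handshake that this handler serves looks as follows. The sketch uses a mock 32-bit memory BAR so it can run standalone; `struct mock_bar32` and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Mock 32-bit memory BAR: address bits below the size read as zero. */
struct mock_bar32 {
    uint32_t size;    /* power of two, at least 16 */
    uint32_t flags;   /* low 4 bits: memory type / prefetchable */
    uint32_t value;   /* current register contents */
};

static inline void bar_write(struct mock_bar32 *b, uint32_t v)
{
    assert(b->size >= 16 && (b->size & (b->size - 1)) == 0);
    b->value = (v & ~(b->size - 1)) | (b->flags & 0xFu);
}

static inline uint32_t bar_read(const struct mock_bar32 *b)
{
    return b->value;
}

/* Classic sizing sequence: save, write all ones, read back, restore. */
static inline uint32_t bar_probe_size(struct mock_bar32 *b)
{
    uint32_t saved = bar_read(b);
    bar_write(b, 0xFFFFFFFFu);
    uint32_t probed = bar_read(b);
    bar_write(b, saved);

    uint32_t mask = probed & ~0xFu;        /* drop the type bits */
    return mask ? (~mask + 1u) : 0;        /* 0 means BAR not implemented */
}
```

The same mask that reports the size also explains the alignment rule: an unaligned address written by the guest simply loses its low bits.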
Emulating MSI-X
// MSI-X table lives in a BAR. When the guest reads/writes
// the MSI-X table region, the VMM traps the access.
void msix_table_write(struct emulated_pci_device *dev,
uint64_t offset, uint64_t val, uint8_t size) {
int vector = offset / 16; // 16 bytes per entry
int field = offset % 16; // which field within entry
if (vector >= dev->msix.table_size) return;
struct msix_entry *entry = &dev->msix.table[vector];
switch (field) {
case 0x00: // Message Address Low
entry->addr = (entry->addr & 0xFFFFFFFF00000000ULL) | (val & 0xFFFFFFFC);
break;
case 0x04: // Message Address High
entry->addr = (entry->addr & 0x00000000FFFFFFFFULL) | ((val & 0xFFFFFFFFULL) << 32);
break;
case 0x08: // Message Data
entry->data = (uint32_t)val;
break;
case 0x0C: // Vector Control
entry->masked = (val & 1);
if (!entry->masked && (dev->msix.pba[vector / 64] & (1ULL << (vector % 64)))) {
// Was pending and now unmasked — deliver interrupt
dev->msix.pba[vector / 64] &= ~(1ULL << (vector % 64));
inject_interrupt(dev, vector);
}
break;
}
// Update KVM irqfd routing if address/data changed
if (field <= 0x08 && !entry->masked) {
update_irq_routing(dev, vector, entry->addr, entry->data);
}
}
// Interrupt injection into guest (via KVM):
void inject_interrupt(struct emulated_pci_device *dev, int vector) {
struct msix_entry *entry = &dev->msix.table[vector];
if (entry->masked) {
// Set pending bit
dev->msix.pba[vector / 64] |= (1ULL << (vector % 64));
return;
}
// Option A: KVM_SIGNAL_MSI injects one MSI directly from userspace.
// Simple, but costs one ioctl per interrupt. kvm_fd here must be the
// VM file descriptor returned by KVM_CREATE_VM.
struct kvm_msi msi = {
.address_lo = (uint32_t)entry->addr,
.address_hi = (uint32_t)(entry->addr >> 32),
.data = entry->data,
};
ioctl(kvm_fd, KVM_SIGNAL_MSI, &msi);
// Option B: irqfd, configured once at setup time instead of per interrupt:
// 1. Install an MSI route for a GSI with KVM_SET_GSI_ROUTING
// (routing entry type KVM_IRQ_ROUTING_MSI carrying address/data).
// 2. Bind an eventfd to that GSI:
// struct kvm_irqfd irqfd = { .fd = eventfd(0, EFD_NONBLOCK), .gsi = gsi_number };
// ioctl(kvm_fd, KVM_IRQFD, &irqfd);
// 3. Writing an 8-byte nonzero value to the eventfd then makes KVM inject
// the interrupt in the kernel, with no further ioctl from the VMM.
// KVM_IRQ_LINE drives pin-based (INTx/GSI) interrupts and is not used for MSI-X.
}
Virtio over PCI
Virtio-PCI is the most common transport for virtual devices. It uses vendor-specific PCI capabilities to tell the driver where each virtio configuration structure lives.
Virtio PCI Capability Structure (in PCI config space):
Each virtio-pci capability is a standard PCI capability with
Cap ID = 0x09 (Vendor-Specific) containing:
struct virtio_pci_cap {
uint8_t cap_vndr; // 0x09 (PCI_CAP_ID_VNDR)
uint8_t cap_next; // Offset of next capability
uint8_t cap_len; // Length of this capability
uint8_t cfg_type; // Type of virtio structure:
// 1 = VIRTIO_PCI_CAP_COMMON_CFG
// 2 = VIRTIO_PCI_CAP_NOTIFY_CFG
// 3 = VIRTIO_PCI_CAP_ISR_CFG
// 4 = VIRTIO_PCI_CAP_DEVICE_CFG
// 5 = VIRTIO_PCI_CAP_PCI_CFG
uint8_t bar; // BAR containing the structure
uint8_t id; // Multiple instances of same type
uint8_t padding[2];
uint32_t offset; // Offset within BAR
uint32_t length; // Length of structure
};
Five capability types:
1. Common Configuration (cfg_type=1):
Mapped in a BAR. Contains:
- device_feature_select / device_feature — feature negotiation
- driver_feature_select / driver_feature
- msix_config — config MSI-X vector
- num_queues — number of virtqueues
- device_status — status byte
- config_generation
- queue_select / queue_size / queue_msix_vector
- queue_enable / queue_notify_off
- queue_desc / queue_driver / queue_device — virtqueue addresses
2. Notification (cfg_type=2):
Extended structure with:
struct virtio_pci_notify_cap {
struct virtio_pci_cap cap;
uint32_t notify_off_multiplier; // multiplier for queue offset
};
To notify queue Q:
Write queue index to:
BAR[cap.bar] + cap.offset + queue_notify_off * notify_off_multiplier
The VMM handles this MMIO write using ioeventfd:
KVM_IOEVENTFD maps a specific MMIO address to an eventfd.
When the guest writes to the doorbell address, the write still
causes a VM exit, but KVM signals the eventfd in the kernel and
resumes the guest without returning to userspace. The VMM backend
thread wakes on the eventfd and processes the virtqueue.
3. ISR Status (cfg_type=3):
Single byte register for legacy interrupt status.
Bit 0: virtqueue interrupt
Bit 1: device configuration change
4. Device Configuration (cfg_type=4):
Device-type-specific configuration (e.g., MAC address for
virtio-net, capacity for virtio-blk). Format varies by device.
5. PCI Configuration Access (cfg_type=5):
Allows accessing the other structures via PCI config space
reads/writes (for platforms that can't easily mmap BARs).
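The doorbell address described under the Notification capability is a one-line computation. A minimal sketch, assuming the BAR base and capability fields have already been read; `virtio_notify_addr` is an illustrative name.

```c
#include <assert.h>
#include <stdint.h>

/* Address the driver writes to in order to notify one virtqueue:
 * BAR base + cap.offset + queue_notify_off * notify_off_multiplier. */
static inline uint64_t virtio_notify_addr(uint64_t bar_base,
                                          uint32_t cap_offset,
                                          uint16_t queue_notify_off,
                                          uint32_t notify_off_multiplier)
{
    return bar_base + cap_offset
         + (uint64_t)queue_notify_off * notify_off_multiplier;
}
```

A multiplier of 0 makes every queue share one doorbell address, while a nonzero multiplier gives each queue its own address, which lets a VMM register one ioeventfd per queue.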
Modern Virtio (1.0+) vs Legacy:
Modern: Uses PCI capabilities above. Guest discovers structures
via capability list walking. Clean, extensible.
Legacy: Fixed layout in BAR0:
Offset 0x00-0x13: Common header
Offset 0x14+: Device-specific config
No capabilities. BAR0 I/O port space.
Still supported for compatibility.
Implementing a Minimal PCI Device in a VMM
// Minimal example: a PCI device that exposes a 256-byte
// register space with a writable scratch register and
// an interrupt-on-write register.
struct minimal_pci_dev {
uint8_t config[4096];
uint8_t regs[256]; // device registers in BAR0
int kvm_fd;
int irqfd; // eventfd for MSI-X vector 0
struct msix_entry msix[1]; // single MSI-X vector
};
// Step 1: Initialize config space
void minimal_init(struct minimal_pci_dev *dev) {
memset(dev->config, 0, sizeof(dev->config));
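// Vendor=0x1AF4 (Red Hat), Device=0x1000
// (0x1000 is the transitional virtio-net ID, so a guest virtio driver may
// try to bind; a real custom device should use an ID of its own)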
put_le16(dev->config + 0x00, 0x1AF4);
put_le16(dev->config + 0x02, 0x1000);
// Status: capabilities list present
put_le16(dev->config + 0x06, 0x0010);
// Class: Unclassified (0xFF)
dev->config[0x0B] = 0xFF;
// Header type 0, single function
dev->config[0x0E] = 0x00;
// BAR0: 512 bytes, 32-bit, non-prefetchable
// (0x000-0x0FF registers, 0x100 MSI-X table, 0x180 PBA;
// size mask = ~(512-1) = 0xFFFFFE00)
put_le32(dev->config + 0x10, 0x00); // type = memory 32-bit
// Capabilities pointer → offset 0x40
dev->config[0x34] = 0x40;
// MSI-X capability at offset 0x40
dev->config[0x40] = 0x11; // Cap ID = MSI-X
dev->config[0x41] = 0x00; // Next = end of list
put_le16(dev->config + 0x42, 0x0000); // 1 vector (N-1=0), disabled
put_le32(dev->config + 0x44, 0x00000100 | 0); // Table at BAR0+0x100
put_le32(dev->config + 0x48, 0x00000180 | 0); // PBA at BAR0+0x180
}
// Step 2: Register with KVM
// - Map ECAM region for config space access (or use port I/O trap)
// - Set up ioeventfd for BAR0 MMIO writes
// - Set up irqfd for MSI-X interrupt injection
// Step 3: Handle MMIO reads/writes to BAR0
uint32_t bar0_read(struct minimal_pci_dev *dev, uint64_t offset) {
if (offset < 0x100) {
// Device registers
return get_le32(dev->regs + offset);
} else if (offset < 0x110) {
// MSI-X table (1 entry = 16 bytes)
int field = (offset - 0x100) % 16;
switch (field) {
case 0: return (uint32_t)dev->msix[0].addr;
case 4: return (uint32_t)(dev->msix[0].addr >> 32);
case 8: return dev->msix[0].data;
case 12: return dev->msix[0].masked ? 1 : 0;
}
}
return 0xFFFFFFFF;
}
void bar0_write(struct minimal_pci_dev *dev, uint64_t offset,
uint32_t val) {
if (offset == 0x04) {
// Interrupt trigger register: writing any value
// causes MSI-X vector 0 to fire
uint64_t one = 1;
write(dev->irqfd, &one, sizeof(one)); // trigger irqfd
} else if (offset < 0x100) {
put_le32(dev->regs + offset, val);
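} else if (offset < 0x110) {
// MSI-X table write: update the single entry in place
struct msix_entry *e = &dev->msix[0];
switch ((offset - 0x100) % 16) {
case 0: e->addr = (e->addr & 0xFFFFFFFF00000000ULL) | (val & 0xFFFFFFFC); break;
case 4: e->addr = (e->addr & 0xFFFFFFFFULL) | ((uint64_t)val << 32); break;
case 8: e->data = val; break;
case 12: e->masked = val & 1; break;
}
// A full VMM would also refresh the KVM MSI route for this vector here.
}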
}
13. PCIe Passthrough and SR-IOV
VFIO-Based PCIe Passthrough
Passthrough gives a guest VM direct access to a physical PCIe
device. The guest's driver talks directly to hardware — no
emulation overhead for data path operations.
Architecture:
┌──────────────────────────────────┐
│ Guest VM │
│ Guest driver ←→ Physical NIC │ ← Direct MMIO + DMA
└────────┬─────────────────────────┘
│ (guest physical → host physical mapping)
┌────────┴─────────────────────────┐
│ IOMMU (VT-d / AMD-Vi) │ ← DMA isolation
│ Translates guest DMA addresses │
│ to host physical addresses │
└──────────────────────────────────┘
Steps for passthrough:
1. Unbind device from host driver:
echo 0000:03:00.0 > /sys/bus/pci/devices/0000:03:00.0/driver/unbind
2. Bind to vfio-pci:
echo 8086 1572 > /sys/bus/pci/drivers/vfio-pci/new_id
# or: echo vfio-pci > /sys/bus/pci/devices/0000:03:00.0/driver_override
# echo 0000:03:00.0 > /sys/bus/pci/drivers_probe
3. Open VFIO container and group:
container_fd = open("/dev/vfio/vfio", O_RDWR);
group_fd = open("/dev/vfio/42", O_RDWR); // IOMMU group
ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container_fd);
ioctl(container_fd, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU);
4. Get device fd:
device_fd = ioctl(group_fd, VFIO_GROUP_GET_DEVICE_FD, "0000:03:00.0");
5. Map guest memory for DMA:
struct vfio_iommu_type1_dma_map dma_map = {
.argsz = sizeof(dma_map),
.vaddr = (uint64_t)guest_ram,
.iova = 0, // guest physical address = 0
.size = guest_ram_size,
.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
};
ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
6. Access config space:
struct vfio_region_info config_info = {
.argsz = sizeof(config_info),
.index = VFIO_PCI_CONFIG_REGION_INDEX,
};
ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &config_info);
// Read config: pread(device_fd, buf, 4, config_info.offset + reg);
// Write config: pwrite(device_fd, &val, 4, config_info.offset + reg);
7. Map BARs into guest:
struct vfio_region_info bar_info = {
.argsz = sizeof(bar_info),
.index = VFIO_PCI_BAR0_REGION_INDEX,
};
ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &bar_info);
void *bar0 = mmap(NULL, bar_info.size,
PROT_READ | PROT_WRITE, MAP_SHARED,
device_fd, bar_info.offset);
// Then map this into guest address space via KVM_SET_USER_MEMORY_REGION
8. Set up interrupts:
struct vfio_irq_info irq_info = {
.argsz = sizeof(irq_info),
.index = VFIO_PCI_MSIX_IRQ_INDEX,
};
ioctl(device_fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
int efd = eventfd(0, EFD_NONBLOCK);
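// alloc_irq_set(n) is assumed to be a local helper (not a VFIO call) that
// allocates sizeof(struct vfio_irq_set) + n * sizeof(int) zeroed bytes
struct vfio_irq_set *irq_set = alloc_irq_set(1);
irq_set->argsz = sizeof(*irq_set) + sizeof(int);
irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
irq_set->start = 0;
irq_set->count = 1;
*(int *)irq_set->data = efd;
ioctl(device_fd, VFIO_DEVICE_SET_IRQS, irq_set);
// Then connect this eventfd to KVM for interrupt injection.
// guest_gsi must already be routed to the guest's MSI message
// (KVM_SET_GSI_ROUTING with a KVM_IRQ_ROUTING_MSI entry):
struct kvm_irqfd kvm_irqfd = { .fd = efd, .gsi = guest_gsi };
ioctl(kvm_fd, KVM_IRQFD, &kvm_irqfd);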
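Step 3 hard-codes the IOMMU group number (42). On a real system the group is found by following the device's iommu_group symlink in sysfs. The following is a minimal sketch of that lookup; the function names are illustrative, and the example link target in the comment is only a typical shape, not a guaranteed format.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Extract the trailing group number from a link target such as
 * "../../../kernel/iommu_groups/42". Returns -1 if the last path
 * component is not a non-negative decimal number. */
int iommu_group_from_link(const char *target)
{
    const char *base = strrchr(target, '/');
    char *end;
    long n;

    base = base ? base + 1 : target;
    if (*base == '\0')
        return -1;

    n = strtol(base, &end, 10);
    if (*end != '\0' || n < 0 || n > INT_MAX)
        return -1;
    return (int)n;
}

/* Resolve the IOMMU group of a PCI device given as "0000:03:00.0".
 * Returns the group number, or -1 if the symlink cannot be read
 * (device missing, or no IOMMU group assigned). */
int iommu_group_of(const char *bdf)
{
    char path[512], target[512];
    ssize_t len;

    snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/iommu_group", bdf);
    len = readlink(path, target, sizeof(target) - 1);
    if (len < 0)
        return -1;
    target[len] = '\0';   /* readlink() does not NUL-terminate */
    return iommu_group_from_link(target);
}
```

The returned number is what goes into the /dev/vfio/<group> path. Every device in the same group has to be bound to vfio-pci (or be driverless) before the group becomes usable.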
Config Space Virtualization for Passthrough
Even in passthrough, config space access must be virtualized
(trapped by the VMM), not passed through directly. Reasons:
1. BAR addresses: Guest sees guest physical addresses, but
host BARs are at different host physical addresses.
The VMM must translate.
2. Bus numbers: Guest has its own PCI bus numbering, so the
BDF the guest sees (and would assume as the Requester ID)
does not match the device's BDF on the host.
3. Capabilities: Some capabilities (ASPM, power management)
must be controlled by the host, not the guest.
4. Security: Unrestricted config write could let guest
disable IOMMU checks, change BARs to overlap with
other devices, etc.
Typical virtualization strategy:
- Vendor ID, Device ID, Class Code: Pass through from device
- Command register: Virtualize. Track guest's view. Apply
safe bits to hardware. Never let guest writes weaken
host-controlled isolation (IOMMU mappings stay with the VMM).
- BARs: Show guest the guest physical addresses. When guest
writes BAR (sizing or assignment), handle in VMM. The
actual hardware BARs remain at host-assigned addresses.
- Capabilities: Selective pass-through. MSI-X capability
is intercepted (VMM manages interrupt routing). PCIe
capability link status can be passed through (read-only).
Power management: intercept to prevent guest from
power-managing the device (host controls this).
- Extended capabilities: AER status can be passed through.
SR-IOV capability is hidden from VF guests.
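The BAR handling described above can be sketched as a small emulation of one 32-bit memory BAR. This is an illustrative model, not code from any particular VMM: struct vbar and its functions are hypothetical names. The point it shows is that the sizing protocol (write all 1s, read back the size mask) falls out of simply masking guest writes with the region size.

```c
#include <assert.h>
#include <stdint.h>

/* Guest-visible state of one 32-bit memory BAR (hypothetical VMM type). */
struct vbar {
    uint32_t size;   /* region size in bytes, power of two, >= 16 */
    uint32_t flags;  /* hardwired low bits: [0]=0 memory, [2:1] type, [3] prefetch */
    uint32_t value;  /* address bits the guest last programmed */
};

/* Guest write: keep only the address bits the BAR implements.
 * Writing 0xFFFFFFFF therefore leaves ~(size - 1), which is exactly
 * what the guest expects to read back during sizing. */
void vbar_write(struct vbar *bar, uint32_t val)
{
    assert(bar->size >= 16 && (bar->size & (bar->size - 1)) == 0);
    bar->value = val & ~(bar->size - 1);
}

/* Guest read: programmed address bits plus the read-only type bits. */
uint32_t vbar_read(const struct vbar *bar)
{
    return bar->value | (bar->flags & 0xFu);
}

/* Guest-physical base at which the VMM should map the region. The
 * physical device's BAR is never touched by any of these functions. */
uint32_t vbar_guest_base(const struct vbar *bar)
{
    return bar->value;
}
```

A VMM would call vbar_guest_base() after each guest BAR write and move its mapping of the host BAR (for example the mmap from step 7) to that guest-physical address.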
SR-IOV Architecture
SR-IOV splits a single physical device (Physical Function, PF)
into multiple Virtual Functions (VFs), each assignable to a
different VM.
┌──────────────────────────────────────────────────────────┐
│ Physical NIC (e.g., Intel E810) │
│ │
│ PF (Bus:Dev.0) │
│ ├── Full device functionality │
│ ├── SR-IOV capability │
│ ├── Controls VF creation/destruction │
│ └── Manages shared resources (link, firmware, etc.) │
│ │
│ VF0 (RID = PF RID + offset) VF1 (RID = PF RID + offset + stride)
│ ├── Lightweight PCIe function ├── Lightweight PCIe function
│ ├── Own config space ├── Own config space
│ ├── Own BARs (queue regs) ├── Own BARs (queue regs)
│ ├── Own MSI-X vectors ├── Own MSI-X vectors
│ └── Assigned to VM1 └── Assigned to VM2
│ │
│ ... up to TotalVFs ... │
└──────────────────────────────────────────────────────────┘
VF creation (Linux):
# Check total VFs supported
cat /sys/bus/pci/devices/0000:03:00.0/sriov_totalvfs
64
# Create 4 VFs
echo 4 > /sys/bus/pci/devices/0000:03:00.0/sriov_numvfs
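# VFs appear as new PCI devices. Their addresses follow
# VF RID = PF RID + First VF Offset + n * VF Stride
# (both values come from the PF's SR-IOV capability).
# Example with First VF Offset = 8 and VF Stride = 1:
# 0000:03:01.0 (VF0)
# 0000:03:01.1 (VF1)
# 0000:03:01.2 (VF2)
# 0000:03:01.3 (VF3)
# Each VF can be individually bound to vfio-pci for passthrough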
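Where a VF lands on the bus is determined by two fields in the PF's SR-IOV capability, First VF Offset and VF Stride, applied to the 16-bit Routing ID. A minimal sketch of that arithmetic follows; the function names are illustrative, and the offset and stride values used in the usage note are assumed examples, since real devices report their own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Routing ID layout: bus[15:8], device[7:3], function[2:0]. */
uint16_t make_rid(uint8_t bus, uint8_t dev, uint8_t fn)
{
    return (uint16_t)((bus << 8) | ((dev & 0x1F) << 3) | (fn & 0x7));
}

/* VF n (0-based):
 * VF RID = PF RID + First VF Offset + n * VF Stride, in 16-bit arithmetic. */
uint16_t vf_rid(uint16_t pf_rid, uint16_t first_vf_offset,
                uint16_t vf_stride, unsigned n)
{
    return (uint16_t)(pf_rid + first_vf_offset + n * vf_stride);
}

/* Format a RID as "bb:dd.f". */
void rid_to_str(uint16_t rid, char *buf, size_t len)
{
    snprintf(buf, len, "%02x:%02x.%x",
             (unsigned)(rid >> 8), (unsigned)((rid >> 3) & 0x1F),
             (unsigned)(rid & 0x7));
}
```

With a PF at 03:00.0, offset 8 and stride 1 place VF0 at 03:01.0 and VF3 at 03:01.3. Offset 0x80 and stride 2 place VF1 at 03:10.2. Because the sum is taken over the whole Routing ID, VFs can spill onto a higher bus number when there are many of them.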
VF properties:
- VFs have no SR-IOV capability (only PF does)
- VFs have a reduced config space: Vendor ID and Device ID read as
0xFFFF (the OS takes them from the PF's SR-IOV capability) and the
BAR registers read as zero
- VF BARs are slices of the PF's VF BAR aperture
- VFs share the physical link but have isolated queues
- VFs have own MSI-X vectors (interrupt isolation)
- VFs appear to guest as regular PCIe endpoints
Key advantage over full device passthrough:
- Multiple VMs can share one physical device
- Near-native performance (no emulation on data path)
- Hardware-enforced isolation between VFs
Production examples:
Intel E810 (100G NIC): 256 VFs per PF, 4 PFs
Mellanox ConnectX-6: 1024 VFs per PF
NVMe SSDs (some): VF support for storage isolation
Intel QAT: VFs for crypto offload to VMs
SR-IOV in Linux Kernel
// PF driver implements sriov_configure callback:
static int my_sriov_configure(struct pci_dev *pdev, int num_vfs)
{
if (num_vfs == 0) {
// Disable SR-IOV
pci_disable_sriov(pdev);
// Free VF resources
my_free_vf_resources(pdev);
return 0;
}
// Allocate per-VF resources (queues, etc.)
int err = my_alloc_vf_resources(pdev, num_vfs);
if (err)
return err;
// Enable SR-IOV — creates VF PCIe functions
err = pci_enable_sriov(pdev, num_vfs);
if (err) {
my_free_vf_resources(pdev);
return err;
}
return num_vfs;
}
static struct pci_driver my_pf_driver = {
.name = "my_nic_pf",
.id_table = my_pf_ids,
.probe = my_pf_probe,
.remove = my_pf_remove,
.sriov_configure = my_sriov_configure,
};
// VF driver is a separate, simpler driver:
static struct pci_driver my_vf_driver = {
.name = "my_nic_vf",
.id_table = my_vf_ids, // matches VF device ID
.probe = my_vf_probe,
.remove = my_vf_remove,
};
14. Power Management
ASPM (Active State Power Management)
ASPM allows PCIe links to enter low-power states during idle
periods, without software involvement.
L0: Active. Full bandwidth. Normal operation.
No power savings. No latency.
L0s: Standby. TX lanes go electrically idle.
Entry: Autonomous (hardware detects idle period).
Exit latency: ~1 us (fast).
Power savings: Moderate (TX PLL can idle).
Per-direction: Each direction can enter L0s independently.
Caveat: Short idle periods → frequent transitions → overhead.
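L1: Low power. Both TX and RX quiesced. PLL may be off.
Entry: Both ends must agree (via DLLP handshake):
1. Upstream port (on the downstream component) sends
PM_Active_State_Request_L1 DLLPs
2. Downstream port sends PM_Request_Ack DLLPs
3. Both enter L1
(PM_Enter_L1 is the equivalent DLLP for software-directed
PCI-PM L1, used when the device is put into D1-D3hot.)
Exit latency: a few us to tens of us. Each port advertises its
value in Link Capabilities (L1 Exit Latency field).
Power savings: Significant.
Software can enable/disable via Link Control register.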
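L1 Substates (PCIe 3.1+, L1 PM Substates capability):
L1.0: Standard L1. PHY powered, reference clock on.
L1.1: Reference clock gated via CLKREQ#. Common-mode voltage
on the lanes is maintained.
Additional power savings from clock gating.
Exit latency: add clock stabilization time.
L1.2: Reference clock gated and common-mode voltage removed.
Maximum power savings. Only logic for wake detection
is powered.
Exit latency: longest (tens of us or more, depending on
T_POWER_ON and common-mode restore time).
Both L1.1 and L1.2 require the CLKREQ# signal (not available on
all platforms). Each exists in an ASPM and a PCI-PM variant,
enabled separately in the L1 PM Substates Control 1 register.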
ASPM configuration:
Link Control register bits 1:0:
00 = ASPM disabled
01 = L0s enabled
10 = L1 enabled
11 = L0s and L1 enabled
Linux kernel: /sys/module/pcie_aspm/parameters/policy
default, performance, powersave, powersupersave
Gotcha: ASPM can cause latency spikes. Some NVMe drivers
disable ASPM for consistent performance:
pci_disable_link_state(pdev, PCIE_LINK_STATE_L0S |
PCIE_LINK_STATE_L1);
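The ASPM Control field can be decoded and updated with a plain read-modify-write on the 16-bit Link Control value. A minimal sketch, assuming the caller has already read the register from config space; the macro and function names are local to this example (the Linux equivalents in pci_regs.h are PCI_EXP_LNKCTL_ASPMC, PCI_EXP_LNKCTL_ASPM_L0S and PCI_EXP_LNKCTL_ASPM_L1).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LNKCTL_ASPM_MASK 0x0003u  /* Link Control bits 1:0 */
#define LNKCTL_ASPM_L0S  0x0001u
#define LNKCTL_ASPM_L1   0x0002u

/* Human-readable name of the ASPM setting held in a Link Control value. */
const char *aspm_name(uint16_t lnkctl)
{
    switch (lnkctl & LNKCTL_ASPM_MASK) {
    case 0:               return "disabled";
    case LNKCTL_ASPM_L0S: return "L0s";
    case LNKCTL_ASPM_L1:  return "L1";
    default:              return "L0s+L1";
    }
}

/* Replace only the ASPM field and keep every other Link Control bit. */
uint16_t lnkctl_set_aspm(uint16_t lnkctl, uint16_t aspm)
{
    return (uint16_t)((lnkctl & ~LNKCTL_ASPM_MASK) | (aspm & LNKCTL_ASPM_MASK));
}
```

Inside the kernel, drivers should prefer pci_disable_link_state() over writing the register themselves, because the ASPM core also has to keep both ends of the link consistent.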
D-States
D-state transitions:
┌──────┐ software ┌──────┐ software ┌──────┐
│ D0 │────────────►│ D1 │────────────►│ D2 │
│Active│◄────────────│Light │◄────────────│Deeper│
└──┬───┘ └──────┘ └──────┘
│ │
│ software software │
▼ ▼
┌──────┐ ┌──────────┐
│D3hot │◄─────────────────────────────│ D3cold │
│Config│ power removal │ No power│
│only │──────────────────────────────►│ (aux │
└──────┘ │ only) │
└──────────┘
Transition via PMCSR register (power management capability):
Write PowerState bits [1:0] to desired D-state.
D3hot → D0: Requires at least 10 ms delay (spec minimum).
D3cold → D0: Full power-on reset (link retrain, enumeration).
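A D-state change is a read-modify-write of PMCSR followed by a settle delay. The sketch below models just the register arithmetic. Names are illustrative. Two details go beyond the text above and come from the PCI power management rules as mirrored by Linux's constants: PME_Status (bit 15) is write-1-to-clear, so it is masked out of the value written back, and transitions involving D2 need 200 us.

```c
#include <assert.h>
#include <stdint.h>

#define PMCSR_STATE_MASK 0x0003u  /* PowerState, bits 1:0 */
#define PMCSR_PME_STATUS 0x8000u  /* bit 15, write-1-to-clear */

enum d_state { D0 = 0, D1 = 1, D2 = 2, D3HOT = 3 };

/* Value to write to PMCSR to request `target`, given the value just read.
 * PME_Status is cleared in the written value so that the write does not
 * acknowledge a pending PME by accident. */
uint16_t pmcsr_for_state(uint16_t pmcsr, enum d_state target)
{
    pmcsr &= (uint16_t)~(PMCSR_STATE_MASK | PMCSR_PME_STATUS);
    return (uint16_t)(pmcsr | (uint16_t)target);
}

/* Current D-state encoded in a PMCSR value. */
enum d_state pmcsr_state(uint16_t pmcsr)
{
    return (enum d_state)(pmcsr & PMCSR_STATE_MASK);
}

/* Minimum wait, in microseconds, after the PMCSR write before the
 * device may be accessed again. */
unsigned settle_delay_us(enum d_state from, enum d_state to)
{
    if (from == to)
        return 0;
    if (from == D3HOT || to == D3HOT)
        return 10000;   /* 10 ms */
    if (from == D2 || to == D2)
        return 200;     /* 200 us */
    return 0;
}
```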
Device context in each state:
D0: All context preserved. Fully operational.
D1: Most context preserved. Quick resume.
D2: May lose some context. Moderate resume time.
D3hot: Config space accessible. No MMIO/DMA. Internal
state may be lost. Software must re-initialize.
D3cold: Nothing preserved except aux-powered PME logic.
Device must be fully re-initialized on power-on.
PME (Power Management Event):
Device in D1/D2/D3 can signal PME to wake the system.
In PCIe: PME is a message TLP sent upstream to the Root Port.
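The Root Port records the requester in its Root Status register and
raises a PME interrupt (when PME Interrupt Enable is set in Root Control).
Linux: the PME port service driver (pcie_pme) handles these interrupts.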
15. Error Handling
Error Classification
Correctable Errors:
Automatically corrected by hardware (link layer retry).
No data corruption. Counted for monitoring.
Specific errors:
- Receiver Error: 8b/10b or 128b/130b decode error
- Bad TLP: CRC error on TLP (LCRC mismatch)
- Bad DLLP: CRC error on DLLP
- Replay Timer Timeout: Ack not received in time
- Replay Number Rollover: Too many retries
- Advisory Non-Fatal: Non-fatal error treated as correctable
- Corrected Internal Error: Internal logic error corrected
Uncorrectable Non-Fatal Errors:
Data may be corrupted but device can continue.
Software intervention needed.
Specific errors (default severity):
- Poisoned TLP Received: Data payload known bad
- Completion Timeout: No completion received for non-posted request
- Completer Abort: Completer deliberately rejected request
- Unexpected Completion: Received completion for no matching request
- Unsupported Request: Request type or address not supported
- ECRC Error: End-to-end CRC mismatch
- ACS Violation: ACS check failed
Uncorrectable Fatal Errors:
Link reliability compromised. Link must be reset.
Specific errors (default severity):
- Data Link Protocol Error: DLLP sequence error
- Surprise Down Error: Link went down unexpectedly
- Flow Control Protocol Error: Credit protocol violation
- Receiver Overflow: Receiver buffer overflow
- Malformed TLP: TLP structure invalid
- Uncorrectable Internal Error: Internal logic error
Fatal vs. non-fatal is programmable per error in the AER
Uncorrectable Error Severity register. The lists above show the defaults.
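Whether an uncorrectable error is reported as fatal or non-fatal is decided per bit by the AER Uncorrectable Error Severity register, after the Mask register has been applied. The following sketch splits a status value accordingly. Bit positions follow the AER register layout (the same values appear as PCI_ERR_UNC_* in Linux's pci_regs.h); the type and function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions shared by the Uncorrectable Status, Mask and Severity registers. */
#define UNC_DLP          (1u << 4)   /* Data Link Protocol Error */
#define UNC_SURPRISE_DN  (1u << 5)   /* Surprise Down */
#define UNC_POISONED_TLP (1u << 12)
#define UNC_FCP          (1u << 13)  /* Flow Control Protocol Error */
#define UNC_CMPL_TIMEOUT (1u << 14)
#define UNC_CMPL_ABORT   (1u << 15)
#define UNC_UNEXP_CMPL   (1u << 16)
#define UNC_RX_OVERFLOW  (1u << 17)
#define UNC_MALFORMED    (1u << 18)
#define UNC_ECRC         (1u << 19)
#define UNC_UNSUP_REQ    (1u << 20)
#define UNC_ACS_VIOL     (1u << 21)
#define UNC_INTERNAL     (1u << 22)

struct unc_split {
    uint32_t fatal;     /* unmasked errors with severity bit = 1 */
    uint32_t nonfatal;  /* unmasked errors with severity bit = 0 */
};

/* Split the unmasked status bits by the programmed severity. */
struct unc_split classify_uncorrectable(uint32_t status, uint32_t mask,
                                        uint32_t severity)
{
    uint32_t active = status & ~mask;
    struct unc_split s = { active & severity, active & ~severity };
    return s;
}
```

For example, with a severity value that marks the link-level errors (DLP, Surprise Down, FCP, Receiver Overflow, Malformed TLP) as fatal, a status of Malformed TLP plus Completion Timeout yields one fatal and one non-fatal error. The severity value should always be read from the device rather than assumed.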
AER Error Flow
Error occurs at device:
1. Device sets error bit in AER Uncorrectable/Correctable
Error Status register.
2. If error reporting is enabled in Device Control register:
- Device sends ERR_COR, ERR_NONFATAL, or ERR_FATAL message
upstream toward Root Port.
3. Root Port receives error message:
- Sets appropriate bit in Root Error Status register.
- If interrupt enabled: triggers AER interrupt.
4. Linux AER driver (aer.c):
- Reads Root Error Status to identify error source.
- Reads error device's AER registers (via config space).
- Logs the Header Log (first 4 DW of offending TLP).
- For fatal/non-fatal: initiates error recovery state machine.
- Calls driver's err_handler callbacks.
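Step 4 starts by working out who reported the error. A minimal sketch of that decode, assuming the caller has read Root Error Status and the Error Source Identification register from the Root Port's AER capability; struct aer_source and the function name are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Root Error Status bits. */
#define ROOT_COR_RCV      (1u << 0)  /* ERR_COR received */
#define ROOT_MULTI_COR    (1u << 1)  /* more than one ERR_COR */
#define ROOT_UNCOR_RCV    (1u << 2)  /* ERR_FATAL or ERR_NONFATAL received */
#define ROOT_MULTI_UNCOR  (1u << 3)
#define ROOT_FIRST_FATAL  (1u << 4)  /* first uncorrectable was fatal */
#define ROOT_NONFATAL_RCV (1u << 5)
#define ROOT_FATAL_RCV    (1u << 6)

struct aer_source {
    bool     correctable;    /* at least one ERR_COR pending */
    bool     uncorrectable;  /* at least one ERR_FATAL/NONFATAL pending */
    bool     fatal;          /* at least one ERR_FATAL seen */
    uint16_t cor_rid;        /* Requester ID of the first ERR_COR sender */
    uint16_t uncor_rid;      /* Requester ID of the first uncorrectable sender */
};

/* Error Source Identification: [15:0] ERR_COR source,
 * [31:16] ERR_FATAL/NONFATAL source. */
struct aer_source decode_root_error(uint32_t root_status, uint32_t source_id)
{
    struct aer_source s = {
        .correctable   = (root_status & ROOT_COR_RCV) != 0,
        .uncorrectable = (root_status & ROOT_UNCOR_RCV) != 0,
        .fatal         = (root_status & ROOT_FATAL_RCV) != 0,
        .cor_rid       = (uint16_t)(source_id & 0xFFFFu),
        .uncor_rid     = (uint16_t)(source_id >> 16),
    };
    return s;
}
```

The Requester ID is then used to locate the reporting device and read its own AER status and Header Log registers.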
DPC (Downstream Port Containment):
Configured on Root Ports or Switch Downstream Ports.
When an uncorrectable error is detected on the downstream link (the
trigger is configurable: ERR_FATAL only, or ERR_NONFATAL and ERR_FATAL):
1. DPC automatically disables the link (containment).
2. Prevents error propagation to rest of hierarchy.
3. Triggers DPC interrupt to software.
4. Software can attempt recovery:
- Clear DPC trigger status
- Retrain link
- Re-enumerate devices
- Or hot-remove the failed device
DPC is especially valuable for NVMe hotplug and
surprise-removal scenarios.
Linux AER Recovery State Machine
Error Detected
│
┌────▼────────────────────┐
│ error_detected() │
│ - Stop I/O │
│ - Return: CAN_RECOVER, │
│ NEED_RESET, or │
│ DISCONNECT │
└────┬────────────────────┘
│
┌────────────┼────────────┐
▼ ▼ ▼
CAN_RECOVER NEED_RESET DISCONNECT
│ │ │
┌────▼────┐ ┌────▼────┐ │
│mmio_ │ │platform │ │
│enabled()│ │link/slot│ Remove
│- Probe │ │reset │ device
│ device │ └────┬────┘
└────┬────┘ │
│ ┌────▼────┐
│ │slot_ │
│ │reset() │
│ │- Restore│
│ │ config │
│ │- Re-init│
│ └────┬────┘
│ │
└──────┬─────┘
│
┌────▼────┐
│resume() │
│- Normal │
│ ops │
└─────────┘
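The diagram can be read as a small decision function. The sketch below is a user-space model of the callback ordering for a single driver, not kernel code. The enum and function names are illustrative. One path is not drawn in the diagram and is taken from the kernel's pci-error-recovery document: mmio_enabled() may itself ask for a reset, which sends the flow down the NEED_RESET branch.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the PCI_ERS_RESULT_* return values. */
enum ers_result { ERS_CAN_RECOVER, ERS_NEED_RESET, ERS_DISCONNECT, ERS_RECOVERED };

enum step { STEP_MMIO_ENABLED, STEP_RESET, STEP_SLOT_RESET, STEP_RESUME, STEP_REMOVE };

/* Fill `out` with the steps that follow error_detected(), given what
 * error_detected() and mmio_enabled() return. Returns the step count.
 * `mmio_result` is ignored unless mmio_enabled() is actually called. */
size_t recovery_plan(enum ers_result detected, enum ers_result mmio_result,
                     enum step out[4])
{
    size_t n = 0;

    if (detected == ERS_DISCONNECT) {
        out[n++] = STEP_REMOVE;
        return n;
    }
    if (detected == ERS_CAN_RECOVER) {
        out[n++] = STEP_MMIO_ENABLED;
        if (mmio_result == ERS_DISCONNECT) {
            out[n++] = STEP_REMOVE;
            return n;
        }
        if (mmio_result == ERS_NEED_RESET)
            detected = ERS_NEED_RESET;   /* escalate to a reset */
    }
    if (detected == ERS_NEED_RESET) {
        out[n++] = STEP_RESET;           /* platform link/slot reset */
        out[n++] = STEP_SLOT_RESET;      /* driver restores config, re-inits */
    }
    out[n++] = STEP_RESUME;
    return n;
}
```

The model leaves out failure of slot_reset() and the voting that happens when several drivers sit below the failing port.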
16. Advanced Topics
CXL (Compute Express Link)
CXL is built on top of the PCIe physical layer. Three sub-protocols:
CXL.io: Equivalent to PCIe TLP layer. Used for device discovery,
configuration, MMIO access. Functionally identical to PCIe
transactions. This is what the device uses during enumeration
and initialization.
CXL.cache: Device-to-host cache coherency protocol. Allows the
device to cache host memory with full coherency (snoop/invalidate).
Enables accelerators to access host memory without software
cache flush/invalidate. Uses separate header format from TLPs.
CXL.mem: Host-to-device memory access protocol. Allows the host
CPU to access device-attached memory as if it were local DRAM
(load/store, cacheable). Enables memory expanders, pooling.
CXL device types:
Type 1: Accelerator (CXL.io + CXL.cache) — e.g., SmartNIC
Type 2: Accelerator with memory (CXL.io + CXL.cache + CXL.mem)
— e.g., GPU with device memory
Type 3: Memory expander (CXL.io + CXL.mem) — e.g., CXL DRAM
module, persistent memory
CXL versions:
CXL 1.0/1.1: Based on PCIe Gen 5 PHY
CXL 2.0: Memory pooling, switching, security
CXL 3.0: Based on PCIe Gen 6 PHY (PAM4, 64 GT/s)
Back-invalidation, peer-to-peer, fabric management
Relationship to PCIe:
CXL uses the same electrical signaling, link training (LTSSM),
and physical layer as PCIe. A CXL port negotiates CXL protocol
during link training via modified TS1/TS2 ordered sets.
If both sides support CXL, the link runs CXL protocols.
If either side is PCIe-only, it falls back to standard PCIe.
CXL devices enumerate as PCIe devices (same config space,
BDF, ECAM) with additional CXL-specific capabilities.
NVMe over PCIe
NVMe uses PCIe as its transport. Key architectural mapping:
NVMe BAR0 (Controller Registers):
Offset 0x00: CAP (Controller Capabilities, 8B)
Offset 0x08: VS (Version, 4B)
Offset 0x0C: INTMS (Interrupt Mask Set, 4B)
Offset 0x10: INTMC (Interrupt Mask Clear, 4B)
Offset 0x14: CC (Controller Configuration, 4B)
Offset 0x1C: CSTS (Controller Status, 4B)
Offset 0x20: NSSR (NVM Subsystem Reset, 4B)
Offset 0x24: AQA (Admin Queue Attributes, 4B)
Offset 0x28: ASQ (Admin Submission Queue Base Addr, 8B)
Offset 0x30: ACQ (Admin Completion Queue Base Addr, 8B)
Offset 0x1000+: Doorbell registers (per-queue)
Submission Queue (SQ): Ring buffer in host memory. Host writes
64-byte NVMe commands. Host writes SQ Tail Doorbell in BAR0
to notify controller of new commands.
Doorbell write = single 4-byte MMIO write = single PCIe
Memory Write TLP. This is the "submission" cost.
Completion Queue (CQ): Ring buffer in host memory. Controller
DMAs 16-byte completion entries. Controller triggers MSI-X
interrupt to notify host.
DMA write + interrupt = one PCIe Memory Write TLP (completion
data) + one PCIe Memory Write TLP (MSI-X interrupt).
Performance-critical path:
Submit: CPU writes doorbell → 1 posted Memory Write TLP (fast)
Complete: Device DMAs completion → triggers MSI-X → CPU reads CQ
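Data: Device DMAs data to/from host memory → bulk Memory Write TLPs toward the host (NVMe reads), or Memory Read requests answered by Completions with Data (NVMe writes)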
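The per-queue doorbells starting at offset 0x1000 are spaced by a stride taken from CAP.DSTRD (CAP bits 35:32): the submission queue y tail doorbell sits at 0x1000 + (2y) * stride and the completion queue y head doorbell at 0x1000 + (2y + 1) * stride, with stride = 4 << DSTRD. A minimal sketch of that layout follows. The function names are illustrative, and bar0 is assumed to point at a mapping of the controller's BAR0.

```c
#include <assert.h>
#include <stdint.h>

#define NVME_DOORBELL_BASE 0x1000u

/* Doorbell stride in bytes: 4 << CAP.DSTRD. */
uint32_t nvme_db_stride(uint64_t cap)
{
    return 4u << ((cap >> 32) & 0xF);
}

/* BAR0 offset of the Submission Queue `qid` Tail Doorbell. */
uint32_t nvme_sq_tail_db(uint64_t cap, uint16_t qid)
{
    return NVME_DOORBELL_BASE + (2u * qid) * nvme_db_stride(cap);
}

/* BAR0 offset of the Completion Queue `qid` Head Doorbell. */
uint32_t nvme_cq_head_db(uint64_t cap, uint16_t qid)
{
    return NVME_DOORBELL_BASE + (2u * qid + 1u) * nvme_db_stride(cap);
}

/* Ring the SQ tail doorbell: one aligned 32-bit store, which reaches the
 * device as a single posted Memory Write TLP. A real driver issues a
 * write barrier first so the SQ entries are visible before the doorbell. */
void nvme_ring_sq(volatile uint8_t *bar0, uint64_t cap, uint16_t qid,
                  uint16_t tail)
{
    *(volatile uint32_t *)(bar0 + nvme_sq_tail_db(cap, qid)) = tail;
}
```

With DSTRD = 0 the admin queue doorbells are at 0x1000 and 0x1004, and I/O queue 1 uses 0x1008 and 0x100C.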
NVMe and PCIe MPS/MRRS:
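NVMe data transfers are bounded by MPS and MRRS. MPS caps the
payload of any single TLP: the device's Memory Writes (NVMe reads)
and the Completions that return read data to it. MRRS caps how much
the device may request in one Memory Read (NVMe writes, SQ entry
fetches). Larger values mean fewer TLPs for the same data,
reducing per-TLP overhead. Setting MPS=256 or 512 can improve
throughput.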
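The effect of MPS on TLP count is simple ceiling division, and both MPS and MRRS are stored in Device Control as log2(bytes / 128). A minimal sketch follows. The function names are illustrative, and the TLP count ignores the extra splits that address-boundary rules can add.

```c
#include <assert.h>
#include <stdint.h>

/* Number of TLPs needed to move `bytes` when each carries at most `max`. */
uint32_t tlp_count(uint32_t bytes, uint32_t max)
{
    assert(max != 0);
    return (bytes + max - 1) / max;
}

/* Encode a size for the MPS or MRRS field: 128 -> 0, 256 -> 1, ... 4096 -> 5. */
unsigned size_to_field(uint32_t bytes)
{
    unsigned code = 0;

    assert(bytes >= 128 && bytes <= 4096 && (bytes & (bytes - 1)) == 0);
    while ((128u << code) < bytes)
        code++;
    return code;
}

/* Device Control: MPS lives in bits 7:5, MRRS in bits 14:12.
 * All other bits of the register are preserved. */
uint16_t devctl_set_mps_mrrs(uint16_t devctl, uint32_t mps, uint32_t mrrs)
{
    devctl &= (uint16_t)~((0x7u << 5) | (0x7u << 12));
    return (uint16_t)(devctl | (size_to_field(mps) << 5)
                             | (size_to_field(mrrs) << 12));
}
```

A 4 KiB transfer needs 32 TLPs at MPS=128, 16 at MPS=256 and 8 at MPS=512. MPS must not exceed what every device on the path supports, which is why it is normally set by the platform or the PCI core rather than by an individual driver.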
PCIe Peer-to-Peer (P2P) DMA
P2P DMA allows one PCIe device to directly read/write another
PCIe device's memory (BARs), without going through system DRAM.
Use cases:
- GPUDirect Storage: NVMe SSD → GPU VRAM (bypass CPU/DRAM)
- GPUDirect RDMA: NIC → GPU VRAM (bypass CPU/DRAM)
- FPGA ↔ GPU direct communication
Requirements:
- Both devices must be behind the same Root Port (or the
Root Complex must support P2P routing)
- ACS must be configured to allow P2P (or disabled)
- IOMMU must allow the P2P DMA mappings
How it works:
Device A wants to DMA to Device B's BAR:
1. Device A issues a Memory Write TLP with Device B's BAR address
2. The switch (or Root Complex) routes the TLP directly to
Device B based on address matching
3. The data goes Device A → Switch → Device B (never touches DRAM)
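Step 2 is ordinary address routing: a switch compares the TLP address with the memory window each downstream port forwards and sends anything that matches no window upstream. A minimal sketch of that decision, with illustrative types and made-up window addresses; it models the routing rule only and leaves out ACS redirection, which can force a matching TLP upstream anyway.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROUTE_UPSTREAM (-1)

/* Memory window forwarded by one downstream port (derived from its
 * bridge Memory Base/Limit registers); `limit` is inclusive. */
struct port_window {
    int      port;
    uint64_t base;
    uint64_t limit;
};

/* Pick the egress port for a memory TLP that arrived from below. */
int route_mem_tlp(const struct port_window *w, size_t n, uint64_t addr)
{
    for (size_t i = 0; i < n; i++)
        if (addr >= w[i].base && addr <= w[i].limit)
            return w[i].port;
    return ROUTE_UPSTREAM;   /* no downstream window claims the address */
}
```

If Device B's BAR lies inside the window of port 2, a write from Device A on port 1 leaves through port 2 and never reaches the Root Complex.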
Linux kernel P2P support (drivers/pci/p2pdma.c):
pci_p2pdma_distance(): check if P2P is feasible
pci_alloc_p2pmem(): allocate P2P-capable memory
pci_p2pdma_map_sg(): DMA-map for P2P transfer (older kernels; newer
kernels map P2P pages through the regular dma_map_sg() path)
Limitations:
- Not all Root Complexes support P2P (some force all traffic
through DRAM). Intel chipsets historically didn't support
P2P across Root Ports.
- Switch-based P2P works more reliably (both devices behind
same PCIe switch).
- IOMMU interaction: P2P requests may or may not go through
the IOMMU depending on topology and ACS configuration.
PCIe 6.0/7.0: FLIT Mode
FLIT (Flow control unIT) mode fundamentally changes the
Transaction and Data Link Layer for Gen 6+:
Traditional mode (Gen 1-5):
Variable-length TLPs, each with its own LCRC.
DLLPs separate from TLPs.
Ack/Nak per TLP.
FLIT mode (Gen 6+):
Fixed 256-byte FLITs. Every FLIT is exactly 256 bytes.
FLIT structure:
┌──────────────────────────────────────────────────────┐
│ 236 bytes: TLP data │
│ (can contain multiple small TLPs, or fragments │
│ of large TLPs, packed tightly) │
├──────────────────────────────────────────────────────┤
│ 6 bytes: FLIT header (type, sequence#, credits) │
├──────────────────────────────────────────────────────┤
│ 8 bytes: CRC (covers entire FLIT) │
├──────────────────────────────────────────────────────┤
│ 6 bytes: FEC (Forward Error Correction) │
└──────────────────────────────────────────────────────┘
Total: 256 bytes
Benefits:
- No per-TLP ECRC/LCRC overhead (single CRC per FLIT)
- Small TLPs packed efficiently (no wasted link bandwidth)
- FEC corrects most errors in place, without retransmission
(three interleaved codes, each able to correct a single-byte
error); the CRC catches what FEC cannot and triggers a FLIT replay
- Flow control credits embedded in every FLIT (no separate
UpdateFC DLLPs needed)
- Simplified hardware: fixed-size processing pipeline
Efficiency comparison:
64-byte Memory Write TLP:
Gen 5 (traditional): 64B data + 16B header + 4B ECRC + 4B LCRC
+ 2B seq# = 90 bytes → 71% efficiency
Gen 6 (FLIT): Packed into 236B payload area with other TLPs.
FLIT framing alone is 236/256 ≈ 92% efficient. Counting the
16B TLP header as well, payload efficiency is about
64/80 × 236/256 ≈ 74%.
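These percentages can be reproduced with integer arithmetic. The sketch below covers the traditional case, the raw FLIT framing ratio, and the combined case where the 16-byte TLP header is counted inside the FLIT. That last case assumes the header size is unchanged in FLIT mode and that FLITs are fully packed. Function names are illustrative, and results are in hundredths of a percent, rounded down.

```c
#include <assert.h>
#include <stdint.h>

#define FLIT_BYTES     256u  /* total FLIT size */
#define FLIT_TLP_BYTES 236u  /* bytes of each FLIT available to TLPs */

/* payload / (payload + overhead), in hundredths of a percent. */
uint32_t efficiency_bp(uint32_t payload, uint32_t overhead)
{
    return (payload * 10000u) / (payload + overhead);
}

/* Efficiency of TLPs packed back to back into FLITs: the TLP header is
 * the only per-TLP overhead, and the result is scaled by 236/256 for
 * the per-FLIT DLP, CRC and FEC bytes. */
uint32_t flit_efficiency_bp(uint32_t payload, uint32_t header)
{
    uint64_t num = (uint64_t)payload * FLIT_TLP_BYTES * 10000u;
    uint64_t den = (uint64_t)(payload + header) * FLIT_BYTES;
    return (uint32_t)(num / den);
}
```

efficiency_bp(64, 26) gives 7111 (71.11%) for the traditional 90-byte case, flit_efficiency_bp(64, 0) gives 9218 (92.18%) for the framing alone, and flit_efficiency_bp(64, 16) gives 7375 (73.75%) once the header is included.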
L0p power state (Gen 6+):
New link power state for FLIT mode, where it takes the place of
L0s (L0s is not supported in FLIT mode).
Reduces link width dynamically (e.g., x8 → x2) while
maintaining active state. No exit latency for remaining lanes.
Provides power savings proportional to reduced width.
Other Advanced Features
PCIe IDE (Integrity and Data Encryption):
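Encrypts and integrity-protects TLPs, either hop by hop (Link IDE
stream) or end to end between two ports across switches (Selective
IDE stream).
AES-GCM with 256-bit keys for encryption + integrity protection.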
Protects against physical attacks on PCIe links
(interposers, protocol analyzers that modify data).
Negotiated via IDE capability and DOE/SPDM.
DOE (Data Object Exchange):
Generic mechanism for exchanging data objects via config space.
Used for CMA (Component Measurement and Authentication),
SPDM (Security Protocol and Data Model), and IDE setup.
Replaces vendor-specific mailbox protocols.
MCTP (Management Component Transport Protocol):
Sideband management protocol. Can run over PCIe (using
VDM = Vendor Defined Messages). Used for BMC communication
with PCIe devices (firmware update, telemetry, etc.).
TPH (TLP Processing Hints):
TLP carries Steering Tag hints that tell the completer
(Root Complex / memory controller) where to place the data
in the CPU cache hierarchy. Can direct DMA data to specific
cache levels or bypass cache entirely.
Useful for NIC receive buffers (place in L2 cache of the
CPU that will process the packet).
Coherent Interconnects layered on PCIe:
CXL: Cache/memory coherency (see above)
CCIX: Cache Coherent Interconnect for Accelerators
(ARM-ecosystem, being superseded by CXL)
Gen-Z: Fabric-based memory-semantic interconnect (defunct,
merged into CXL efforts)
OpenCAPI: Open Coherent Accelerator Processor Interface
(IBM Power, now folded into CXL)
UCIe: Universal Chiplet Interconnect Express
Chiplet-to-chiplet interconnect. Uses PCIe/CXL
protocols at the die-to-die level. Physical layer
designed for short-reach (in-package) signaling.
17. Key References
Specifications
- PCI Express Base Specification, PCI-SIG. Revision 6.1 (2024), 7.0 (2025). The authoritative source. Available to PCI-SIG members. Covers all layers, TLP formats, capabilities, LTSSM, etc. (~1500 pages for Rev 5.0).
- PCI Local Bus Specification, PCI-SIG, Revision 3.0. Covers legacy PCI config space layout, BAR mechanism, capability list. Still relevant for understanding PCIe's software interface.
- Single Root I/O Virtualization (SR-IOV) Specification, PCI-SIG, Revision 1.1 (2010). SR-IOV capability, VF lifecycle, BARs.
- VIRTIO Specification, OASIS, Version 1.2+ (2022). Virtio-PCI transport chapter covers capability structures, notification, MSI-X usage.
- NVM Express Base Specification, NVM Express Inc., Revision 2.0+ (2021). NVMe register layout in PCIe BAR0, doorbell mechanism, queue architecture.
- CXL Specification, CXL Consortium, Revision 3.1 (2023). CXL.io/cache/mem protocols, device types, relationship to PCIe PHY.
Books
- PCI Express System Architecture, Ravi Budruk, Don Anderson, Tom Shanley, MindShare/Addison-Wesley, 2003. The classic comprehensive reference. Covers all three layers, LTSSM, flow control, ordering, config space in great detail.
- PCI Express Technology 3.0, Mike Jackson, Ravi Budruk, MindShare Press, 2012. Updated for Gen 3, covers 128b/130b encoding, equalization, AER, extended capabilities.
- Linux Device Drivers, 3rd Edition, Jonathan Corbet, Alessandro Rubini, Greg Kroah-Hartman, O'Reilly, 2005. Chapter 12 covers PCI driver basics (somewhat dated API but concepts remain).
Linux Kernel Documentation
- kernel.org/doc/html/latest/PCI/pci.html — Linux PCI driver API
- kernel.org/doc/html/latest/PCI/msi-howto.html — MSI/MSI-X howto
- kernel.org/doc/html/latest/PCI/pci-error-recovery.html — Error recovery
- kernel.org/doc/html/latest/PCI/pciebus-howto.html — PCIe port bus driver
- kernel.org/doc/html/latest/PCI/sysfs-pci.html — sysfs interface
- kernel.org/doc/html/latest/PCI/endpoint/ — PCI endpoint framework
- kernel.org/doc/html/latest/driver-api/vfio.html — VFIO driver API
- Source: include/uapi/linux/pci_regs.h — All register offset defines
Online Resources
- OSDev Wiki: PCI (wiki.osdev.org/PCI) — Configuration space layout, BAR decoding, MSI/MSI-X format, capability structures. Excellent implementation reference.
- OSDev Wiki: PCI Express (wiki.osdev.org/PCI_Express) — ECAM access, extended config space.
- Xillybus PCIe Tutorials (xillybus.com/tutorials/) — Excellent TLP format walkthrough, flow control explanation, practical examples.
- QEMU source: include/hw/pci/pci.h, hw/pci/, hw/virtio/ — Reference implementation for PCI device emulation.
Papers and Talks
- "Understanding PCIe Performance for End Host Networking", Neugebauer et al., SIGCOMM 2018. Measures real PCIe performance bottlenecks: TLP overhead, credit stalls, ordering constraints. Essential reading for understanding PCIe performance in practice.
- "FPsPIN: An FPGA-based Open-Hardware Research Platform for Network-Attached Accelerators", Di Girolamo et al., 2023. PCIe endpoint implementation details.
- "Optimus Prime: Accelerating Data Transformation and Transfer for GPU-Native Analytics", Chrysogelos et al., PVLDB 2023. GPU Direct / P2P DMA architecture for database analytics.
- "Characterizing PCIe Congestion in an HPC Cluster", Li et al., SC 2020. PCIe congestion analysis in multi-GPU/multi-NIC configurations.
Last updated: 2026-03-27