KVM Internals: Building Hypervisors with the Kernel-based Virtual Machine
A comprehensive, implementation-focused reference for KVM (Kernel-based Virtual Machine). Covers everything needed to build a working VMM from scratch: the ioctl API, hardware virtualization mechanics (VT-x/AMD-V/ARM/RISC-V), memory virtualization, interrupt handling, I/O models, live migration, security features, and kernel internals. Includes complete code examples and architecture diagrams.
Table of Contents
- Architecture & Core Concepts
- KVM API & ioctl Interface
- Hardware Virtualization Support
- Memory Virtualization
- Interrupt Virtualization
- I/O Virtualization
- vCPU Scheduling & Performance
- Live Migration
- Security & Confidential Computing
- KVM Kernel Internals
- Building a Minimal VMM
- Advanced Topics
- Key References
1. Architecture & Core Concepts
1.1 What KVM Is
KVM is a Linux kernel module (kvm.ko plus architecture-specific modules like kvm-intel.ko or kvm-amd.ko) that turns the Linux kernel into a hypervisor. It was created by Avi Kivity at Qumranet in 2006 and merged into Linux 2.6.20 (February 2007). KVM leverages existing Linux infrastructure -- the scheduler, memory management, device drivers -- rather than reimplementing them. Each VM is a regular Linux process; each vCPU is a regular Linux thread.
KVM Architecture:
┌─────────────────────────────────────────────────────────────┐
│ User Space │
│ │
│ ┌─────────┐ ┌───────────┐ ┌────────────┐ ┌──────────┐ │
│ │ QEMU │ │Firecracker│ │Cloud Hyper- │ │ crosvm │ │
│ │ │ │ │ │ visor │ │ │ │
│ │ (full │ │(microVM, │ │(rust-vmm │ │(ChromeOS │ │
│ │ device │ │ minimal │ │ based) │ │ VMs) │ │
│ │ model) │ │ devices) │ │ │ │ │ │
│ └────┬────┘ └─────┬─────┘ └─────┬──────┘ └────┬─────┘ │
│ │ │ │ │ │
│ │ ioctl(/dev/kvm) │ │ │
├───────┼─────────────┼──────────────┼──────────────┼────────┤
│ ▼ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ KVM Module │ │
│ │ ┌───────────┐ ┌───────────┐ ┌─────────────────┐ │ │
│ │ │ VM Mgmt │ │ vCPU Mgmt │ │ Memory Mgmt │ │ │
│ │ │ │ │ │ │ (EPT/NPT/ │ │ │
│ │ │ │ │ │ │ Stage-2) │ │ │
│ │ └───────────┘ └───────────┘ └─────────────────┘ │ │
│ │ ┌───────────┐ ┌───────────┐ ┌─────────────────┐ │ │
│ │ │ IRQ Chip │ │ Timer │ │ I/O Handling │ │ │
│ │ │ (LAPIC, │ │ (PIT/ │ │ (PIO/MMIO │ │ │
│ │ │ IOAPIC) │ │ HPET) │ │ intercept) │ │ │
│ │ └───────────┘ └───────────┘ └─────────────────┘ │ │
│ └──────────────────────┬──────────────────────────────┘ │
│ │ │
│ ┌──────────┴──────────┐ │
│ │ Hardware VT │ │
│ │ (VMX/SVM/EL2/H) │ │
│ └─────────────────────┘ │
│ Kernel Space │
└─────────────────────────────────────────────────────────────┘
1.2 Type 1 vs Type 2 Debate
KVM defies clean classification:
- Type 1 (bare-metal): Xen, VMware ESXi, Hyper-V -- hypervisor runs directly on hardware
- Type 2 (hosted): VMware Workstation, VirtualBox -- hypervisor runs on top of a host OS
KVM is often described as Type 2 because it is a kernel module loaded into a general-purpose OS, but once loaded, the Linux kernel itself is the hypervisor, which makes it functionally Type 1. On Intel hardware, the host kernel runs in VMX root mode (ring 0) and guest code runs in VMX non-root mode. The Linux kernel serves as the hypervisor and also handles scheduling, memory management, and device drivers -- roles traditionally split between the hypervisor (Type 1) and the host OS (Type 2). There is no settled consensus: some authors informally call KVM "Type 1.5", others simply note that the classification does not cleanly apply.
1.3 KVM vs Other VMMs
KVM is the hypervisor (the kernel component). A VMM (Virtual Machine Monitor) is the userspace component that uses KVM. The most important VMMs:
| VMM | Language | Use Case | Code Size | Key Features |
|---|---|---|---|---|
| QEMU | C | General-purpose | ~2M LoC | Full device emulation, TCG fallback, migration, snapshots |
| Firecracker | Rust | Serverless (AWS Lambda/Fargate) | ~50K LoC | Minimal devices, <125ms boot, rate limiters, jailer |
| Cloud Hypervisor | Rust | Cloud workloads | ~100K LoC | rust-vmm based, VFIO, vhost-user, PVH boot |
| crosvm | Rust | ChromeOS | ~150K LoC | Sandboxed processes per device, GPU passthrough, Wayland |
| StratoVirt | Rust | Huawei Cloud | ~80K LoC | Lightweight/standard modes, hot-plugging |
| libkrun | Rust+C | Container-like VMs | ~30K LoC | Library form factor, macOS/Linux |
| Kata Containers | Go+Rust | Secure containers | Varies | OCI-compatible, uses QEMU/Firecracker/Cloud Hypervisor |
1.4 The VM/vCPU Lifecycle
┌──────────────────────────────────────────────┐
│ VMM Process Lifecycle │
│ │
open(/dev/kvm) ──► │ 1. KVM_GET_API_VERSION (verify == 12) │
│ 2. KVM_CHECK_EXTENSION (probe capabilities) │
│ 3. KVM_CREATE_VM ──► returns vm_fd │
│ │
vm_fd ──────────► │ 4. KVM_SET_TSS_ADDR (x86 specific) │
│ 5. KVM_SET_IDENTITY_MAP_ADDR (x86) │
│ 6. KVM_CREATE_IRQCHIP (in-kernel APIC) │
│ 7. KVM_CREATE_PIT2 (i8254 timer) │
│ 8. KVM_SET_USER_MEMORY_REGION (guest RAM) │
│ 9. KVM_CREATE_VCPU ──► returns vcpu_fd │
│ │
vcpu_fd ────────► │ 10. mmap(vcpu_fd) ──► kvm_run shared page │
│ 11. KVM_SET_CPUID2 (CPUID leaves) │
│ 12. KVM_SET_SREGS (CR0, CR3, CR4, segments) │
│ 13. KVM_SET_REGS (RIP, RSP, RFLAGS) │
│ 14. KVM_SET_MSRS (EFER, STAR, etc.) │
│ │
vCPU Run Loop ──► │ 15. loop { │
│ KVM_RUN ──► blocks until VM exit │
│ match kvm_run.exit_reason { │
│ IO => handle_pio() │
│ MMIO => handle_mmio() │
│ HLT => wait_or_idle() │
│ SHUTDOWN => break │
│ ... │
│ } │
│ } │
│ │
Teardown ───────► │ 16. close(vcpu_fd) │
│ 17. close(vm_fd) │
│ 18. close(kvm_fd) │
└──────────────────────────────────────────────┘
2. KVM API & ioctl Interface
The KVM API is entirely ioctl-based, operating on three levels of file descriptors:
- System fd (/dev/kvm) -- global operations
- VM fd (from KVM_CREATE_VM) -- per-VM operations
- vCPU fd (from KVM_CREATE_VCPU) -- per-vCPU operations
2.1 System ioctls (on /dev/kvm)
#include <linux/kvm.h>
#include <sys/ioctl.h>
// Open the KVM device
int kvm_fd = open("/dev/kvm", O_RDWR | O_CLOEXEC);
// --- KVM_GET_API_VERSION ---
// Returns the KVM API version. MUST be 12 (stable since 2007).
// Any other value means incompatible.
int api_version = ioctl(kvm_fd, KVM_GET_API_VERSION, 0);
assert(api_version == 12);
// --- KVM_CHECK_EXTENSION ---
// Probe for optional capabilities. Returns 0 (unsupported) or positive value.
// Critical extensions to check:
int has_irqchip = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_IRQCHIP);
int has_pit2 = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_PIT2);
int has_user_mem = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_USER_MEMORY);
int has_set_tss = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SET_TSS_ADDR);
int has_ext_cpuid = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_EXT_CPUID);
int has_irqfd = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_IRQFD);
int has_ioeventfd = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_IOEVENTFD);
int has_dirty_log = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_DIRTY_LOG_RING);
int max_vcpus = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_MAX_VCPUS);
int has_imm_exit = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_IMMEDIATE_EXIT);
int has_tsc_ctrl = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_TSC_CONTROL);
int has_tsc_deadline = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_TSC_DEADLINE_TIMER);
int has_split_irq = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SPLIT_IRQCHIP);
int has_readonly_mem = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_READONLY_MEM);
int has_multi_addr = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_MULTI_ADDRESS_SPACE);
// --- KVM_GET_VCPU_MMAP_SIZE ---
// Returns the size of the kvm_run struct mmap region for each vCPU.
// Must be called before mmapping vCPU fds.
int vcpu_mmap_size = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0);
// Typically one page (4096) or slightly more.
// --- KVM_CREATE_VM ---
// Creates a new VM. Returns a VM file descriptor.
// The argument is machine type (0 = default, non-zero for ARM/MIPS machine types).
int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);
// --- KVM_GET_SUPPORTED_CPUID (x86 only) ---
// Get the host's CPUID leaves that KVM supports.
// Used to filter/configure the guest's CPUID.
struct kvm_cpuid2 *cpuid = calloc(1, sizeof(*cpuid) + 256 * sizeof(cpuid->entries[0]));
cpuid->nent = 256;
ioctl(kvm_fd, KVM_GET_SUPPORTED_CPUID, cpuid);
2.2 VM ioctls (on vm_fd)
// --- KVM_SET_TSS_ADDR (x86 only) ---
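// Set the guest physical address of a three-page region that KVM uses for the
// Task State Segment (TSS). Required on Intel hosts: without unrestricted guest,
// KVM runs real-mode code through a vm86 task, which needs a TSS.
// Must be within the first 4GB and must not overlap any memory slot or MMIO range.
// Common choice: 0xFFFBD000.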
ioctl(vm_fd, KVM_SET_TSS_ADDR, 0xFFFBD000);
// --- KVM_SET_IDENTITY_MAP_ADDR (x86 only) ---
// Set the address for KVM's internal identity-mapped page table.
// Required for real-mode emulation. Must not overlap with guest RAM.
// Common choice: 0xFFFBC000.
uint64_t identity_base = 0xFFFBC000;
ioctl(vm_fd, KVM_SET_IDENTITY_MAP_ADDR, &identity_base);
// --- KVM_CREATE_IRQCHIP ---
// Create in-kernel PIC (8259), IOAPIC, and LAPIC.
// Must be done BEFORE creating vCPUs.
ioctl(vm_fd, KVM_CREATE_IRQCHIP, 0);
// --- KVM_CREATE_PIT2 ---
// Create an in-kernel i8254 PIT (Programmable Interval Timer).
struct kvm_pit_config pit_config = { .flags = KVM_PIT_SPEAKER_DUMMY };
ioctl(vm_fd, KVM_CREATE_PIT2, &pit_config);
// --- KVM_SET_USER_MEMORY_REGION ---
// Map a region of the VMM process's virtual memory as guest physical memory.
// This is the fundamental mechanism for providing guest RAM.
struct kvm_userspace_memory_region region = {
    .slot = 0,                          // Memory slot ID (0-based)
    .flags = 0,                         // 0 or KVM_MEM_LOG_DIRTY_PAGES or KVM_MEM_READONLY
    .guest_phys_addr = 0x0,             // Guest physical address (GPA) start
    .memory_size = 256 * 1024 * 1024,   // Size in bytes (256 MB)
    .userspace_addr = (uint64_t)mmap(NULL, 256 * 1024 * 1024,
                                     PROT_READ | PROT_WRITE,
                                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
                                     -1, 0),
};
ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
// --- KVM_CREATE_VCPU ---
// Create a virtual CPU. Returns a vCPU file descriptor.
// Argument is the vCPU ID (0, 1, 2, ...).
int vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);
// --- KVM_IRQFD ---
// Bind an eventfd to a guest IRQ line for direct injection.
// When the eventfd is signaled, KVM injects the interrupt without a VMM ioctl.
int efd = eventfd(0, EFD_CLOEXEC);
struct kvm_irqfd irqfd = {
.fd = efd,
.gsi = 5, // Guest System Interrupt number
.flags = 0, // or KVM_IRQFD_FLAG_RESAMPLE for level-triggered
};
ioctl(vm_fd, KVM_IRQFD, &irqfd);
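// For level-triggered IRQs with EOI notification, use a separate trigger eventfd
// (efd above is already bound to GSI 5):
int level_efd = eventfd(0, EFD_CLOEXEC);
int resample_efd = eventfd(0, EFD_CLOEXEC);
struct kvm_irqfd irqfd_level = {
    .fd = level_efd,
    .gsi = 10,
    .flags = KVM_IRQFD_FLAG_RESAMPLE,
    .resamplefd = resample_efd, // KVM signals this when guest does EOI
};
ioctl(vm_fd, KVM_IRQFD, &irqfd_level);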
// --- KVM_IOEVENTFD ---
// Bind an eventfd to a specific I/O port or MMIO address.
// When the guest writes to this address, KVM signals the eventfd
// WITHOUT causing a VM exit to the VMM. Used as virtio doorbell.
int io_efd = eventfd(0, EFD_CLOEXEC);
struct kvm_ioeventfd ioeventfd = {
.datamatch = 0, // Optional data value to match
.addr = 0x500, // PIO port or MMIO address
.len = 4, // Access width (1, 2, 4, or 8 bytes)
.fd = io_efd,
.flags = 0, // KVM_IOEVENTFD_FLAG_PIO for port I/O
// KVM_IOEVENTFD_FLAG_DATAMATCH to filter by value
};
ioctl(vm_fd, KVM_IOEVENTFD, &ioeventfd);
// --- KVM_SET_GSI_ROUTING ---
// Configure IRQ routing table. Maps GSI numbers to interrupt controller pins.
struct kvm_irq_routing *routing;
int num_entries = 2; // only the two entries filled in below; zeroed entries have an invalid type and make the ioctl fail
size_t size = sizeof(*routing) + num_entries * sizeof(routing->entries[0]);
routing = calloc(1, size);
routing->nr = num_entries;
// Example: route GSI 0 to IOAPIC pin 2 (PIT timer on IOAPIC)
routing->entries[0].gsi = 0;
routing->entries[0].type = KVM_IRQ_ROUTING_IRQCHIP;
routing->entries[0].u.irqchip.irqchip = KVM_IRQCHIP_IOAPIC;
routing->entries[0].u.irqchip.pin = 2;
// Example: MSI routing
routing->entries[1].gsi = 24; // first non-legacy GSI
routing->entries[1].type = KVM_IRQ_ROUTING_MSI;
routing->entries[1].u.msi.address_lo = 0xFEE00000; // LAPIC base + dest
routing->entries[1].u.msi.address_hi = 0;
routing->entries[1].u.msi.data = 0x41; // vector 0x41, edge-triggered
ioctl(vm_fd, KVM_SET_GSI_ROUTING, routing);
// --- KVM_GET_DIRTY_LOG ---
// Retrieve bitmap of pages dirtied by the guest since last call.
// Used for live migration.
struct kvm_dirty_log dirty_log = {
.slot = 0, // memory slot
};
// bitmap: 1 bit per page, pre-allocated and rounded up to a multiple of 64 bits
size_t bitmap_size = ((region.memory_size / 4096 + 63) / 64) * 8;
dirty_log.dirty_bitmap = calloc(1, bitmap_size);
ioctl(vm_fd, KVM_GET_DIRTY_LOG, &dirty_log);
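// --- KVM_CLEAR_DIRTY_LOG (requires KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 to be enabled) ---
// With manual protection enabled, KVM_GET_DIRTY_LOG only reads the bitmap.
// This ioctl then clears the dirty state (re-write-protects the pages) for every
// bit that is set in the bitmap you pass in. first_page must be a multiple of 64.
struct kvm_clear_dirty_log clear = {
    .slot = 0,
    .num_pages = 1024,
    .first_page = 0,
    .dirty_bitmap = dirty_log.dirty_bitmap, // bits to clear, as returned by KVM_GET_DIRTY_LOG
};
ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clear);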
// --- KVM_ENABLE_CAP ---
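// Enable a specific VM capability. Split irqchip is an alternative to
// KVM_CREATE_IRQCHIP: enable one or the other on a given VM, not both.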
struct kvm_enable_cap cap = {
.cap = KVM_CAP_SPLIT_IRQCHIP,
.args[0] = 24, // number of GSI routes for split irqchip
};
ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
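The bitmap filled in by KVM_GET_DIRTY_LOG still has to be turned into a list of pages to resend. The following is a minimal sketch of that step, not taken from any particular VMM: the helper names (dirty_bitmap_bytes, collect_dirty_gpas) are hypothetical, it assumes 4KB pages and a little-endian host, and it relies on the GCC/Clang builtin __builtin_ctzll.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_4K 4096ULL

/* Bytes needed for a dirty bitmap covering `slot_bytes` of guest memory,
 * rounded up to a whole number of 64-bit words. */
static size_t dirty_bitmap_bytes(uint64_t slot_bytes)
{
    uint64_t pages = (slot_bytes + PAGE_SIZE_4K - 1) / PAGE_SIZE_4K;
    return (size_t)(((pages + 63) / 64) * 8);
}

/* Walk the bitmap and store the guest physical address of each dirty page
 * in `out`. Bit N of the bitmap describes page N of the slot.
 * Returns the number of dirty pages found; at most `max_out` are stored. */
static size_t collect_dirty_gpas(const uint64_t *bitmap, uint64_t slot_bytes,
                                 uint64_t slot_gpa, uint64_t *out, size_t max_out)
{
    assert(bitmap != NULL && out != NULL);
    uint64_t pages = (slot_bytes + PAGE_SIZE_4K - 1) / PAGE_SIZE_4K;
    size_t found = 0;

    for (uint64_t word = 0; word * 64 < pages; word++) {
        uint64_t bits = bitmap[word];
        while (bits) {
            unsigned bit = (unsigned)__builtin_ctzll(bits); /* lowest set bit */
            uint64_t page = word * 64 + bit;
            bits &= bits - 1;                               /* clear that bit */
            if (page >= pages)
                break;                                      /* padding bits past the slot end */
            if (found < max_out)
                out[found] = slot_gpa + page * PAGE_SIZE_4K;
            found++;
        }
    }
    return found;
}
```

In a migration loop, a VMM would typically call something like collect_dirty_gpas after each KVM_GET_DIRTY_LOG and copy the listed pages to the destination.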
2.3 vCPU ioctls (on vcpu_fd)
// --- mmap the kvm_run structure ---
// This shared memory region is how KVM communicates exit information to userspace.
struct kvm_run *run = mmap(NULL, vcpu_mmap_size,
PROT_READ | PROT_WRITE, MAP_SHARED,
vcpu_fd, 0);
// --- KVM_SET_CPUID2 (x86 only) ---
// Set the CPUID leaves the guest will see.
// Typically start from KVM_GET_SUPPORTED_CPUID and modify.
// Critical modifications:
// - Set vendor string
// - Hide hypervisor features you don't support
// - Expose KVM paravirt CPUID (0x40000000, 0x40000001)
ioctl(vcpu_fd, KVM_SET_CPUID2, cpuid);
// --- KVM_GET_SREGS / KVM_SET_SREGS ---
// Get/set special registers: segment registers, CR0-CR4, EFER, IDT, GDT, etc.
struct kvm_sregs sregs;
ioctl(vcpu_fd, KVM_GET_SREGS, &sregs);
// For real mode (simplest setup):
sregs.cs.base = 0;
sregs.cs.selector = 0;
// For protected mode: set CS/DS/ES/SS/FS/GS with proper limits, types, and DPL
// For long mode: set CR0.PE, CR0.PG, CR4.PAE, EFER.LME, EFER.LMA, CR3 = page table
ioctl(vcpu_fd, KVM_SET_SREGS, &sregs);
// --- KVM_GET_REGS / KVM_SET_REGS ---
// Get/set general-purpose registers, RIP, RSP, RFLAGS.
struct kvm_regs regs = {
    .rip = 0x1000,  // Entry point
    .rsp = 0x0,     // Stack pointer
    .rflags = 0x2,  // Bit 1 must be set (reserved)
    // rax, rbx, rcx, rdx, rsi, rdi, rbp, r8-r15 all zero
};
ioctl(vcpu_fd, KVM_SET_REGS, &regs);
// --- KVM_SET_MSRS ---
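// Set Model-Specific Registers. Critical MSRs:
// Note: the MSR_* index constants are not exported by <linux/kvm.h>; define them
// yourself (EFER = 0xC0000080, STAR = 0xC0000081, LSTAR = 0xC0000082, IA32_TSC = 0x10).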
struct {
struct kvm_msrs header;
struct kvm_msr_entry entries[10];
} msrs = {
.header.nmsrs = 4,
.entries = {
{ .index = MSR_IA32_EFER, .data = 0 }, // EFER (LME, LMA, SCE)
{ .index = MSR_STAR, .data = 0 }, // SYSCALL target
{ .index = MSR_LSTAR, .data = 0 }, // 64-bit SYSCALL entry
{ .index = MSR_IA32_TSC, .data = 0 }, // TSC value
},
};
ioctl(vcpu_fd, KVM_SET_MSRS, &msrs);
// --- KVM_GET_MSRS ---
// Read MSRs. Specify which MSRs to read in the entries array.
// On return, data fields are filled in. Returns number of MSRs read.
msrs.header.nmsrs = 1;
msrs.entries[0].index = MSR_IA32_TSC;
int nmsrs = ioctl(vcpu_fd, KVM_GET_MSRS, &msrs);
// msrs.entries[0].data now contains TSC value
// --- KVM_GET_FPU / KVM_SET_FPU ---
struct kvm_fpu fpu;
ioctl(vcpu_fd, KVM_GET_FPU, &fpu);
// fpu.fpr[8][16] -- FP registers
// fpu.fcw -- FP control word
// fpu.xmm[16][16] -- SSE registers
// fpu.mxcsr -- MXCSR register
ioctl(vcpu_fd, KVM_SET_FPU, &fpu);
// --- KVM_GET_LAPIC / KVM_SET_LAPIC ---
// Get/set the LAPIC state (when in-kernel IRQCHIP is used).
struct kvm_lapic_state lapic;
ioctl(vcpu_fd, KVM_GET_LAPIC, &lapic);
// lapic.regs[KVM_APIC_REG_SIZE] -- raw APIC register page
ioctl(vcpu_fd, KVM_SET_LAPIC, &lapic);
// --- KVM_INTERRUPT ---
// Inject an external interrupt. Only when NOT using in-kernel IRQCHIP.
struct kvm_interrupt irq = { .irq = 0x30 }; // vector number
ioctl(vcpu_fd, KVM_INTERRUPT, &irq);
// --- KVM_SET_SIGNAL_MASK ---
// Set the signal mask that applies to the vCPU thread only while it is inside KVM_RUN.
// Signals not blocked by this mask can kick the vCPU out of guest mode (KVM_RUN fails with EINTR).
// len must be the kernel's sigset size (8 bytes), not glibc's sizeof(sigset_t).
sigset_t set;
sigemptyset(&set);
sigaddset(&set, SIGUSR1); // Only SIGUSR1 is blocked during KVM_RUN
struct kvm_signal_mask *sigmask = calloc(1, sizeof(*sigmask) + 8);
sigmask->len = 8;
memcpy(sigmask->sigset, &set, 8);
ioctl(vcpu_fd, KVM_SET_SIGNAL_MASK, sigmask);
// --- KVM_GET_VCPU_EVENTS / KVM_SET_VCPU_EVENTS ---
// Get/set pending exceptions, interrupts, NMIs, SMIs.
// Critical for migration (must preserve in-flight events).
struct kvm_vcpu_events events;
ioctl(vcpu_fd, KVM_GET_VCPU_EVENTS, &events);
// events.exception.injected, events.exception.nr, events.exception.has_error_code
// events.interrupt.injected, events.interrupt.nr, events.interrupt.soft
// events.nmi.injected, events.nmi.pending, events.nmi.masked
// events.sipi_vector
// --- KVM_RUN ---
// Enter guest mode. Returns when an exit needs userspace handling or a signal arrives.
ioctl(vcpu_fd, KVM_RUN, 0);
// Exit reason is in run->exit_reason
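The long-mode note under KVM_SET_SREGS lists the required bits but not how the page tables behind CR3 come about. As an illustration only, the sketch below builds a three-page identity map of the first 1GB using 2MB pages and returns the CR0/CR3/CR4/EFER values to copy into struct kvm_sregs. The names long_mode_regs and build_identity_map are hypothetical, and guest_mem is assumed to be the host pointer that backs GPA 0.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Architectural bit positions (Intel SDM / AMD APM). */
#define CR0_PE   (1ULL << 0)
#define CR0_PG   (1ULL << 31)
#define CR4_PAE  (1ULL << 5)
#define EFER_LME (1ULL << 8)
#define EFER_LMA (1ULL << 10)

#define PTE_PRESENT (1ULL << 0)
#define PTE_WRITE   (1ULL << 1)
#define PTE_PS      (1ULL << 7)   /* 2MB page when set in a page-directory entry */

struct long_mode_regs { uint64_t cr0, cr3, cr4, efer; };

/* Write PML4, PDPT and PD (one 4KB page each) into guest memory starting at
 * table_gpa, identity-mapping GPA 0..1GB with 512 2MB pages. */
static struct long_mode_regs build_identity_map(uint8_t *guest_mem, uint64_t table_gpa)
{
    assert((table_gpa & 0xFFF) == 0);            /* tables must be page-aligned */
    uint64_t *pml4 = (uint64_t *)(guest_mem + table_gpa);
    uint64_t *pdpt = (uint64_t *)(guest_mem + table_gpa + 0x1000);
    uint64_t *pd   = (uint64_t *)(guest_mem + table_gpa + 0x2000);

    memset(pml4, 0, 3 * 0x1000);
    pml4[0] = (table_gpa + 0x1000) | PTE_PRESENT | PTE_WRITE;
    pdpt[0] = (table_gpa + 0x2000) | PTE_PRESENT | PTE_WRITE;
    for (uint64_t i = 0; i < 512; i++)
        pd[i] = (i << 21) | PTE_PRESENT | PTE_WRITE | PTE_PS;

    struct long_mode_regs r = {
        .cr0  = CR0_PE | CR0_PG,
        .cr3  = table_gpa,
        .cr4  = CR4_PAE,
        .efer = EFER_LME | EFER_LMA,
    };
    return r;
}
```

The returned values would be OR-ed into sregs.cr0, sregs.cr4 and sregs.efer and assigned to sregs.cr3 before KVM_SET_SREGS. A 64-bit code segment (CS with the L bit set) is still needed and is not covered here.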
2.4 The kvm_run Shared Memory Region
The struct kvm_run is the shared memory region between KVM and userspace, mapped from the vCPU fd. An abridged view of its layout:
struct kvm_run {
/* in: set by userspace */
__u8 request_interrupt_window; // ask KVM to exit when interrupt window opens
__u8 immediate_exit; // exit immediately (for signal handling race fix)
__u8 padding1[6];
/* out: set by KVM on exit */
__u32 exit_reason; // WHY we exited (see below)
__u8 ready_for_interrupt_injection; // can we inject an interrupt now?
__u8 if_flag; // guest's IF flag (interrupts enabled?)
__u16 flags;
/* Architecture-specific state */
__u64 cr8; // guest CR8 (TPR)
__u64 apic_base; // guest APIC base MSR
/* Exit-reason-specific data (union) */
union {
/* KVM_EXIT_IO */
struct {
__u8 direction; // KVM_EXIT_IO_IN or KVM_EXIT_IO_OUT
__u8 size; // access width: 1, 2, or 4
__u16 port; // I/O port number
__u32 count; // number of accesses (for REP INS/OUTS)
__u64 data_offset; // offset into kvm_run page for data
} io;
/* KVM_EXIT_MMIO */
struct {
__u64 phys_addr; // guest physical address
__u8 data[8]; // data read/written
__u32 len; // access length
__u8 is_write; // 1 = write, 0 = read
} mmio;
/* KVM_EXIT_HYPERCALL */
struct {
__u64 nr; // hypercall number
__u64 args[6]; // arguments
__u64 ret; // return value (set by VMM)
__u32 longmode; // 1 = 64-bit mode
__u32 pad;
} hypercall;
/* KVM_EXIT_INTERNAL_ERROR */
struct {
__u32 suberror; // KVM_INTERNAL_ERROR_EMULATION,
// KVM_INTERNAL_ERROR_SIMUL_EX,
// KVM_INTERNAL_ERROR_DELIVERY_EV
__u32 ndata;
__u64 data[16]; // additional error data
} internal;
/* KVM_EXIT_SYSTEM_EVENT */
struct {
__u32 type; // KVM_SYSTEM_EVENT_SHUTDOWN,
// KVM_SYSTEM_EVENT_RESET,
// KVM_SYSTEM_EVENT_CRASH,
// KVM_SYSTEM_EVENT_WAKEUP,
// KVM_SYSTEM_EVENT_SUSPEND,
// KVM_SYSTEM_EVENT_SEV_TERM
__u32 ndata;
__u64 data[16];
} system_event;
/* KVM_EXIT_IOAPIC_EOI */
struct {
__u8 vector; // which vector was EOI'd
} eoi;
/* KVM_EXIT_HYPERV */
struct kvm_hyperv_exit hyperv; // Hyper-V specific exits
// ... other exit types
};
};
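The io member does not carry the data itself: data_offset points into the same mapping, and a string instruction such as REP OUTSB delivers count items of size bytes in one exit. The sketch below walks those items. It takes the run->io fields as plain arguments so that it compiles without kernel headers; pio_out_for_each, pio_log and pio_log_record are hypothetical names, and a little-endian host is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void (*pio_out_fn)(uint16_t port, uint32_t value, void *ctx);

/* run_base is the address of the mmap'ed kvm_run region; the remaining
 * arguments mirror run->io.data_offset, .port, .size and .count. */
static void pio_out_for_each(const uint8_t *run_base, uint64_t data_offset,
                             uint16_t port, uint8_t size, uint32_t count,
                             pio_out_fn fn, void *ctx)
{
    assert(size == 1 || size == 2 || size == 4);
    const uint8_t *p = run_base + data_offset;
    for (uint32_t i = 0; i < count; i++, p += size) {
        uint32_t value = 0;
        memcpy(&value, p, size);   /* little-endian host assumed (x86) */
        fn(port, value, ctx);
    }
}

/* Demo sink: records the first few values it is handed. */
struct pio_log { uint16_t port; uint32_t values[8]; size_t n; };

static void pio_log_record(uint16_t port, uint32_t value, void *ctx)
{
    struct pio_log *log = ctx;
    log->port = port;
    if (log->n < 8)
        log->values[log->n] = value;
    log->n++;
}
```

In a real handler the call would look roughly like pio_out_for_each((uint8_t *)run, run->io.data_offset, run->io.port, run->io.size, run->io.count, handler, state).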
2.5 Exit Reasons -- Complete Handler Guide
while (1) {
    int ret = ioctl(vcpu_fd, KVM_RUN, 0);
    if (ret < 0) {
        if (errno == EINTR)
            continue; // kicked by a signal (exit_reason is KVM_EXIT_INTR); re-enter the guest
        perror("KVM_RUN failed");
        break;
    }
switch (run->exit_reason) {
case KVM_EXIT_IO: {
// Guest executed IN or OUT instruction on an I/O port not handled
// by an in-kernel device or IOEVENTFD.
//
// Port I/O is the primary mechanism for legacy x86 device communication.
// Common ports:
// 0x3F8-0x3FF: COM1 (serial port)
// 0x60, 0x64: PS/2 keyboard controller
// 0x20, 0x21: PIC master
// 0xA0, 0xA1: PIC slave
// 0xCF8, 0xCFC: PCI config space access
// 0x40-0x43: PIT (i8254)
// 0x70, 0x71: CMOS/RTC
uint8_t *data = (uint8_t *)run + run->io.data_offset;
if (run->io.direction == KVM_EXIT_IO_OUT) {
// Guest is writing data to a port
if (run->io.port == 0x3F8 && run->io.size == 1) {
// COM1 output -- simplest "display" for a minimal VMM.
// A REP OUTSB delivers run->io.count bytes in a single exit.
write(STDOUT_FILENO, data, run->io.count);
}
// Handle other ports (PCI config writes, etc.)
} else {
// Guest is reading from a port (KVM_EXIT_IO_IN)
// VMM must fill in the data buffer with the response
memset(data, 0xFF, (size_t)run->io.size * run->io.count); // default: all 1s (nothing connected)
}
break;
}
case KVM_EXIT_MMIO: {
// Guest accessed a memory-mapped I/O region not backed by RAM.
// Any GPA not covered by KVM_SET_USER_MEMORY_REGION causes this exit.
//
// Common MMIO regions:
// 0xFEE00000: LAPIC registers (if not using in-kernel IRQCHIP)
// 0xFEC00000: IOAPIC registers
// PCIe BAR regions mapped by the guest OS
// Virtio MMIO device regions (at addresses you choose)
if (run->mmio.is_write) {
handle_mmio_write(run->mmio.phys_addr, run->mmio.data, run->mmio.len);
} else {
handle_mmio_read(run->mmio.phys_addr, run->mmio.data, run->mmio.len);
}
break;
}
case KVM_EXIT_HLT:
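// Guest executed HLT instruction (idle). With an in-kernel irqchip, KVM normally
// blocks the vCPU itself, so this exit is mainly seen when the LAPIC lives in userspace.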
// Options:
// 1. Wait for an interrupt (poll or epoll on IRQFD eventfds)
// 2. If no interrupts expected, the VM is halted -- exit the loop
// 3. Use KVM's halt-polling (kernel-side busy-wait before sleeping)
// If using a multi-vCPU VM, only this vCPU is halted; others continue.
break;
case KVM_EXIT_SHUTDOWN:
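// On x86 this reports a triple fault (the CPU entered shutdown state).
// This is usually a bug in guest code (bad IDT, bad GDT, etc.), although some
// guests triple-fault deliberately to force a reset. An ACPI power-off does not
// arrive here; it is a port write to the PM control register that the VMM emulates.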
fprintf(stderr, "VM shutdown (triple fault)\n");
return;
case KVM_EXIT_FAIL_ENTRY: {
// KVM could not enter guest mode at all.
// Common causes:
// - Invalid guest state (bad segment registers, bad CR0/CR4 combination)
// - VMCS/VMCB misconfiguration
// run->fail_entry.hardware_entry_failure_reason contains
// the hardware-specific error code (Intel VMCS exit reason field).
fprintf(stderr, "KVM_EXIT_FAIL_ENTRY: reason=0x%llx\n",
run->fail_entry.hardware_entry_failure_reason);
return;
}
case KVM_EXIT_INTERNAL_ERROR: {
// KVM internal error -- typically emulation failure.
// suberror values:
// KVM_INTERNAL_ERROR_EMULATION (1): KVM could not emulate an instruction
// KVM_INTERNAL_ERROR_SIMUL_EX (2): simultaneous exceptions
// KVM_INTERNAL_ERROR_DELIVERY_EV (3): error during event delivery
// KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON (4): hardware bug?
fprintf(stderr, "KVM internal error: suberror=%u\n",
run->internal.suberror);
for (uint32_t i = 0; i < run->internal.ndata; i++)
fprintf(stderr, " data[%u] = 0x%llx\n", i, (unsigned long long)run->internal.data[i]);
return;
}
case KVM_EXIT_SYSTEM_EVENT:
// Guest requested a system-level event, for example a PSCI SYSTEM_OFF or
// SYSTEM_RESET call on arm64, or a crash report from a Hyper-V-enlightened guest on x86.
// type: KVM_SYSTEM_EVENT_SHUTDOWN, _RESET, _CRASH, ...
if (run->system_event.type == KVM_SYSTEM_EVENT_SHUTDOWN) {
printf("Guest requested shutdown\n");
return;
} else if (run->system_event.type == KVM_SYSTEM_EVENT_RESET) {
printf("Guest requested reset\n");
// Reset VM state and re-enter
}
break;
case KVM_EXIT_IOAPIC_EOI:
// Guest wrote to the LAPIC EOI register for a level-triggered interrupt.
// Only occurs with split IRQCHIP mode.
// VMM should deassert the interrupt and optionally re-trigger if still pending.
handle_eoi(run->eoi.vector);
break;
case KVM_EXIT_HYPERV:
// Hyper-V specific exits (synthetic MSR access, hypercalls, synic).
// Used when emulating Hyper-V enlightenments for Windows guests.
break;
case KVM_EXIT_DEBUG:
// Guest hit a debug breakpoint/watchpoint set via KVM_SET_GUEST_DEBUG.
break;
case KVM_EXIT_X86_RDMSR:
case KVM_EXIT_X86_WRMSR:
// MSR access that KVM doesn't handle and user_msr filtering is enabled
// (KVM_CAP_X86_USER_SPACE_MSR).
break;
default:
fprintf(stderr, "Unhandled exit reason: %u\n", run->exit_reason);
return;
}
}
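The loop above calls handle_mmio_read and handle_mmio_write without showing how an address finds its device. One common approach, sketched here under assumptions, is a table of address ranges with per-device callbacks. All names (mmio_region, mmio_find, mmio_dispatch, scratch_dev) are hypothetical, and a real VMM would likely use a sorted structure rather than a linear scan.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct mmio_region {
    uint64_t base, size;
    void (*read)(void *dev, uint64_t offset, uint8_t *data, uint32_t len);
    void (*write)(void *dev, uint64_t offset, const uint8_t *data, uint32_t len);
    void *dev;
};

/* Find the region that fully contains [gpa, gpa + len). */
static const struct mmio_region *mmio_find(const struct mmio_region *r, size_t n,
                                           uint64_t gpa, uint32_t len)
{
    for (size_t i = 0; i < n; i++)
        if (gpa >= r[i].base && len <= r[i].size &&
            gpa - r[i].base <= r[i].size - len)
            return &r[i];
    return NULL;
}

/* Returns 1 if a device handled the access. Reads that hit no device
 * are filled with all ones; writes that hit no device are dropped. */
static int mmio_dispatch(const struct mmio_region *r, size_t n, uint64_t gpa,
                         uint8_t *data, uint32_t len, int is_write)
{
    assert(len >= 1 && len <= 8);            /* kvm_run.mmio.data holds 8 bytes */
    const struct mmio_region *hit = mmio_find(r, n, gpa, len);
    if (!hit) {
        if (!is_write)
            memset(data, 0xFF, len);
        return 0;
    }
    if (is_write)
        hit->write(hit->dev, gpa - hit->base, data, len);
    else
        hit->read(hit->dev, gpa - hit->base, data, len);
    return 1;
}

/* Demo device: a 16-byte scratch register file. */
struct scratch_dev { uint8_t regs[16]; };

static void scratch_read(void *dev, uint64_t off, uint8_t *data, uint32_t len)
{
    memcpy(data, ((struct scratch_dev *)dev)->regs + off, len);
}

static void scratch_write(void *dev, uint64_t off, const uint8_t *data, uint32_t len)
{
    memcpy(((struct scratch_dev *)dev)->regs + off, data, len);
}
```

From the KVM_EXIT_MMIO case this would be called as mmio_dispatch(map, n, run->mmio.phys_addr, run->mmio.data, run->mmio.len, run->mmio.is_write).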
2.6 Important KVM Capabilities
Key KVM_CHECK_EXTENSION values and their meaning:
| Capability | Value | Meaning |
|---|---|---|
| KVM_CAP_IRQCHIP | 0 or 1 | In-kernel APIC/IOAPIC/PIC emulation |
| KVM_CAP_PIT2 | 0 or 1 | In-kernel i8254 PIT |
| KVM_CAP_USER_MEMORY | 0 or 1 | KVM_SET_USER_MEMORY_REGION support |
| KVM_CAP_SET_TSS_ADDR | 0 or 1 | KVM_SET_TSS_ADDR support (x86) |
| KVM_CAP_EXT_CPUID | 0 or 1 | Extended CPUID (KVM_GET_SUPPORTED_CPUID) |
| KVM_CAP_NR_VCPUS | N | Recommended max vCPUs (soft limit) |
| KVM_CAP_MAX_VCPUS | N | Hard max vCPUs (up to 1024+) |
| KVM_CAP_IRQFD | 0 or 1 | IRQ injection via eventfd |
| KVM_CAP_IOEVENTFD | 0 or 1 | I/O event notification via eventfd |
| KVM_CAP_SPLIT_IRQCHIP | 0 or 1 | Userspace IOAPIC + in-kernel LAPIC |
| KVM_CAP_IMMEDIATE_EXIT | 0 or 1 | kvm_run.immediate_exit field support |
| KVM_CAP_TSC_CONTROL | 0 or 1 | TSC frequency scaling |
| KVM_CAP_TSC_DEADLINE_TIMER | 0 or 1 | TSC deadline LAPIC timer mode |
| KVM_CAP_READONLY_MEM | 0 or 1 | KVM_MEM_READONLY flag support |
| KVM_CAP_DIRTY_LOG_RING | N | Ring-based dirty page tracking (N = maximum ring size in bytes) |
| KVM_CAP_X86_USER_SPACE_MSR | 0 or 1 | Forward unhandled MSR access to userspace |
| KVM_CAP_SGX_ATTRIBUTE | 0 or 1 | SGX virtualization support |
| KVM_CAP_MULTI_ADDRESS_SPACE | N | Number of separate address spaces per VM (used for SMM on x86) |
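The two vCPU-count capabilities in the table have documented fallbacks: if KVM_CAP_NR_VCPUS is absent, assume 4; if KVM_CAP_MAX_VCPUS is absent, assume it equals the KVM_CAP_NR_VCPUS value. A small sketch of that rule follows. The struct and function names are hypothetical, and the inputs are simply the integers returned by the two KVM_CHECK_EXTENSION calls.

```c
#include <assert.h>

struct vcpu_limits {
    int recommended;   /* soft limit, from KVM_CAP_NR_VCPUS */
    int hard_max;      /* hard limit, from KVM_CAP_MAX_VCPUS */
};

/* nr_ret and max_ret are the raw KVM_CHECK_EXTENSION results;
 * a value <= 0 means the capability is not reported. */
static struct vcpu_limits vcpu_limits_from_caps(int nr_ret, int max_ret)
{
    struct vcpu_limits l;
    l.recommended = nr_ret > 0 ? nr_ret : 4;
    l.hard_max = max_ret > 0 ? max_ret : l.recommended;
    return l;
}

/* Clamp a requested vCPU count to the hard limit. */
static int clamp_vcpu_count(int requested, struct vcpu_limits l)
{
    assert(requested > 0);
    return requested > l.hard_max ? l.hard_max : requested;
}
```

A VMM might additionally warn when the requested count exceeds recommended but stays below hard_max, since that configuration is allowed but not the tested default.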
3. Hardware Virtualization Support
3.1 Intel VT-x (VMX)
Intel VT-x introduces two CPU modes:
┌─────────────────────────────────┐
│ VMX Root Mode (Host) │
│ Ring 0: Hypervisor/KVM │
│ Ring 3: VMM (QEMU, etc.) │
│ │
│ ┌──────────┐ ┌──────────┐ │
│ │ VMLAUNCH │ │VM Exit │ │
│ │ VMRESUME │ │(automatic│ │
│ │ │ │ trap) │ │
│ └────┬─────┘ └────▲─────┘ │
└───────┼─────────────┼──────────┘
│ │
▼ │
┌───────┴─────────────┴──────────┐
│ VMX Non-Root Mode (Guest) │
│ Ring 0: Guest kernel │
│ Ring 3: Guest applications │
│ │
│ Has its OWN ring 0/3 but │
│ certain instructions cause │
│ automatic VM exits │
└─────────────────────────────────┘
VMCS (Virtual Machine Control Structure):
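The VMCS is a per-vCPU region of up to 4KB that controls VMX operation. Apart from the revision identifier and abort indicator at its start, the in-memory format is implementation-specific, so software accesses fields through VMREAD/VMWRITE rather than plain loads and stores. It contains: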
VMCS Layout (Intel SDM Vol. 3, Chapter 24):
┌──────────────────────────────────────────────────────────┐
│ VMCS Header │
│ Revision ID, abort indicator, VMCS state │
├──────────────────────────────────────────────────────────┤
│ Guest-State Area │
│ CR0, CR3, CR4, DR7 │
│ RSP, RIP, RFLAGS │
│ CS/SS/DS/ES/FS/GS/LDTR/TR (sel, base, limit, access) │
│ GDTR, IDTR (base, limit) │
│ MSRs: IA32_DEBUGCTL, IA32_SYSENTER_*, IA32_EFER, │
│ IA32_PAT, IA32_PERF_GLOBAL_CTRL │
│ SMBASE, activity state, interruptibility state │
│ Pending debug exceptions, VMCS link pointer │
│ Preemption timer value, PDPTEs (if PAE), guest CET │
├──────────────────────────────────────────────────────────┤
│ Host-State Area │
│ CR0, CR3, CR4 │
│ RSP, RIP (host entry point after VM exit) │
│ CS/SS/DS/ES/FS/GS/TR selectors, FS/GS/TR bases │
│ GDTR base, IDTR base │
│ MSRs: IA32_SYSENTER_*, IA32_EFER, IA32_PAT, │
│ IA32_PERF_GLOBAL_CTRL │
├──────────────────────────────────────────────────────────┤
│ VM-Execution Control Fields │
│ Pin-based controls: │
│ External interrupt exiting, NMI exiting, │
│ virtual NMIs, preemption timer, posted interrupts │
│ Processor-based controls (primary): │
│ HLT exiting, INVLPG exiting, MWAIT exiting, │
│ RDPMC exiting, RDTSC exiting, CR3-load/store exit, │
│ MOV-DR exiting, I/O bitmap, MSR bitmap, │
│ use TPR shadow, activate secondary controls │
│ Processor-based controls (secondary): │
│ Virtualize APIC, enable EPT, descriptor-table exit, │
│ RDTSCP, virtualize x2APIC, enable VPID, WBINVD, │
│ unrestricted guest, APIC register virtualization, │
│ virtual interrupt delivery, PAUSE-loop exiting, │
│ RDRAND exiting, INVPCID, VMFUNC, ENCLS exiting, │
│ SPP (Sub-Page Permission), PT uses guest phys addr, │
│ TSC scaling, WAITPKG exiting, ENCLV exiting │
│ Exception bitmap: 32-bit bitmap (1 = exit on this exception) │
│ I/O bitmap addresses (A and B) │
│ MSR bitmap address │
│ EPT pointer (EPTP) │
│ VPID │
│ PLE gap, PLE window (pause-loop exiting) │
├──────────────────────────────────────────────────────────┤
│ VM-Exit Control Fields │
│ VM exit controls: save debug controls, host address- │
│ space size (64-bit host), load IA32_PERF/PAT/EFER, │
│ acknowledge interrupt on exit, save/clear IA32_BNDCFGS│
│ VM exit MSR-store/load count and addresses │
├──────────────────────────────────────────────────────────┤
│ VM-Entry Control Fields │
│ VM entry controls: load debug controls, IA32 mode │
│ guest, entry to SMM, deactivate dual-monitor │
│ VM entry interruption-information field │
│ VM entry exception error code │
│ VM entry instruction length │
│ VM entry MSR-load count and address │
├──────────────────────────────────────────────────────────┤
│ VM-Exit Information Fields (read-only) │
│ Exit reason (basic + specific) │
│ Exit qualification (instruction-specific details) │
│ VM exit interruption information/error code │
│ IDT-vectoring information/error code │
│ VM exit instruction length/information │
│ Guest-linear/physical address │
└──────────────────────────────────────────────────────────┘
Key VMX instructions:
| Instruction | Action |
|---|---|
| VMXON | Enter VMX operation. Software must first allocate a 4KB VMXON region and pass its physical address. |
| VMXOFF | Leave VMX operation. |
| VMCLEAR | Initialize/clear a VMCS. Must be done before VMPTRLD. |
| VMPTRLD | Load a VMCS as the "current" VMCS for this logical processor. |
| VMREAD | Read a field from the current VMCS. |
| VMWRITE | Write a field to the current VMCS. |
| VMLAUNCH | Enter VMX non-root mode (first entry for this VMCS). |
| VMRESUME | Enter VMX non-root mode (subsequent entries). |
| VMCALL | Guest-to-host hypercall. Causes unconditional VM exit. |
| INVEPT | Invalidate cached translations derived from EPT. |
| INVVPID | Invalidate TLB entries tagged with a specific VPID. |
What causes VM exits (configurable in VMCS controls):
- Privileged instructions: CPUID (always), HLT (configurable), INVLPG, MOV CR, MOV DR, IN/OUT, RDMSR/WRMSR (configurable via MSR bitmap), LGDT/LIDT/LLDT/LTR, SGDT/SIDT/SLDT/STR
- Memory: EPT violation (unmapped GPA), EPT misconfiguration
- Interrupts: external interrupt, NMI (configurable)
- Special: triple fault, INIT, SIPI, VMCALL, preemption timer, MONITOR/MWAIT, PAUSE (configurable via PLE), RDTSC/RDTSCP (configurable)
Unrestricted Guest Mode (VT-x + EPT on Westmere and later): Allows running real-mode and unpaged protected-mode guest code without emulation. Before this, KVM had to emulate real mode in software. This is why KVM_SET_TSS_ADDR is required -- KVM uses it for real-mode emulation on older hardware.
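Each of the exit causes listed above is reported through the VMCS exit-reason field, which is also the value surfaced in hardware_entry_failure_reason on KVM_EXIT_FAIL_ENTRY. Bits 15:0 hold the basic exit reason and bit 31 flags a VM-entry failure. The decoder below is a sketch with hypothetical function names; it names only a handful of basic reasons from the Intel SDM and leaves the rest to the manual.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct vmx_exit {
    uint16_t basic;      /* bits 15:0 of the exit-reason field */
    int entry_failure;   /* bit 31: the exit is a failed VM entry */
};

static struct vmx_exit vmx_decode_exit_reason(uint32_t raw)
{
    struct vmx_exit e = {
        .basic = (uint16_t)(raw & 0xFFFF),
        .entry_failure = (int)((raw >> 31) & 1),
    };
    return e;
}

/* Names for a subset of basic exit reasons; not exhaustive. */
static const char *vmx_basic_reason_name(uint16_t basic)
{
    switch (basic) {
    case 0:  return "exception or NMI";
    case 1:  return "external interrupt";
    case 2:  return "triple fault";
    case 10: return "CPUID";
    case 12: return "HLT";
    case 18: return "VMCALL";
    case 28: return "control-register access";
    case 30: return "I/O instruction";
    case 31: return "RDMSR";
    case 32: return "WRMSR";
    case 33: return "VM-entry failure: invalid guest state";
    case 34: return "VM-entry failure: MSR loading";
    case 48: return "EPT violation";
    case 49: return "EPT misconfiguration";
    default: return "other (see the Intel SDM exit-reason table)";
    }
}
```

For example, a raw value of 0x80000021 decodes to a VM-entry failure with basic reason 33, which usually points at inconsistent segment or control-register state set through KVM_SET_SREGS.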
3.2 AMD-V (SVM)
AMD-V (also called SVM -- Secure Virtual Machine) is architecturally similar to VT-x but uses different data structures:
AMD-V vs Intel VT-x Comparison:
| Feature | Intel VT-x | AMD-V (SVM) |
|---|---|---|
| Control struct | VMCS (4KB, hardware-managed, VMREAD/VMWRITE access) | VMCB (4KB, normal memory, direct load/store access) |
| Enter guest | VMLAUNCH / VMRESUME | VMRUN |
| Guest -> host | VM exit (automatic) | #VMEXIT (automatic) |
| Hypercall | VMCALL | VMMCALL |
| Enable | VMXON | set EFER.SVME |
| EPT equivalent | EPT (Extended Page Tables), EPTP in VMCS | NPT (Nested Page Tables), nCR3 in VMCB |
| TLB tagging | VPID | ASID (Address Space ID) |
| Posted interrupts | Posted Interrupts (PI descriptor) | AVIC (AMD Virtual Interrupt Controller) |
| Nested virt | VMCS shadowing | Nested SVM (VMCB caching) |
VMCB (Virtual Machine Control Block):
// The VMCB is a 4KB page with two areas:
// Offset 0x000-0x3FF: Control area
// Offset 0x400-0xFFF: State save area (guest register state)
// Key control area fields (simplified sketch; order and widths do not match the hardware layout):
struct vmcb_control {
uint32_t intercept_cr; // CR read/write intercepts (bitmap)
uint32_t intercept_dr; // DR read/write intercepts
uint32_t intercept_exceptions; // exception intercepts (bitmap, like VMCS exception bitmap)
uint64_t intercepts; // miscellaneous intercepts (INTR, NMI, SMI, INIT,
// VINTR, CR0 selective, RDIDTR, RDGDTR, RDLDTR, RDTR,
// RDTSC, RDPMC, PUSHF, POPF, CPUID, RSM, IRET,
// INTn, INVD, PAUSE, HLT, INVLPG, INVLPGA,
// I/O bitmap, MSR bitmap, task switch, FERR freeze,
// shutdown, VMRUN, VMMCALL, VMLOAD, VMSAVE, STGI, CLGI,
// SKINIT, RDTSCP, ICEBP, WBINVD, MONITOR, MWAIT,
// XSETBV, RDPRU, EFER write after finish, CR0-15 write after)
uint64_t iopm_base_pa; // physical address of I/O permission bitmap (12KB, 3 pages)
uint64_t msrpm_base_pa; // physical address of MSR permission bitmap (8KB, 2 pages)
uint64_t tsc_offset; // added to RDTSC/RDTSCP results
uint32_t guest_asid; // ASID for TLB tagging (must be non-zero)
uint8_t tlb_ctl; // TLB flush control on VMRUN
uint8_t v_intr; // virtual interrupt control
uint64_t exitcode; // exit reason (set by hardware on #VMEXIT)
uint64_t exitinfo1; // additional exit info (qualification)
uint64_t exitinfo2; // additional exit info
uint64_t n_cr3; // Nested page table root (NPT CR3)
uint64_t avic_backing_page; // AVIC backing page physical address
// ... more fields
};
Advantages of VMCB over VMCS:
- VMCB is regular memory -- you can memcpy it, inspect it with gdb, no special instructions needed
- VMCS requires VMREAD/VMWRITE for every field access (more complex code)
- VMCB makes debugging easier and nested virtualization simpler
3.3 ARM Virtualization (EL2)
ARM's virtualization is built into the exception level hierarchy:
ARM Exception Levels with Virtualization:
┌──────────────────────────────────────────────────┐
│ EL3: Secure Monitor (TrustZone) │
│ - SMC handling, secure/non-secure world switch │
├──────────────────────────────────────────────────┤
│ EL2: Hypervisor │
│ - Stage-2 page tables (IPA -> PA) │
│ - Trap/emulate sensitive instructions │
│ - HVC (Hypervisor Call) handling │
│ - VTTBR_EL2: guest stage-2 table base │
│ - HCR_EL2: hypervisor configuration register │
│ - VBAR_EL2: hypervisor exception vectors │
├──────────────────────────────────────────────────┤
│ EL1: Guest OS kernel │
│ - Full kernel privileges (but trapped by EL2) │
│ - Stage-1 page tables (VA -> IPA) │
│ - System registers trapped as configured │
├──────────────────────────────────────────────────┤
│ EL0: Guest userspace │
│ - Applications │
└──────────────────────────────────────────────────┘
VHE (Virtualized Host Extensions) -- ARMv8.1:
Without VHE, the host kernel runs at EL1 and the hypervisor runs at EL2. On every VM exit, KVM has to save/restore the host kernel's EL1 state. VHE allows the host kernel to run directly at EL2, eliminating this overhead:
Without VHE: With VHE (ARMv8.1+):
EL2: KVM "stub" EL2: Host kernel + KVM
▲ ▲
│ HVC / trap │ trap (direct)
│ │
EL1: Host kernel EL1: Guest kernel
or Guest kernel (only guests use EL1)
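VHE is detected via ID_AA64MMFR1_EL1.VH. KVM on arm64 uses VHE by default when it is available; the main exception is protected KVM (pKVM), which deliberately keeps the host kernel at EL1. The VM exit cost drops by roughly 30-50% with VHE, depending on the core and workload.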
Stage-2 Translation (ARM's EPT equivalent):
Guest VA ──[TTBR0_EL1/TTBR1_EL1]──> IPA ──[VTTBR_EL2]──> PA
- Stage-1: Guest-controlled (VA -> IPA). Guest OS sets these page tables.
- Stage-2: Hypervisor-controlled (IPA -> PA). KVM sets these page tables.
- Hardware walks both stages automatically (2D page table walk).
- Permission intersection: the final permission is the most restrictive
of Stage-1 and Stage-2.
KVM on ARM source: arch/arm64/kvm/ -- key files:
- hyp/ -- EL2 code that runs at hypervisor privilege
- handle_exit.c -- ARM VM exit handler dispatch
- mmu.c -- Stage-2 page table management
- sys_regs.c -- System register emulation/trapping
3.4 RISC-V H Extension
The RISC-V Hypervisor extension adds a two-level address translation similar to Intel EPT and ARM Stage-2:
RISC-V Privilege Modes with H Extension:
M-mode (Machine) ─── firmware / SBI
│
HS-mode (Hypervisor-extended Supervisor) ─── host kernel + KVM
│
VS-mode (Virtual Supervisor) ─── guest kernel
│
VU-mode (Virtual User) ─── guest applications
Key CSRs:
- hgatp -- Guest address translation pointer (like EPTP / VTTBR_EL2)
- hstatus -- Hypervisor status register
- htval -- Trap value (faulting guest physical address, shifted right by 2 bits)
- htinst -- Trapped instruction (helps emulation without fetching from guest memory)
- hedeleg/hideleg -- Exception/interrupt delegation from HS to VS mode
- hvip -- Hypervisor virtual interrupt pending (inject interrupts into guest)
Special instructions:
- HLV.B/H/W/D -- Hypervisor Load Virtual: load from guest address space while in HS-mode
- HSV.B/H/W/D -- Hypervisor Store Virtual: store to guest address space
- HFENCE.VVMA -- Flush VS-stage (guest virtual address) translations (like INVVPID)
- HFENCE.GVMA -- Flush G-stage (guest physical address) translations (like INVEPT)
KVM on RISC-V source: arch/riscv/kvm/
3.5 EPT / NPT (Extended / Nested Page Tables)
EPT (Intel) and NPT (AMD) solve the same problem: eliminating shadow page tables by providing hardware-assisted two-dimensional page table walks.
Without EPT (shadow page tables):
Guest VA ──[Guest PT]──> GPA ──[Shadow PT maintained by KVM]──> HPA
Problems:
- KVM must intercept ALL guest page table modifications
- KVM must maintain a "shadow" page table that maps GVA->HPA
- Every CR3 load, INVLPG, or page table write causes a VM exit
- Extremely expensive for workloads with heavy page table activity
With EPT/NPT (hardware 2D walk):
Guest VA ──[Guest PT]──> GPA ──[EPT/NPT, hardware-walked]──> HPA
Benefits:
- Guest can modify its own page tables WITHOUT VM exits
- CR3 loads and INVLPG are handled in hardware
- KVM only manages the EPT/NPT tables (GPA->HPA mapping)
- Eliminates the entire shadow page table machinery
Cost:
- Page table walk is deeper: each level of guest PT requires
a full EPT walk to translate the guest PT page's GPA to HPA
- Worst case: 4-level guest PT with 4-level EPT = (4+1) * (4+1) - 1 = 24 memory accesses
per TLB miss (vs 4 with flat page tables)
- Mitigated by TLB caching and page walk caches
EPT Page Table Structure:
EPT is a standard 4-level page table (like x86-64 regular page tables):
PML4 (512 entries) -> PDPT (512) -> PD (512) -> PT (512) -> 4KB page
PDPT entry with bit 7 set -> 1GB huge page
PD entry with bit 7 set -> 2MB huge page
EPT PTE format (64-bit):
┌──────────────────────────────────────────────────────────────────┐
│ 63-52: ignored │ 51-12: HPA page frame │ 11-10: ignored │ 9: dirty │ 8: accessed │
│ 7: large page │ 6: ignore PAT │ 5-3: memory type │ 2: X │ 1: W │ 0: R │
└──────────────────────────────────────────────────────────────────┘
Bits 2:0 (R/W/X): read, write, execute permissions
Bits 5:3: EPT memory type (UC=0, WC=1, WT=4, WP=5, WB=6)
Bit 7: 1 = this is a large page (1GB at PDPT level, 2MB at PD level)
Bit 8: accessed (if enabled via EPTP)
Bit 9: dirty (if enabled via EPTP)
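The bit positions listed above can be packed with a few shifts and masks. The following is a minimal sketch, not KVM's code: `ept_make_leaf`, `ept_is_present`, and the `EPT_*` constants are hypothetical names chosen for illustration, and the upper control bits (and accessed/dirty handling) are ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the bit positions listed above. */
#define EPT_READ          (1ULL << 0)
#define EPT_WRITE         (1ULL << 1)
#define EPT_EXEC          (1ULL << 2)
#define EPT_MEMTYPE_SHIFT 3                      /* memory type lives in bits 5:3 */
#define EPT_MEMTYPE_WB    6ULL                   /* write-back */
#define EPT_LARGE_PAGE    (1ULL << 7)            /* only valid at PDPT/PD level */
#define EPT_ADDR_MASK     0x000FFFFFFFFFF000ULL  /* bits 51:12 */

/* Build a leaf entry that maps to host physical address hpa. */
static uint64_t ept_make_leaf(uint64_t hpa, uint64_t perms,
                              uint64_t memtype, int large)
{
    assert((hpa & ~EPT_ADDR_MASK) == 0);  /* page aligned and below 2^52 */
    assert(memtype <= 7);

    uint64_t pte = hpa | (perms & 7) | (memtype << EPT_MEMTYPE_SHIFT);
    if (large)
        pte |= EPT_LARGE_PAGE;
    return pte;
}

/* With R, W and X all clear, the entry is treated as not present. */
static int ept_is_present(uint64_t pte)
{
    return (pte & 7) != 0;
}
```

For example, `ept_make_leaf(0x1234000, EPT_READ | EPT_WRITE | EPT_EXEC, EPT_MEMTYPE_WB, 0)` yields `0x1234037`: the frame address, memory type 6 in bits 5:3, and all three permission bits.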
EPT violations (the EPT equivalent of page faults) cause VM exits with exit qualification bits indicating:
- Was it a read, write, or instruction fetch?
- Which permissions (R/W/X) did the EPT entries grant for that GPA?
- Did the access happen during the guest's own page-table walk, or on the final translated address?
KVM handles EPT violations by:
- Looking up the faulting GPA in memory slots
- If found: installing an EPT mapping (GPA -> HPA from the slot's mmap'd memory)
- If not found: forwarding as MMIO to userspace
- If permissions mismatch: handling dirty/accessed tracking, or MMIO
3.6 VPID (Virtual Processor Identifier)
Without VPID, every VM entry/exit flushes the entire TLB. VPID tags TLB entries with a per-vCPU identifier so the CPU can distinguish between different VMs' translations:
TLB Entry with VPID:
[VPID | GVA | GPA | HPA | permissions]
- VPID 0 = VMX root mode (host)
- VPID 1-65535 = guest vCPUs
- On VM switch, no TLB flush needed -- entries for other VPIDs are ignored
- INVVPID instruction for selective invalidation:
- Individual address: flush one GVA for one VPID
- Single context: flush all entries for one VPID
- All contexts: flush all non-zero VPIDs
- Single context retaining globals: flush all non-global entries for one VPID
KVM allocates a VPID per vCPU and never reuses VPIDs while a vCPU exists.
3.7 Posted Interrupts
Posted Interrupts allow an external interrupt to be delivered directly to a guest vCPU without causing a VM exit. This is critical for VFIO device passthrough performance.
Traditional Interrupt Flow (causes VM exit):
Device ──IRQ──> Host LAPIC ──VMEXIT──> KVM ──inject──> Guest LAPIC ──> Guest ISR
Posted Interrupt Flow (no VM exit):
Device ──IRQ──> Posted Interrupt Descriptor ──direct──> Guest LAPIC ──> Guest ISR
(in memory) (no VM exit!)
Posted Interrupt Descriptor (PID) layout:
┌──────────────────────────────────────────────────────────────┐
│ Bits 255:0: Posted Interrupt Requests (PIR) │
│ 256-bit bitmap, one bit per interrupt vector │
│ Bit 256: Outstanding Notification (ON) │
│ Set when a new interrupt is posted, cleared by hardware │
│ Bit 257: Suppress Notification (SN) │
│ If set, don't send notification (vCPU is not running) │
│ Bits 271:258: reserved │
│ Bits 279:272: Notification Vector (NV) │
│ Vector for the notification interrupt │
│ Bits 287:280: reserved │
│ Bits 319:288: Notification Destination (NDST) │
│ APIC ID of the physical CPU running this vCPU │
└──────────────────────────────────────────────────────────────┘
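Requirements: VT-x with APIC virtualization: the "process posted interrupts" pin-based VM-execution control, together with "virtual-interrupt delivery" in the secondary execution controls and "acknowledge interrupt on exit" in the VM-exit controls.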
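The descriptor layout above maps naturally onto a 64-byte structure updated with atomic bit operations. The sketch below models the software posting side only, under simplifying assumptions: `struct pi_desc_sketch`, `pi_post`, and the `PI_*` constants are hypothetical names, and the SN check is placed in software here purely to illustrate the field's meaning.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* 64-byte descriptor mirroring the layout above (hypothetical type name). */
struct pi_desc_sketch {
    _Atomic uint64_t pir[4];   /* bits 255:0, one bit per vector */
    _Atomic uint64_t control;  /* bit 0 = ON, bit 1 = SN, 23:16 = NV, 63:32 = NDST */
    uint64_t reserved[3];
};

#define PI_ON (1ULL << 0)      /* descriptor bit 256 */
#define PI_SN (1ULL << 1)      /* descriptor bit 257 */

/*
 * Post a vector into the descriptor. Returns 1 if the caller should send
 * the notification IPI (vector NV) to the CPU named by NDST, 0 otherwise.
 */
static int pi_post(struct pi_desc_sketch *pd, uint8_t vector)
{
    atomic_fetch_or(&pd->pir[vector / 64], 1ULL << (vector % 64));

    if (atomic_load(&pd->control) & PI_SN)
        return 0;              /* notifications suppressed: vCPU not running */

    /* Only the poster that flips ON from 0 to 1 needs to notify. */
    uint64_t old = atomic_fetch_or(&pd->control, PI_ON);
    return (old & PI_ON) == 0;
}
```

The interrupt stays recorded in PIR even when no notification is sent, so it is picked up the next time the vCPU enters the guest.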
4. Memory Virtualization
4.1 Memory Slot Architecture
KVM manages guest physical memory through "memory slots" -- regions that map a GPA range to a HVA (host virtual address) range:
KVM Memory Slot Model:
Guest Physical Address Space Host Virtual Address Space
(what the guest sees) (VMM process memory)
┌────────────────────┐ 4GB ┌────────────────────┐
│ MMIO/ROM hole │ │ │
├────────────────────┤ 3.5GB │ │
│ │ │ │
│ Slot 1: High RAM │ ───────────────> │ mmap region 2 │
│ │ │ │
├────────────────────┤ 3GB ├────────────────────┤
│ MMIO gap │ (no slot) │ │
│ (PCI BARs) │ │ │
├────────────────────┤ 0xE0000000 │ │
│ │ │ │
│ │ │ │
│ Slot 0: Low RAM │ ───────────────> │ mmap region 1 │
│ │ │ │
│ │ │ (may be backed by │
│ │ │ huge pages) │
├────────────────────┤ 0x100000 (1MB) ├────────────────────┤
│ Slot 2: VGA ROM │ ───────────────> │ ROM image │
├────────────────────┤ 0xC0000 ├────────────────────┤
│ Slot 3: BIOS ROM │ ───────────────> │ BIOS image │
├────────────────────┤ 0x0 ├────────────────────┤
└────────────────────┘ └────────────────────┘
Memory slot management:
// Allocate guest RAM with huge page backing
void *guest_mem = mmap(NULL, guest_ram_size,
PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
-1, 0);
// Request huge pages (2MB) for performance
madvise(guest_mem, guest_ram_size, MADV_HUGEPAGE);
// Alternative to the anonymous mapping above: explicit hugetlbfs backing
// (guest_ram_size must be a multiple of the huge page size)
int hugefd = open("/dev/hugepages/guest-ram", O_CREAT | O_RDWR, 0600);
ftruncate(hugefd, guest_ram_size);
void *guest_mem_huge = mmap(NULL, guest_ram_size, PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_POPULATE, hugefd, 0);
// Register with KVM
struct kvm_userspace_memory_region region = {
.slot = 0,
.flags = 0,
.guest_phys_addr = 0,
.memory_size = guest_ram_size,
.userspace_addr = (uint64_t)guest_mem,
};
ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
// To delete a slot: set memory_size to 0
region.memory_size = 0;
ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
// To update a slot (e.g., enable dirty logging):
region.memory_size = guest_ram_size;
region.flags = KVM_MEM_LOG_DIRTY_PAGES;
ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
// Read-only memory (for ROM):
struct kvm_userspace_memory_region rom_region = {
.slot = 2,
.flags = KVM_MEM_READONLY,
.guest_phys_addr = 0xFFFC0000, // 256KB below 4GB
.memory_size = 256 * 1024,
.userspace_addr = (uint64_t)bios_image,
};
ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &rom_region);
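A VMM usually keeps its own copy of the slot table, because device emulation constantly needs to turn a guest physical address (for example a virtio buffer address) into a host pointer. The following is a minimal sketch of that lookup; `struct vmm_memslot` and `gpa_to_hva` are hypothetical names, not part of the KVM API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace-side record of one registered slot (hypothetical type). */
struct vmm_memslot {
    uint64_t guest_phys_addr;
    uint64_t memory_size;
    void    *userspace_addr;
};

/*
 * Translate the GPA range [gpa, gpa + len) to a host pointer.
 * Returns NULL unless the whole range lies inside a single slot.
 */
static void *gpa_to_hva(const struct vmm_memslot *slots, size_t nslots,
                        uint64_t gpa, uint64_t len)
{
    for (size_t i = 0; i < nslots; i++) {
        const struct vmm_memslot *s = &slots[i];
        if (gpa < s->guest_phys_addr)
            continue;
        uint64_t off = gpa - s->guest_phys_addr;
        /* Overflow-safe form of: off + len <= memory_size */
        if (off < s->memory_size && len <= s->memory_size - off)
            return (char *)s->userspace_addr + off;
    }
    return NULL;
}
```

Rejecting ranges that cross a slot boundary matters for safety: a guest-supplied length must never let the VMM read or write past the end of the backing mapping.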
4.2 GPA to HPA Translation Path
Full Translation Path (with EPT):
Guest executes: MOV RAX, [0x1000]
1. Guest VA (0x1000) is translated via guest's page tables:
Guest CR3 -> PML4[idx] -> PDPT[idx] -> PD[idx] -> PT[idx] -> GPA
But EACH of these page table pages is at a GPA, so each step
requires an EPT walk:
2. EPT walk for guest CR3's GPA:
EPTP -> EPT-PML4 -> EPT-PDPT -> EPT-PD -> EPT-PT -> HPA of guest CR3
3. Read guest PML4 entry from HPA, get GPA of PDPT page
4. EPT walk for PDPT page's GPA:
EPTP -> EPT-PML4 -> EPT-PDPT -> EPT-PD -> EPT-PT -> HPA of guest PDPT
5. Read guest PDPT entry, get GPA of PD page
6. EPT walk for PD page's GPA ... (repeat)
7. Read guest PD entry, get GPA of PT page
8. EPT walk for PT page's GPA ... (repeat)
9. Read guest PT entry, get final GPA of the data page
10. EPT walk for data page's GPA:
EPTP -> EPT-PML4 -> EPT-PDPT -> EPT-PD -> EPT-PT -> HPA of data
Total memory accesses on TLB miss (4-level guest + 4-level EPT):
5 EPT walks (4 guest page-table pages + the data page) * 4 EPT levels + 4 guest page-table entry reads = 24
In practice, TLBs and page walk caches make this rare.
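The count can be reproduced by following the steps above in a loop. This sketch assumes no TLB or page-walk-cache hits and 4KB pages at both levels; `nested_walk_accesses` is a hypothetical helper name.

```c
#include <assert.h>

/*
 * Memory accesses for one TLB miss under nested paging.
 * Equivalent to (guest_levels + 1) * (ept_levels + 1) - 1.
 */
static unsigned nested_walk_accesses(unsigned guest_levels, unsigned ept_levels)
{
    unsigned accesses = 0;

    /*
     * Each guest page-table page (starting with the one CR3 points to)
     * is named by a GPA: one EPT walk to find it, then one read of the
     * guest entry itself.
     */
    for (unsigned level = 0; level < guest_levels; level++)
        accesses += ept_levels + 1;

    /* The final data GPA needs one more EPT walk. */
    accesses += ept_levels;
    return accesses;
}
```

With 4 guest levels and 4 EPT levels this gives 24; with 5-level paging on both sides it grows to 35, which is one reason huge pages and page walk caches matter so much for virtualized workloads.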
4.3 EPT Violation Handling in KVM
When the guest accesses a GPA that has no EPT mapping (or the wrong permissions), an EPT violation occurs:
// In KVM kernel code (simplified from arch/x86/kvm/mmu/mmu.c):
static int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa,
u64 error_code, void *insn, int insn_len)
{
// 1. Check if GPA is in a memory slot
struct kvm_memory_slot *slot = gfn_to_memslot(vcpu->kvm, gpa >> PAGE_SHIFT);
if (!slot) {
// No slot -> this is MMIO. Let userspace handle it.
// KVM caches MMIO information in special "MMIO SPTEs" to avoid
// repeated exits for the same MMIO GPA.
return handle_mmio_page_fault(vcpu, gpa);
}
// 2. It's a real memory access. Map the page.
// Get the HPA by looking up the HVA in the VMM process's page tables.
kvm_pfn_t pfn = gfn_to_pfn(vcpu->kvm, gpa >> PAGE_SHIFT);
// 3. Install the EPT mapping
// KVM tries to use huge pages (2MB, 1GB) when possible.
// It checks alignment and whether the entire huge page range
// is within the same memory slot with uniform attributes.
int level = mapping_level(vcpu, gpa >> PAGE_SHIFT); // 1=4KB, 2=2MB, 3=1GB
// 4. Write the EPT PTE
mmu_set_spte(vcpu, gpa, pfn, level, ...);
// 5. Resume guest execution (no exit to userspace needed)
return RET_PF_FIXED; // mapping installed; the guest re-executes the faulting instruction
}
4.4 Dirty Page Tracking
KVM provides two mechanisms for tracking which guest pages have been modified:
Bitmap-based (classic):
// Enable dirty logging on a memory slot
region.flags = KVM_MEM_LOG_DIRTY_PAGES;
ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
// KVM now write-protects all EPT entries for this slot.
// When the guest writes, an EPT violation occurs:
// 1. KVM marks the page as dirty in a bitmap
// 2. KVM removes write-protection from the EPT entry
// 3. Guest resumes without userspace exit
// Periodically collect dirty pages:
struct kvm_dirty_log log = { .slot = 0 };
log.dirty_bitmap = calloc(1, bitmap_size); // 1 bit per 4KB page
ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log);
// NOTE: KVM_GET_DIRTY_LOG clears the bitmap AND re-protects all EPT entries
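// KVM_CLEAR_DIRTY_LOG (newer, more efficient; requires enabling
// KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2): KVM_GET_DIRTY_LOG then only reads
// the bitmap, and userspace re-protects a subrange explicitly.
// first_page and num_pages must be multiples of 64.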
struct kvm_clear_dirty_log clear = {
.slot = 0,
.first_page = 0,
.num_pages = 1024,
.dirty_bitmap = bitmap,
};
ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clear);
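Two details of the bitmap interface are easy to get wrong: sizing the buffer and turning set bits back into guest addresses. The sketch below shows both, assuming 4KB guest pages and a 64-bit little-endian host where the bitmap is rounded up to a multiple of 64 bits; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GUEST_PAGE_SIZE 4096ULL

/* Bytes needed for a slot's dirty bitmap: one bit per page, rounded up to 64 bits. */
static size_t dirty_bitmap_bytes(uint64_t slot_bytes)
{
    uint64_t pages = (slot_bytes + GUEST_PAGE_SIZE - 1) / GUEST_PAGE_SIZE;
    return (size_t)(((pages + 63) / 64) * 8);
}

/*
 * Call visit(gpa, ctx) for every dirty page in the bitmap (visit may be
 * NULL to just count). Returns the number of dirty pages found.
 */
static size_t for_each_dirty_page(const uint64_t *bitmap, uint64_t npages,
                                  uint64_t slot_gpa,
                                  void (*visit)(uint64_t gpa, void *ctx),
                                  void *ctx)
{
    size_t count = 0;

    for (uint64_t w = 0; w * 64 < npages; w++) {
        uint64_t word = bitmap[w];
        /* Skipping zero words quickly is what makes sparse bitmaps cheap. */
        for (unsigned b = 0; word != 0; b++, word >>= 1) {
            uint64_t page = w * 64 + b;
            if ((word & 1) && page < npages) {
                if (visit)
                    visit(slot_gpa + page * GUEST_PAGE_SIZE, ctx);
                count++;
            }
        }
    }
    return count;
}
```

A 1GB slot needs a 32KB bitmap, so for large, mostly idle guests the cost of a collection pass is dominated by scanning zero words.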
Ring-based (KVM_CAP_DIRTY_LOG_RING, Linux 5.11+ on x86):
// More efficient: KVM pushes dirty page notifications into a ring buffer
// shared with userspace, avoiding the bitmap scan overhead.
struct kvm_enable_cap cap = {
.cap = KVM_CAP_DIRTY_LOG_RING,
.args[0] = ring_size, // ring size in bytes; must be a power of 2 and no larger than the limit KVM_CHECK_EXTENSION reports
};
ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
// Ring entries:
struct kvm_dirty_gfn {
__u32 flags; // KVM_DIRTY_GFN_F_DIRTY or KVM_DIRTY_GFN_F_RESET
__u32 slot; // memory slot
__u64 offset; // page offset within slot
};
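// mmap the ring from each vCPU fd at a fixed page offset
// (a separate mapping, not an extension of the kvm_run mapping):
struct kvm_dirty_gfn *ring = mmap(NULL, ring_size, PROT_READ | PROT_WRITE,
MAP_SHARED, vcpu_fd, KVM_DIRTY_LOG_PAGE_OFFSET * page_size);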
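Harvesting the ring is a small state machine: userspace consumes entries whose flags say "dirty", marks them "reset", and later asks KVM (via the KVM_RESET_DIRTY_RINGS ioctl on the VM fd) to recycle them. The sketch below models only the userspace loop on a local mirror of the entry layout; `struct dirty_gfn`, `dirty_ring_harvest`, and the `DIRTY_GFN_F_*` constants are stand-ins for the kernel definitions so that the example compiles without kernel headers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Local mirror of struct kvm_dirty_gfn and its flag bits. */
struct dirty_gfn {
    _Atomic uint32_t flags;
    uint32_t slot;
    uint64_t offset;
};
#define DIRTY_GFN_F_DIRTY (1u << 0)
#define DIRTY_GFN_F_RESET (1u << 1)

/*
 * Collect published entries starting at *next (nentries must be a power
 * of two). Returns how many entries were harvested and advances *next.
 */
static uint32_t dirty_ring_harvest(struct dirty_gfn *ring, uint32_t nentries,
                                   uint32_t *next,
                                   void (*visit)(uint32_t slot, uint64_t offset))
{
    uint32_t harvested = 0;

    while (harvested < nentries) {
        struct dirty_gfn *e = &ring[*next & (nentries - 1)];

        /* Acquire: slot/offset must be read after the flag is seen. */
        if (atomic_load_explicit(&e->flags, memory_order_acquire)
                != DIRTY_GFN_F_DIRTY)
            break;

        if (visit)
            visit(e->slot, e->offset);

        /* Release: hand the entry back only after we finished reading it. */
        atomic_store_explicit(&e->flags, DIRTY_GFN_F_RESET,
                              memory_order_release);
        (*next)++;
        harvested++;
    }
    return harvested;
}
```

Because each vCPU has its own ring, a VMM typically runs this loop once per vCPU and then issues a single reset ioctl for the whole VM.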
4.5 Memory Ballooning
Virtio-balloon allows the host to reclaim memory from guests:
Balloon Inflation (host reclaims memory):
1. Host sends "inflate" request via virtio-balloon device
2. Guest driver allocates pages and reports their GPAs to host
3. Host calls madvise(MADV_DONTNEED) on corresponding HVAs
4. Host kernel reclaims the physical pages
Balloon Deflation (guest gets memory back):
1. Host sends "deflate" request
2. Guest driver frees the balloon pages
3. On next access, page faults bring in new physical pages
┌─────────────────────────────────────────────┐
│ Guest VM (sees 4GB RAM) │
│ │
│ ┌─────────────────────────────────────┐ │
│ │ Usable RAM (3GB) │ │
│ ├─────────────────────────────────────┤ │
│ │ Balloon (1GB) ─── reported to host │ │
│ │ (allocated by guest, unused) │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────────┘
Host physical memory:
The 1GB balloon pages have been MADV_DONTNEED'd and reclaimed.
Guest still thinks it has 4GB but 1GB is "held" by the balloon.
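On the host side, inflation boils down to translating each reported page frame number to an HVA and calling madvise(MADV_DONTNEED). The following is a minimal sketch, assuming a single contiguous RAM slot, 4KB host pages, and a RAM size that is a multiple of 4KB; `balloon_inflate` is a hypothetical helper, and the virtqueue handling that delivers the PFN array is omitted.

```c
#define _DEFAULT_SOURCE          /* for madvise() and MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define BALLOON_PFN_SHIFT 12     /* virtio-balloon reports 4KB page frame numbers */

/*
 * Release the host pages behind the guest page frames in pfns[].
 * guest_mem is the HVA at which GPA ram_base is mapped.
 * Returns 0 on success, -1 on an out-of-range PFN or madvise failure.
 */
static int balloon_inflate(void *guest_mem, uint64_t ram_base,
                           uint64_t ram_size,
                           const uint32_t *pfns, size_t npfns)
{
    for (size_t i = 0; i < npfns; i++) {
        uint64_t gpa = (uint64_t)pfns[i] << BALLOON_PFN_SHIFT;

        /* Never trust guest-supplied addresses: bounds-check every PFN. */
        if (gpa < ram_base || gpa - ram_base >= ram_size)
            return -1;

        void *hva = (char *)guest_mem + (gpa - ram_base);
        if (madvise(hva, 1UL << BALLOON_PFN_SHIFT, MADV_DONTNEED) != 0)
            return -1;
    }
    return 0;
}
```

For private anonymous memory, the released page reads back as zeros on the next access, which is exactly the "page faults bring in new physical pages" step of deflation.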
4.6 KSM (Kernel Same-page Merging)
KSM scans memory regions marked with MADV_MERGEABLE and CoW-merges identical pages:
// Mark guest memory as mergeable
madvise(guest_mem, guest_ram_size, MADV_MERGEABLE);
// KSM kernel thread (ksmd) periodically:
// 1. Checksums candidate pages and skips those that changed since the last scan
// 2. Finds identical pages by content comparison (even across different VMs)
// 3. CoW-merges them (one physical page, multiple mappings)
// 4. If a guest later writes, a CoW fault gives it a private copy
// Tuning (via /sys/kernel/mm/ksm/):
// pages_to_scan -- pages scanned per sleep interval (default 100)
// sleep_millisecs -- interval between scans (default 20)
// merge_across_nodes -- merge across NUMA nodes (default 1)
Savings: 100 VMs running the same OS image can share roughly 30-50% of memory, depending on how much of each guest's memory is identical. The tradeoff is the CPU cost of ksmd and latency spikes from CoW faults.
4.7 userfaultfd for Post-Copy Migration
// Register guest memory with userfaultfd
int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
struct uffdio_api api = { .api = UFFD_API };
ioctl(uffd, UFFDIO_API, &api);
struct uffdio_register reg = {
.range = { .start = (uint64_t)guest_mem, .len = guest_ram_size },
.mode = UFFDIO_REGISTER_MODE_MISSING,
};
ioctl(uffd, UFFDIO_REGISTER, &reg);
// When guest accesses an unmigrated page:
// 1. Page fault -> userfaultfd event -> VMM reads event from uffd
// 2. VMM requests the page from the source host over the network
// 3. VMM provides the page via UFFDIO_COPY
// 4. Guest resumes
struct uffdio_copy copy = {
.dst = (uint64_t)faulting_addr,
.src = (uint64_t)page_from_network,
.len = 4096,
.mode = 0,
};
ioctl(uffd, UFFDIO_COPY, &copy);
5. Interrupt Virtualization
5.1 Interrupt Controller Modes
KVM supports three IRQCHIP modes:
Mode 1: Full In-Kernel IRQCHIP (KVM_CREATE_IRQCHIP)
─────────────────────────────────────────────────────
┌──────────────────────────────┐
│ KVM (kernel) │
│ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │ PIC │ │IOAPIC│ │ LAPIC│ │
│ │(8259)│ │ │ │(per │ │
│ │ │ │ │ │ vCPU)│ │
│ └──────┘ └──────┘ └──────┘ │
└──────────────────────────────┘
Pros: Fastest. All interrupt routing in kernel. No exits for EOI/ICR.
Cons: Less flexible. Limited to what KVM implements.
Used by: QEMU (default), early Firecracker.
Mode 2: Split IRQCHIP (KVM_CAP_SPLIT_IRQCHIP)
──────────────────────────────────────────────
┌──────────────────────────────┐
│ Userspace (VMM) │
│ ┌──────┐ ┌──────┐ │
│ │ PIC │ │IOAPIC│ │
│ │(8259)│ │ │ │
│ └──────┘ └──────┘ │
└──────────────────────────────┘
┌──────────────────────────────┐
│ KVM (kernel) │
│ ┌──────┐ │
│ │ LAPIC│ (in-kernel, fast) │
│ └──────┘ │
└──────────────────────────────┘
Pros: LAPIC stays fast (no exits for timer/IPI). IOAPIC in userspace
allows flexible routing and MSI emulation.
Cons: Slightly more complex userspace code.
Used by: Firecracker, Cloud Hypervisor, crosvm.
Mode 3: Userspace IRQCHIP (no KVM_CREATE_IRQCHIP)
─────────────────────────────────────────────────
┌──────────────────────────────┐
│ Userspace (VMM) │
│ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │ PIC │ │IOAPIC│ │ LAPIC│ │
│ └──────┘ └──────┘ └──────┘ │
└──────────────────────────────┘
Pros: Maximum flexibility.
Cons: Slowest. Every LAPIC access (timer tick, IPI, EOI) causes a VM exit.
Used by: Nobody in production. Only for testing.
5.2 Interrupt Injection Flow
How an interrupt reaches the guest:
1. Device generates interrupt
│
▼
2. Routed to KVM via IRQFD (eventfd) or ioctl(KVM_IRQ_LINE)
│
▼
3. KVM GSI routing table ──> determines destination (IOAPIC pin, MSI, etc.)
│
▼
4. IOAPIC (in-kernel or userspace) determines:
- Which LAPIC(s) to deliver to (destination field)
- What vector to use
- Edge vs level triggered
│
▼
5. LAPIC receives interrupt request
- Sets the vector's bit in IRR (Interrupt Request Register)
- Compares the vector's priority class against the PPR (Processor Priority
Register, derived from the TPR and the highest in-service vector)
- When the class is higher than the PPR and guest IF=1:
sets bit in ISR (In-Service Register), clears IRR bit
│
▼
6. On next VM entry, KVM injects the interrupt:
- Intel: writes interrupt info into VMCS VM-entry interruption-information field
- AMD: writes event injection field in VMCB
- ARM: sets appropriate HCR_EL2 bits (VI, VF) or uses GICv3/GICv4
│
▼
7. Guest executes ISR (Interrupt Service Routine)
│
▼
8. Guest writes EOI to LAPIC
- If level-triggered: KVM notifies IOAPIC to re-evaluate
- If using IRQFD with resamplefd: KVM signals the resample eventfd
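The priority decision in step 5 is compact enough to write out. This is a minimal sketch of the architectural rule (the priority class is the upper four bits of a vector), not KVM's LAPIC code; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Highest set vector in a 256-bit register (IRR or ISR), or -1 if empty. */
static int apic_highest_vector(const uint32_t reg[8])
{
    for (int w = 7; w >= 0; w--)
        for (int b = 31; b >= 0; b--)
            if (reg[w] & (1u << b))
                return w * 32 + b;
    return -1;
}

/* PPR: the higher priority class of the TPR and the highest in-service vector. */
static uint8_t apic_ppr(uint8_t tpr, const uint32_t isr[8])
{
    int isrv = apic_highest_vector(isr);
    uint8_t isr_class = isrv < 0 ? 0 : (uint8_t)(isrv & 0xF0);
    return (tpr & 0xF0) >= isr_class ? tpr : isr_class;
}

/* Vector to deliver next, or -1 if nothing pending outranks the PPR. */
static int apic_next_deliverable(uint8_t tpr, const uint32_t irr[8],
                                 const uint32_t isr[8])
{
    int vec = apic_highest_vector(irr);
    if (vec < 0)
        return -1;
    return (vec & 0xF0) > (apic_ppr(tpr, isr) & 0xF0) ? vec : -1;
}
```

Note that a pending vector stays in the IRR while it is blocked; it becomes deliverable as soon as the guest lowers its TPR or finishes (EOIs) the in-service interrupt that outranks it.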
5.3 IRQFD Mechanism
IRQFD connects an eventfd to KVM's interrupt injection path, so a device backend can raise a guest interrupt with a plain eventfd write instead of an ioctl:
IRQFD Flow:
┌──────────────┐ eventfd_signal() ┌──────────────────┐
│ Device thread│ ──────────────────────── │ KVM (irqfd │
│ (in VMM) │ │ worker thread) │
│ │ write(efd, 1) │ │
└──────────────┘ ◄────────────────────── │ Inject IRQ into │
resamplefd (level-triggered │ guest LAPIC │
(EOI notification) re-trigger) └──────────────────┘
Benefits:
- No ioctl needed per interrupt injection
- Works with epoll (VMM can multiplex I/O events and interrupt injection)
- Compatible with VFIO (device interrupts -> eventfd -> KVM -> guest)
- For level-triggered: resamplefd notifies VMM when guest EOIs
Setup:
int efd = eventfd(0, 0);
struct kvm_irqfd irqfd = { .fd = efd, .gsi = 5 };
ioctl(vm_fd, KVM_IRQFD, &irqfd);
Inject interrupt:
uint64_t val = 1;
write(efd, &val, sizeof(val));
// KVM immediately injects GSI 5 into the guest
5.4 MSI/MSI-X Injection
MSI (Message Signaled Interrupts) bypass the IOAPIC entirely. The device writes a specific value to a specific address, and the chipset routes it directly to a LAPIC:
// MSI routing entry in KVM_SET_GSI_ROUTING:
struct kvm_irq_routing_entry msi_entry = {
.gsi = 24, // assigned GSI
.type = KVM_IRQ_ROUTING_MSI,
.u.msi = {
// MSI address format (x86):
// Bits 31:20 = 0xFEE (fixed prefix)
// Bits 19:12 = destination APIC ID
// Bit 3 = RH (redirect hint)
// Bit 2 = DM (destination mode: 0=physical, 1=logical)
.address_lo = 0xFEE00000 | (dest_apic_id << 12),
.address_hi = 0,
// MSI data format:
// Bits 7:0 = vector
// Bit 14 = level (0=deassert, 1=assert)
// Bit 15 = trigger mode (0=edge, 1=level)
.data = vector | (1 << 14), // edge-triggered, assert
},
};
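The address and data encodings in the comments above can be wrapped in small helpers, which also makes it easy to decode what a guest programmed into a device's MSI capability. This sketch covers only the fields described above (physical destination mode, fixed delivery, edge triggered); the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* x86 MSI address: 0xFEE prefix in bits 31:20, destination APIC ID in 19:12. */
static uint32_t msi_address_lo(uint8_t dest_apic_id)
{
    return 0xFEE00000u | ((uint32_t)dest_apic_id << 12);
}

/* MSI data: vector in bits 7:0, bit 14 = assert, bit 15 clear = edge. */
static uint32_t msi_data(uint8_t vector)
{
    return (uint32_t)vector | (1u << 14);
}

/* Decoders for values written by the guest. */
static int msi_address_valid(uint32_t address_lo)
{
    return (address_lo >> 20) == 0xFEE;
}

static uint8_t msi_dest_id(uint32_t address_lo)
{
    return (uint8_t)((address_lo >> 12) & 0xFF);
}

static uint8_t msi_vector(uint32_t data)
{
    return (uint8_t)(data & 0xFF);
}
```

For APIC ID 3 and vector 0x41 this produces address 0xFEE03000 and data 0x4041, the values that would go into `address_lo` and `data` of the routing entry.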
6. I/O Virtualization
6.1 I/O Trapping: PIO and MMIO
I/O Trap Path:
Guest executes:
OUT 0x3F8, AL (PIO -- serial port write)
or
MOV [0xFEE00000], EAX (MMIO -- LAPIC write)
│
▼
Hardware traps (VM exit):
Intel: Exit reason = "I/O instruction" or "EPT violation"
AMD: Exit code = IOIO or NPF
│
▼
KVM checks:
1. Is there an IOEVENTFD for this address? → signal eventfd, resume guest (no exit to userspace)
2. Is this an in-kernel device (LAPIC, IOAPIC, PIT)? → handle in kernel
3. Otherwise → exit to userspace via kvm_run
│
▼
VMM handles exit:
kvm_run.exit_reason = KVM_EXIT_IO or KVM_EXIT_MMIO
VMM emulates the device, fills in response data
VMM calls KVM_RUN to resume guest
6.2 IOEVENTFD
IOEVENTFD is the "doorbell" mechanism -- KVM signals an eventfd when the guest writes to a specific address, WITHOUT causing a VM exit to userspace:
Without IOEVENTFD: With IOEVENTFD:
Guest write to 0x500 Guest write to 0x500
│ │
▼ ▼
VM exit (expensive) KVM signals eventfd
│ (no exit to userspace)
▼ │
VMM handles in userspace ▼
VMM calls KVM_RUN VMM device thread
(context switches, TLB wakes up via epoll
pressure) (asynchronous)
This is used as the virtio notification mechanism. The guest writes to the doorbell address to notify the VMM that new virtio descriptors are available. The VMM device thread, blocked on epoll, wakes up and processes the virtqueue.
// Set up IOEVENTFD for a virtio MMIO doorbell at 0xd0000050
int doorbell_fd = eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);
struct kvm_ioeventfd ev = {
.addr = 0xd0000050, // QueueNotify register for virtio-mmio
.len = 4, // 4-byte write
.fd = doorbell_fd,
.flags = 0, // MMIO (not PIO)
// .flags = KVM_IOEVENTFD_FLAG_PIO // for PIO
// .flags |= KVM_IOEVENTFD_FLAG_DATAMATCH // only trigger on specific value
// .datamatch = 0x1234 // the value to match
};
ioctl(vm_fd, KVM_IOEVENTFD, &ev);
// VMM device thread:
struct epoll_event events[10];
int epfd = epoll_create1(0);
struct epoll_event ev_cfg = { .events = EPOLLIN, .data.fd = doorbell_fd };
epoll_ctl(epfd, EPOLL_CTL_ADD, doorbell_fd, &ev_cfg);
while (running) {
int n = epoll_wait(epfd, events, 10, -1);
for (int i = 0; i < n; i++) {
uint64_t val;
read(events[i].data.fd, &val, sizeof(val));
process_virtqueue();
}
}
6.3 Virtio Architecture
Virtio is the standard paravirtual I/O framework. The guest knows it is virtualized and cooperates with the VMM through shared memory virtqueues:
Virtio Architecture:
┌─────────────────────────────────────────────────────────────┐
│ Guest │
│ ┌──────────────┐ │
│ │ virtio driver│ (e.g., virtio-net, virtio-blk) │
│ │ (guest OS) │ │
│ └──────┬───────┘ │
│ │ writes descriptors to vring │
│ ▼ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Virtqueue (vring) │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌────────────┐ │ │
│ │ │ Descriptor │ │ Available │ │ Used Ring │ │ │
│ │ │ Table │ │ Ring │ │ │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ [addr,len, │ │ [idx, │ │ [idx, │ │ │
│ │ │ flags,next]│ │ ring[]] │ │ ring[]] │ │ │
│ │ └─────────────┘ └─────────────┘ └────────────┘ │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ notify (MMIO write or PIO) │
└─────────┼───────────────────────────────────────────────────┘
│ IOEVENTFD (no exit to userspace)
▼
┌─────────────────────────────────────────────────────────────┐
│ VMM (userspace) │
│ ┌──────────────┐ │
│ │ virtio device│ (e.g., TAP backend, file backend) │
│ │ (backend) │ │
│ └──────┬───────┘ │
│ │ processes descriptors, puts them on Used Ring │
│ │ signals guest via IRQFD │
└─────────┘ │
└─────────────────────────────────────────────────────────────┘
Vring structure (from linux/virtio_ring.h):
// Descriptor: describes a buffer
struct vring_desc {
__le64 addr; // Guest physical address of buffer
__le32 len; // Length of buffer
__le16 flags; // VRING_DESC_F_NEXT (chained), _WRITE (device writes), _INDIRECT
__le16 next; // Index of next descriptor in chain
};
// Available ring: guest puts descriptor indices here for the device to consume
struct vring_avail {
__le16 flags; // VRING_AVAIL_F_NO_INTERRUPT (driver asks the device not to interrupt it)
__le16 idx; // Where guest will put next entry (monotonically increasing)
__le16 ring[]; // Array of descriptor indices
};
// Used ring: device puts completed descriptor indices here
struct vring_used {
__le16 flags; // VRING_USED_F_NO_NOTIFY (device asks the driver not to send notifications)
__le16 idx; // Where device will put next entry
struct vring_used_elem ring[];
};
struct vring_used_elem {
__le32 id; // Index of head of descriptor chain
__le32 len; // Number of bytes written by device
};
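A device backend spends most of its time walking descriptor chains, and because the chain is guest-controlled it must guard against out-of-range indices and loops. The sketch below shows one way to do that on a local mirror of the descriptor layout, assuming a little-endian host; `struct desc`, `vring_walk_chain`, and the `DESC_F_*` constants are hypothetical stand-ins for the definitions in linux/virtio_ring.h, and indirect descriptors are not handled.

```c
#include <assert.h>
#include <stdint.h>

#define DESC_F_NEXT  1u   /* chain continues at .next */
#define DESC_F_WRITE 2u   /* device writes this buffer (otherwise device reads) */

/* Local mirror of struct vring_desc for a little-endian host. */
struct desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

/*
 * Walk the chain starting at head. Returns the number of descriptors in
 * the chain, or -1 if it is malformed (bad index or longer than the
 * queue, which implies a loop). Byte totals are returned separately for
 * device-readable and device-writable buffers.
 */
static int vring_walk_chain(const struct desc *table, uint16_t qsize,
                            uint16_t head,
                            uint64_t *readable, uint64_t *writable)
{
    uint16_t idx = head;
    int n = 0;

    *readable = *writable = 0;
    for (;;) {
        if (idx >= qsize || n >= qsize)
            return -1;

        const struct desc *d = &table[idx];
        if (d->flags & DESC_F_WRITE)
            *writable += d->len;
        else
            *readable += d->len;
        n++;

        if (!(d->flags & DESC_F_NEXT))
            return n;
        idx = d->next;
    }
}
```

A typical virtio-blk read request shows up as a three-descriptor chain: a device-readable header, a device-writable data buffer, and a one-byte device-writable status.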
Virtio transport types:
| Transport | How it works | Used by |
|---|---|---|
| virtio-pci | Device appears as PCI device. BARs for config/notify/ISR. | QEMU, most VMMs |
| virtio-mmio | Device at a fixed MMIO address. Simpler, no PCI bus. | Firecracker, embedded |
6.4 vhost: In-Kernel Virtio Data Plane
vhost moves the virtio data plane into the kernel, eliminating userspace context switches for every packet/block I/O:
Without vhost: With vhost-net:
Guest Guest
│ virtqueue │ virtqueue
▼ ▼
VM exit (IOEVENTFD) VM exit (IOEVENTFD)
│ │
▼ ▼
VMM (userspace) KVM (kernel)
│ read from virtqueue │
│ process packet ▼
│ write to TAP fd vhost-net kernel thread
│ system call overhead │ directly accesses vring
▼ │ (guest memory is mapped)
Kernel (TAP driver) │ copies to/from TAP
│ │ NO userspace involvement
▼ ▼
Network stack Network stack
Overhead: 2 context switches Overhead: 0 context switches
per packet (user->kernel->user) on data path
vhost variants:
| Variant | Location | Use Case |
|---|---|---|
| vhost-net | In kernel (/dev/vhost-net) | Network I/O via TAP |
| vhost-scsi | In kernel (/dev/vhost-scsi) | SCSI target for block I/O |
| vhost-vsock | In kernel (/dev/vhost-vsock) | Host-guest socket communication |
| vhost-user | Userspace process (socket-based) | DPDK, SPDK, custom backends |
vhost-user uses a Unix domain socket protocol between the VMM and a separate backend process. This allows high-performance backends like DPDK (for networking) or SPDK (for storage) to serve virtio devices without being part of the VMM process.
6.5 VFIO Device Passthrough
VFIO gives a guest direct access to a physical device with IOMMU protection. See the companion document VFIO Internals for the full VFIO API. Key integration points with KVM:
VFIO + KVM Device Passthrough:
┌───────────────┐
│ Guest VM │
│ ┌──────────┐ │ Direct access (no VMM involvement)
│ │ Device │ │ ◄─────────────────────────────────────────┐
│ │ Driver │ │ │
│ └──────────┘ │ │
└───────────────┘ │
│
┌────────────────────────────────────────────────────────┐ │
│ VMM (QEMU/Firecracker) │ │
│ │ │
│ 1. Open VFIO group (/dev/vfio/N) │ │
│ 2. Get device fd (ioctl VFIO_GROUP_GET_DEVICE_FD) │ │
│ 3. Map device BARs into guest address space │ │
│ (KVM_SET_USER_MEMORY_REGION for device MMIO) │ │
│ 4. Map DMA: VFIO_IOMMU_MAP_DMA │──┘
│ (GPA range -> HVA, so device DMA reaches guest RAM)│
│ 5. Configure interrupts: VFIO_DEVICE_SET_IRQS │
│ (device MSI-X -> eventfd -> KVM IRQFD -> guest) │
│ 6. Set up posted interrupts (if supported) │
└────────────────────────────────────────────────────────┘
Performance: Device passthrough achieves near-native performance:
- Network: Line-rate with DPDK, <2us latency
- NVMe: Full IOPS (millions per second) with no hypervisor overhead
- GPU: Full GPU performance for ML/HPC workloads
7. vCPU Scheduling & Performance
7.1 vCPUs as Linux Threads
Each KVM vCPU is a regular CLONE_VM Linux thread in the VMM process. This means:
- The Linux scheduler (CFS, or EEVDF since Linux 6.6) schedules vCPUs alongside all other host threads
- vCPUs compete for CPU time with host processes, other VMs, kernel threads
- Standard Linux scheduling tools work: taskset, cpuset, chrt, nice, cgroups
- vCPU threads show up in /proc/<pid>/task/<tid>/
Process view of a VM with 4 vCPUs:
QEMU PID 12345
├── Main thread (event loop, management)
├── vCPU 0 (tid 12346) ── KVM_RUN loop
├── vCPU 1 (tid 12347) ── KVM_RUN loop
├── vCPU 2 (tid 12348) ── KVM_RUN loop
├── vCPU 3 (tid 12349) ── KVM_RUN loop
├── I/O thread (block I/O)
└── VNC/SPICE thread (display)
7.2 vCPU Pinning
Pinning vCPUs to physical CPUs eliminates scheduling jitter and improves cache locality:
// Pin vCPU thread to physical CPU
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(physical_cpu, &cpuset);
pthread_setaffinity_np(vcpu_thread, sizeof(cpuset), &cpuset);
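// QEMU has no command-line flag for vCPU pinning: start the guest with
// -smp 4, then pin each vCPU thread from outside (e.g. taskset -pc 2 <tid>,
// with thread IDs taken from the QMP query-cpus-fast command).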
// With libvirt:
// <vcpupin vcpu="0" cpuset="2"/>
// <vcpupin vcpu="1" cpuset="3"/>
// <emulatorpin cpuset="0-1"/>
NUMA considerations:
- Pin vCPUs to CPUs in the same NUMA node as the guest's backing memory
- Use numactl or set_mempolicy() to bind guest memory to the correct NUMA node
- Cross-NUMA memory access adds roughly 50-100ns latency per access
7.3 Halt-Polling
When a vCPU executes HLT (idle), KVM first busy-polls for a short time and only then puts the vCPU thread to sleep:
vCPU HLT handling:
Guest executes HLT
│
▼
┌────────────────────┐
│ Halt-poll phase │ KVM busy-waits for halt_poll_ns
│ (kernel busy-wait) │ (default: 200,000 ns = 200us)
│ │
│ Check: interrupt │──── Yes ──► Resume guest immediately
│ pending? │ (near-zero latency)
│ │
│ Timer expired? │──── Yes ──► Fall through to sleep
└────────────────────┘
│
▼
┌────────────────────┐
│ Scheduled out │ Linux scheduler sleeps the vCPU thread
│ (kernel sleep) │ (context switch, cache cold on wake)
│ │
│ Wake on: │
│ - Interrupt │
│ - Timer │
│ - KVM_RUN signal │
└────────────────────┘
Tuning: /sys/module/kvm/parameters/halt_poll_ns
- 0 = never poll (save CPU, higher latency)
- 200000 = default (200us busy-wait)
- Higher values = lower latency for interrupt-heavy workloads but wastes CPU
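KVM also adapts the poll window per vCPU. If the vCPU was woken shortly after polling gave up (the total halt was still shorter than halt_poll_ns), the window grows by a factor of halt_poll_ns_grow, capped at halt_poll_ns. If the halt lasted longer than halt_poll_ns, polling could not have helped, so the window shrinks according to halt_poll_ns_shrink (0 resets it to zero). A successful poll leaves the window unchanged.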
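The adaptive behaviour can be modelled as a pure function from the current window and the observed halt duration to the next window. The sketch below is a simplified model based on the halt-polling parameters exposed under /sys/module/kvm/parameters/, not a copy of the kernel code; `struct halt_poll_cfg` and `halt_poll_next` are hypothetical names, and the values used in the usage note (grow factor 2, grow start 10,000 ns, shrink 0) are assumed x86 defaults that should be checked on the target kernel.

```c
#include <assert.h>
#include <stdint.h>

struct halt_poll_cfg {
    uint64_t max_ns;      /* halt_poll_ns: upper bound on the window */
    uint64_t grow;        /* halt_poll_ns_grow: multiplier, 0 = never grow */
    uint64_t grow_start;  /* halt_poll_ns_grow_start: first non-zero window */
    uint64_t shrink;      /* halt_poll_ns_shrink: divisor, 0 = reset to zero */
};

/*
 * Next per-vCPU poll window, given the current window and how long the
 * vCPU was halted in total (polling plus sleeping).
 */
static uint64_t halt_poll_next(const struct halt_poll_cfg *c,
                               uint64_t cur_ns, uint64_t halted_ns)
{
    if (c->max_ns == 0)
        return 0;                       /* polling disabled */
    if (halted_ns <= cur_ns)
        return cur_ns;                  /* poll succeeded: keep the window */
    if (halted_ns > c->max_ns)          /* long halt: polling cannot help */
        return c->shrink ? cur_ns / c->shrink : 0;

    /* Short halt that the current window missed: grow. */
    if (c->grow == 0)
        return cur_ns;
    uint64_t next = cur_ns * c->grow;
    if (next < c->grow_start)
        next = c->grow_start;
    if (next > c->max_ns)
        next = c->max_ns;
    return next;
}
```

With those assumed defaults and a 200,000 ns cap, a vCPU that keeps being woken after about 15 us converges on a 20 us window (0 -> 10,000 -> 20,000), while a single multi-millisecond idle period drops it back to zero.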
7.4 Paravirtual Optimizations
PV Spinlocks: When a vCPU is preempted while holding a spinlock, other vCPUs spin waiting for a lock that cannot be released until the holder is scheduled back. PV spinlocks solve this:
1. Guest kernel detects it's on KVM (CPUID leaf 0x40000000 signature) and checks for KVM_FEATURE_PV_UNHALT in leaf 0x40000001
2. Guest uses KVM_HC_KICK_CPU hypercall to wake a specific vCPU
3. If a vCPU spins too long on a lock, it yields via HLT
4. When the lock holder releases, it uses KVM_HC_KICK_CPU to wake waiters
Without PV spinlocks: O(N^2) wasted cycles (N = overcommitted vCPUs)
With PV spinlocks: lock waiters sleep, holder wakes them on release
PV TLB Flush:
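Remote TLB flush (e.g., munmap on a multi-threaded process) normally sends IPIs to all CPUs running that address space. In a VM, an IPI to a vCPU that is currently preempted cannot complete until the host schedules that vCPU again, so the sender stalls. PV TLB flush avoids waiting on preempted vCPUs:
1. The guest checks each target's preempted byte in the steal-time area; for a preempted vCPU it sets KVM_VCPU_FLUSH_TLB there instead of sending an IPI
2. On that vCPU's next VM entry, KVM sees the flag and flushes its TLB
3. Running vCPUs still receive a normal IPI, so the saving grows with the degree of overcommit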
PV Steal Time: KVM reports to the guest how much CPU time was "stolen" (the vCPU was scheduled out):
// KVM writes steal time info to a per-vCPU shared memory page
struct kvm_steal_time {
__u64 steal; // cumulative stolen time in nanoseconds
__u32 version; // incremented on update (seqlock-like; odd = update in progress)
__u32 flags; // currently unused
__u8 preempted; // bit 0: KVM_VCPU_PREEMPTED, bit 1: KVM_VCPU_FLUSH_TLB
__u8 u8_pad[3];
__u32 pad[11]; // pads the structure to 64 bytes
};
// Guest reads this to:
// - Account stolen time in CPU usage statistics (avoids 100% CPU lies)
// - Adjust timeout calculations
// - Decide whether to spin or yield on locks
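A guest-side reader has to tolerate KVM updating the record concurrently. The following is a minimal sketch of the seqlock-style read, assuming a local mirror of the first fields of the record; `steal_time_view`, `read_steal_ns`, and `vcpu_is_preempted_hint` are hypothetical names, and a real guest would map the page that it registered with the hypervisor rather than a local struct.

```c
#include <stdint.h>

/* Hypothetical mirror of the leading fields of the steal-time record. */
struct steal_time_view {
    uint64_t steal;      /* cumulative stolen time, ns */
    uint32_t version;    /* odd while the host is mid-update */
    uint32_t flags;
    uint8_t  preempted;  /* bit 0 set while the vCPU is scheduled out */
};

/* Read a consistent steal value: retry while an update is in progress
 * or if the version changed underneath us. */
static uint64_t read_steal_ns(const volatile struct steal_time_view *st)
{
    uint32_t before;
    uint64_t steal;

    do {
        before = st->version;
        __atomic_thread_fence(__ATOMIC_ACQUIRE);
        steal = st->steal;
        __atomic_thread_fence(__ATOMIC_ACQUIRE);
    } while ((before & 1u) || before != st->version);

    return steal;
}

/* Hint used by lock slow paths: spinning on a preempted holder is futile. */
static int vcpu_is_preempted_hint(const volatile struct steal_time_view *st)
{
    return st->preempted & 1u;
}
```

The version check is what makes the two-field read safe without a lock: the writer bumps `version` to an odd value before touching `steal` and to the next even value afterwards.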
7.5 TSC Handling
The TSC (Time Stamp Counter) is the highest-resolution time source on x86. Virtualizing it correctly is critical:
TSC Challenges in VMs:
1. Guest TSC must be monotonic even if vCPU migrates between physical CPUs
2. Guest TSC should not jump when vCPU is descheduled
3. Multiple VMs must have independent TSC values
4. Live migration: source and destination hosts may have different TSC frequencies
Solutions:
- TSC offsetting: VMCS/VMCB has a TSC_OFFSET field added to every RDTSC
guest_tsc = host_tsc + tsc_offset
- TSC scaling (KVM_CAP_TSC_CONTROL): multiply host TSC by a factor
guest_tsc = host_tsc * scale_factor + tsc_offset
Allows live migration between hosts with different TSC frequencies
- KVM_SET_TSC_KHZ: set the guest's TSC frequency
- Without hardware TSC scaling, KVM can only approximate a faster guest TSC in software
by periodically adjusting the offset ("catch-up" mode), which is coarser than scaling
KVM ioctl:
// KVM_CAP_TSC_CONTROL is not enabled with KVM_ENABLE_CAP; it only reports
// whether hardware TSC scaling is available:
int has_scaling = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_TSC_CONTROL);
// Set the per-vCPU guest TSC frequency; the argument is in kHz
if (has_scaling > 0)
ioctl(vcpu_fd, KVM_SET_TSC_KHZ, 2400000); // 2.4 GHz
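The scaling formula is fixed-point arithmetic in hardware. The sketch below works through it under stated assumptions: the number of fractional bits is passed in as a parameter (VMX and SVM use different widths), the helper names `tsc_ratio` and `scale_tsc` are hypothetical, and `unsigned __int128` is a GCC/Clang extension used here to hold the wide product.

```c
#include <stdint.h>

/* Fixed-point multiplier that maps host TSC ticks to guest TSC ticks. */
static uint64_t tsc_ratio(uint32_t guest_khz, uint32_t host_khz,
                          unsigned frac_bits)
{
    return (uint64_t)(((unsigned __int128)guest_khz << frac_bits) / host_khz);
}

/* guest_tsc = host_tsc * scale_factor + tsc_offset, with scale_factor
 * expressed as ratio / 2^frac_bits. */
static uint64_t scale_tsc(uint64_t host_tsc, uint64_t ratio,
                          unsigned frac_bits, uint64_t tsc_offset)
{
    unsigned __int128 product = (unsigned __int128)host_tsc * ratio;
    return (uint64_t)(product >> frac_bits) + tsc_offset;
}
```

For a guest migrated from a 2.4 GHz host to a 3.0 GHz host, `tsc_ratio(2400000, 3000000, 48)` yields a factor just under 0.8, so the guest keeps seeing a 2.4 GHz counter; the offset is then chosen so the counter continues from the value it had on the source.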
TSC Deadline Timer: The LAPIC timer in TSC-deadline mode fires when TSC reaches a programmed value. This is the most precise timer available and is used by modern guest kernels:
// Check support:
ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_TSC_DEADLINE_TIMER);
// KVM exposes this via CPUID (bit 24 of CPUID.1.ECX)
// and handles the timer internally using hrtimers
7.6 Performance Counters Virtualization (vPMU)
KVM can expose hardware performance counters to the guest:
// Enable via CPUID filtering (expose PMU CPUID leaves)
// and MSR access (allow RDPMC, PERF_GLOBAL_CTRL, etc.)
// Performance impact: vPMU adds VM exits for PMU counter overflow interrupts
// Some clouds disable vPMU for security (side channels) and performance
8. Live Migration
8.1 Pre-Copy Migration
The default QEMU migration algorithm:
Pre-Copy Migration Timeline:
Source Host Destination Host
│ │
│ 1. Setup phase │
│ - Negotiate capabilities │
│ - Create destination VM shell │
│ │
│ 2. Bulk transfer (iteration 0) │
│ - Enable dirty logging │
│ - Transfer ALL guest RAM ────────────> Receive and map
│ │
│ 3. Iterative phase │
│ - Get dirty bitmap ──────────────────> Track progress
│ - Transfer dirty pages ──────────────> Apply pages
│ - Repeat until dirty rate < bandwidth
│ or max iterations reached │
│ │
│ 4. Stop-and-copy (downtime starts) │
│ - Pause source VM │
│ - Transfer remaining dirty pages │
│ - Transfer device state ─────────────> Apply state
│ - Transfer vCPU state ───────────────> Apply state
│ │
│ 5. Switchover │
│ - Signal destination to start ───────> Resume VM
│ - (downtime ends) │
│ - Source VM destroyed │
└──────────────────────────────────────┘
Typical metrics:
- Total migration time: seconds to minutes (depends on RAM size and dirty rate)
- Downtime: 10-100ms (optimized) to seconds (naive)
- Bandwidth: limited by network (10 Gbps is about 1.25 GB/s at line rate, somewhat less after protocol overhead)
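Whether pre-copy converges follows from simple arithmetic: each pass takes `remaining / bandwidth` seconds, and the guest dirties `dirty_rate * pass_time` bytes in the meantime. The sketch below is an idealized model under those assumptions (constant dirty rate and bandwidth, no compression); `precopy_rounds` is a hypothetical helper, not a QEMU function.

```c
/* Number of pre-copy passes before the remaining dirty data can be sent
 * within the downtime budget, or -1 if that never happens within
 * max_iters passes. All sizes in bytes, rates in bytes per second. */
static int precopy_rounds(double ram_bytes, double dirty_rate,
                          double bandwidth, double max_downtime_s,
                          int max_iters)
{
    double remaining = ram_bytes;   /* pass 0 sends all of guest RAM */

    for (int i = 0; i < max_iters; i++) {
        double pass_time = remaining / bandwidth;
        if (pass_time <= max_downtime_s)
            return i;               /* stop-and-copy fits in the budget */
        /* Pages dirtied while this pass was on the wire form the next pass. */
        remaining = dirty_rate * pass_time;
        if (remaining > ram_bytes)
            remaining = ram_bytes;  /* cannot dirty more than all of RAM */
    }
    return -1;
}
```

With 8 GB of RAM, 1 GB/s of bandwidth, and a 100 MB/s dirty rate, the passes shrink from 8 s to 0.8 s to 0.08 s, so two passes precede the final stop-and-copy. When the dirty rate exceeds the bandwidth the model never converges, which is the case auto-converge and post-copy exist for.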
8.2 Post-Copy Migration
Post-copy transfers the minimum state first (vCPU, device state) and demand-pages the rest:
Post-Copy Migration:
Source Host Destination Host
│ │
│ 1. Pause source VM │
│ 2. Transfer vCPU + device state ────────> Apply state
│ 3. Signal destination ──────────────────> Register userfaultfd on guest memory,
│ │ resume VM immediately (downtime: ~milliseconds)
│ │
│ 4. Guest accesses unmigrated page │
│ ◄──────── userfaultfd event
│ Send requested page ─────────────────> UFFDIO_COPY, resume
│ │
│ 5. Background push remaining pages │
│ (proactively, while guest runs) │
│ │
│ Tradeoff: │
│ - Minimal downtime (ms) │
│ - But page faults during runtime │
│ cause latency spikes │
│ - Source must stay alive until all │
│ pages transferred │
└──────────────────────────────────────┘
8.3 KVM Dirty Page Tracking APIs
// Method 1: Bitmap (classic, KVM_CAP_DIRTY_LOG)
// Enable:
region.flags |= KVM_MEM_LOG_DIRTY_PAGES;
ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
// Collect:
ioctl(vm_fd, KVM_GET_DIRTY_LOG, &dirty_log);
// Each bit = one 4KB page. Bit set = page was written since last query.
// KVM re-write-protects all pages after KVM_GET_DIRTY_LOG.
// Method 2: Clear-dirty-log (KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2)
// More granular: clear dirty bits for a subrange without re-protecting everything
// Enable:
struct kvm_enable_cap cap = { .cap = KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 };
cap.args[0] = KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE;
ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
// Use KVM_CLEAR_DIRTY_LOG for subrange clearing
// Method 3: Dirty ring (KVM_CAP_DIRTY_LOG_RING, Linux 5.11+)
// Ring buffer per vCPU, KVM pushes dirty page events as they happen.
// Avoids scanning a large bitmap.
// Best for large VMs where bitmap scanning is expensive.
struct kvm_enable_cap ring_cap = {
.cap = KVM_CAP_DIRTY_LOG_RING,
.args[0] = ring_size, // ring size in bytes, must be a power of 2
};
ioctl(vm_fd, KVM_ENABLE_CAP, &ring_cap); // must happen before any vCPU is created
// Each vCPU's ring is mapped from its vCPU fd at a fixed page offset:
struct kvm_dirty_gfn *ring = mmap(NULL, ring_size, PROT_READ | PROT_WRITE, MAP_SHARED,
vcpu_fd, KVM_DIRTY_LOG_PAGE_OFFSET * sysconf(_SC_PAGESIZE));
// After harvesting entries, ioctl(vm_fd, KVM_RESET_DIRTY_RINGS, 0) recycles them.
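For the bitmap methods, the VMM still has to turn the returned bitmap into a list of pages to send. The following is a minimal sketch of that scan, assuming a little-endian 64-bit host where bit N of the bitmap corresponds to page N of the memory slot; `collect_dirty_gfns` is a hypothetical helper and `__builtin_ctzll` is a GCC/Clang builtin.

```c
#include <stddef.h>
#include <stdint.h>

/* Walk a dirty bitmap and write the guest frame number of each dirty page
 * to out[]. Returns the total number of dirty pages found, which may
 * exceed out_cap (only the first out_cap entries are stored). */
static size_t collect_dirty_gfns(const uint64_t *bitmap, size_t npages,
                                 uint64_t base_gfn,
                                 uint64_t *out, size_t out_cap)
{
    size_t found = 0;
    size_t nwords = (npages + 63) / 64;

    for (size_t w = 0; w < nwords; w++) {
        uint64_t word = bitmap[w];
        while (word) {
            unsigned bit = (unsigned)__builtin_ctzll(word); /* lowest set bit */
            size_t page = w * 64 + bit;
            word &= word - 1;                               /* clear that bit */
            if (page >= npages)
                break;                /* padding bits beyond the slot */
            if (found < out_cap)
                out[found] = base_gfn + page;
            found++;
        }
    }
    return found;
}
```

Skipping zero words is what keeps the scan cheap when few pages are dirty, but the cost still scales with slot size, which is the overhead the dirty ring is designed to avoid.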
8.4 Device State Serialization
During migration, all device state must be captured and restored. Each device emulated by the VMM must implement save/load:
Device state for migration:
vCPU state:
- Registers (KVM_GET_REGS, KVM_GET_SREGS, KVM_GET_FPU)
- MSRs (KVM_GET_MSRS)
- CPUID (KVM_GET_CPUID2)
- LAPIC (KVM_GET_LAPIC)
- XCRS (KVM_GET_XCRS) -- extended control registers (XCR0 for AVX, etc.)
- Debug registers (KVM_GET_DEBUGREGS)
- vCPU events (KVM_GET_VCPU_EVENTS) -- pending exceptions/NMIs/SMIs
- Nested state (KVM_GET_NESTED_STATE) -- if running nested VMs
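- MP state (KVM_GET_MP_STATE) -- running, halted, or waiting for SIPI
- TSC value (MSR_IA32_TSC via KVM_GET_MSRS)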
In-kernel device state:
- PIT (KVM_GET_PIT2)
- IOAPIC (KVM_GET_IRQCHIP with KVM_IRQCHIP_IOAPIC)
- PIC (KVM_GET_IRQCHIP with KVM_IRQCHIP_PIC_MASTER / _SLAVE)
- Clock (KVM_GET_CLOCK)
VMM-emulated device state:
- Each device serializes its own state (QEMU uses VMStateDescription)
- Serial ports, network cards, block devices, USB, etc.
- Includes in-flight I/O, buffer contents, register values
8.5 Downtime Optimization Techniques
| Technique | How it Helps |
|---|---|
| Auto-converge | Throttle guest CPU to reduce dirty rate when migration isn't converging |
| XBZRLE (Xor-Based Zero Run-Length Encoding) | Compress dirty pages by XOR with previous version, RLE encode the diff |
| Multifd | Parallel migration streams across multiple TCP connections |
| Postcopy | Shrink stop-and-copy to vCPU and device state; RAM is demand-paged after switchover |
| Dirty limit | Cap the dirty page rate via KVM throttling |
| Compression | zlib/zstd compress pages before transfer |
9. Security & Confidential Computing
9.1 AMD SEV (Secure Encrypted Virtualization)
SEV encrypts guest VM memory with a per-VM AES key managed by the AMD Secure Processor (PSP/ASP), a dedicated ARM Cortex-A5 on the CPU die:
SEV Architecture:
┌────────────────────────────────────────────┐
│ AMD CPU Die │
│ │
│ ┌───────────┐ ┌───────────────┐ │
│ │ x86 Cores │ │ AMD Secure │ │
│ │ │ │ Processor │ │
│ │ ┌─────┐ │ │ (ARM Cortex │ │
│ │ │Guest│ │ │ A5, runs │ │
│ │ │ VM │ │ │ firmware) │ │
│ │ └─────┘ │ │ │ │
│ └───────────┘ │ Manages: │ │
│ │ │ - AES keys │ │
│ │ │ - Attestation │ │
│ ▼ │ - Key derivn │ │
│ ┌───────────┐ └───────────────┘ │
│ │ Memory │ │
│ │ Controller│ AES-128 encryption inline │
│ │ (w/ AES) │ (every cache line leaving │
│ │ │ to DRAM is encrypted) │
│ └───────────┘ │
└────────────────────────────────────────────┘
SEV variants:
| Variant | Protection | Protects Against |
|---|---|---|
| SEV (EPYC Naples) | Memory encryption with per-VM keys | Physical memory snooping, cold boot attacks |
| SEV-ES (EPYC Rome) | + Encrypted register state (the VM Save Area, VMSA, is encrypted) | Hypervisor inspecting guest registers on VM exit |
| SEV-SNP (EPYC Milan) | + Integrity protection (Reverse Map Table), attestation | Hypervisor remapping guest memory, replay attacks, tampering |
// SEV API (simplified):
// 1. Create SEV-enabled VM
int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);
// 2. Enable SEV
struct kvm_sev_cmd cmd = {
.id = KVM_SEV_INIT, // or KVM_SEV_ES_INIT; SEV-SNP guests are initialized with KVM_SEV_INIT2 on kernels that support it
.sev_fd = open("/dev/sev", O_RDWR),
};
ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
// 3. Launch start
struct kvm_sev_launch_start start = {
.handle = 0,
.policy = SEV_POLICY_NODBG, // OR in SEV_POLICY_ES only for a VM initialized with KVM_SEV_ES_INIT
};
cmd.id = KVM_SEV_LAUNCH_START;
cmd.data = (uint64_t)&start;
ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
// 4. Launch update -- encrypt guest memory
struct kvm_sev_launch_update_data update = {
.uaddr = (uint64_t)guest_mem,
.len = guest_mem_size,
};
cmd.id = KVM_SEV_LAUNCH_UPDATE_DATA;
cmd.data = (uint64_t)&update;
ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
// 5. Launch measure -- get measurement for attestation
struct kvm_sev_launch_measure measure = { ... }; // .uaddr/.len describe the buffer that receives the measurement
cmd.id = KVM_SEV_LAUNCH_MEASURE;
cmd.data = (uint64_t)&measure;
ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
// 6. Launch finish
cmd.id = KVM_SEV_LAUNCH_FINISH;
ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
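The policy constants used in the launch sequence are not provided by `<linux/kvm.h>`, so a VMM has to define them itself. The sketch below does so from the guest policy bit layout in AMD's SEV API specification; the enum and the helper `sev_make_policy` are hypothetical names chosen for illustration, and only the low policy bits are modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Low bits of the SEV guest policy field. */
enum sev_policy_bits {
    SEV_POLICY_NODBG  = 1u << 0,  /* debugging of the guest is disallowed */
    SEV_POLICY_NOKS   = 1u << 1,  /* key sharing with other guests is disallowed */
    SEV_POLICY_ES     = 1u << 2,  /* SEV-ES (encrypted register state) is required */
    SEV_POLICY_NOSEND = 1u << 3,  /* sending the guest to another platform is disallowed */
    SEV_POLICY_DOMAIN = 1u << 4,  /* migration only within the same domain */
    SEV_POLICY_SEV    = 1u << 5,  /* migration only to SEV-capable platforms */
};

/* Build a policy word from three common decisions. */
static uint32_t sev_make_policy(bool allow_debug, bool encrypted_state,
                                bool allow_migration)
{
    uint32_t policy = 0;

    if (!allow_debug)
        policy |= SEV_POLICY_NODBG;
    if (encrypted_state)
        policy |= SEV_POLICY_ES;
    if (!allow_migration)
        policy |= SEV_POLICY_NOSEND;
    return policy;
}
```

The policy is part of the launch measurement, so the guest owner can verify during attestation that the hypervisor did not start the VM with debugging enabled.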
9.2 Intel TDX (Trust Domain Extensions)
TDX creates "Trust Domains" (TDs) that are protected from the VMM, other VMs, and even SMM:
TDX Architecture:
┌──────────────────────────────────────────────────┐
│ Intel CPU │
│ │
│ TDX Module (runs in SEAM mode, signed by Intel) │
│ ┌────────────────────────────────────────────┐ │
│ │ Manages TD lifecycle, memory encryption, │ │
│ │ attestation, SEPT (Secure EPT) │ │
│ └────────────────────────────────────────────┘ │
│ ▲ ▲ │
│ │ SEAMCALL │ TDCALL │
│ │ │ │
│ ┌─────────┐ ┌─────────┐ │
│ │ VMM/KVM │ │ TD Guest│ │
│ │ (host) │ │ (trust │ │
│ │ │ cannot │ domain) │ │
│ │ │ read TD │ │ │
│ │ │ memory │ │ │
│ └─────────┘ └─────────┘ │
└──────────────────────────────────────────────────┘
Key differences from SEV:
- TDX Module is Intel-signed software that runs on the x86 cores in SEAM mode; SEV firmware runs on the separate Secure Processor
- Secure EPT (SEPT): page tables managed by TDX Module, not VMM
- VMM cannot remap TD private memory (prevents remapping and aliasing attacks)
- Remote attestation built into Intel's attestation infrastructure
9.3 ARM CCA (Confidential Compute Architecture)
ARM CCA introduces "Realms" -- confidential VMs protected by the Realm Management Monitor (RMM):
ARM CCA Architecture:
EL3: Monitor (Root world)
│
├── Normal world Realm world
│ EL2: Hypervisor/KVM EL2: RMM (Realm Mgmt Monitor)
│ EL1: Host kernel EL1: Realm guest OS
│ EL0: Host apps EL0: Realm apps
│
└── Secure world (TrustZone)
EL1: Secure OS (OP-TEE)
EL0: Trusted apps
9.4 Spectre/Meltdown Mitigations
KVM applies CPU vulnerability mitigations on every VM entry/exit:
| Vulnerability | Mitigation | KVM Impact |
|---|---|---|
| Spectre v1 (bounds check bypass) | LFENCE barriers in kernel | Minimal |
| Spectre v2 (branch target injection) | IBRS/IBPB/retpoline | IBPB on vCPU switch (flush BTB) |
| Meltdown (rogue data cache load) | KPTI (kernel page table isolation) | Extra CR3 switch when KVM_RUN returns to the userspace VMM; in-kernel VM exit handling is unaffected |
| L1TF (L1 Terminal Fault) | L1D flush on VM entry | kvm-intel.vmentry_l1d_flush=cond (default) or always (significant perf hit) |
| MDS (Microarch Data Sampling) | VERW instruction on VM entry | Clears CPU buffers |
| Spectre v2 across SMT siblings | STIBP (Single Thread Indirect Branch Predictors) | Per-thread branch predictor isolation |
| MMIO Stale Data | VERW + buffer overwrite | On VM entry when SMT enabled |
| Retbleed | IBRS/eIBRS (Intel), return thunks or IBPB (AMD) | Extra barrier work on VM exit |
| GDS (Gather Data Sampling) | Microcode + VERW | On VM entry |
These mitigations collectively add ~5-20% overhead to VM entry/exit, depending on workload and CPU microarchitecture.
10. KVM Kernel Internals
10.1 Source Code Layout
linux/
├── virt/kvm/
│ ├── kvm_main.c -- Core KVM: VM/vCPU lifecycle, ioctls, memory slots
│ ├── eventfd.c -- IRQFD and IOEVENTFD implementation
│ ├── irqchip.c -- IRQ routing logic
│ ├── coalesced_mmio.c -- MMIO write coalescing (batch MMIO exits)
│ ├── async_pf.c -- Async page fault handling
│ ├── vfio.c -- KVM-VFIO integration (group/device tracking)
│ ├── binary_stats.c -- KVM statistics export
│ └── pfncache.c -- GPA-to-PFN caching
│
├── arch/x86/kvm/
│ ├── x86.c -- x86-specific KVM core (registers, CPUID, MSRs)
│ ├── emulate.c -- x86 instruction emulator (~7000 lines)
│ ├── cpuid.c -- CPUID handling and filtering
│ ├── irq.c -- x86 interrupt injection
│ ├── lapic.c -- In-kernel LAPIC emulation (~3000 lines)
│ ├── i8259.c -- In-kernel PIC (8259) emulation
│ ├── ioapic.c -- In-kernel IOAPIC emulation
│ ├── i8254.c -- In-kernel PIT (8254) emulation
│ ├── pmu.c -- Virtual PMU
│ ├── hyperv.c -- Hyper-V enlightenments
│ ├── xen.c -- Xen enlightenments
│ ├── debugfs.c -- Debugfs statistics
│ │
│ ├── vmx/ -- Intel VT-x implementation
│ │ ├── vmx.c -- VMX main: VMCS setup, VM enter/exit, event injection
│ │ ├── vmenter.S -- Assembly: VMLAUNCH/VMRESUME entry code
│ │ ├── vmcs.h -- VMCS field definitions
│ │ ├── capabilities.h -- VMX capability detection
│ │ ├── nested.c -- Nested VMX (L1/L2 virtualization)
│ │ ├── posted_intr.c -- Posted interrupts
│ │ ├── pmu_intel.c -- Intel PMU virtualization
│ │ └── sgx.c -- SGX virtualization
│ │
│ ├── svm/ -- AMD-V (SVM) implementation
│ │ ├── svm.c -- SVM main: VMCB setup, VMRUN/VMEXIT
│ │ ├── vmenter.S -- Assembly: VMRUN entry code
│ │ ├── nested.c -- Nested SVM
│ │ ├── sev.c -- SEV/SEV-ES/SEV-SNP
│ │ ├── avic.c -- AMD Virtual Interrupt Controller
│ │ └── pmu.c -- AMD PMU virtualization
│ │
│ └── mmu/ -- Memory management unit
│ ├── mmu.c -- Core MMU: page fault handling, SPTEs
│ ├── tdp_mmu.c -- Two-Dimensional Paging MMU (EPT/NPT direct)
│ ├── spte.c / spte.h -- Shadow/EPT page table entry construction (including MMIO SPTEs)
│ ├── paging_tmpl.h -- Guest page table walker for shadow paging
│ └── page_track.c -- Page write tracking (shadow paging and external users such as KVMGT)
│
├── arch/arm64/kvm/
│ ├── arm.c -- ARM KVM core
│ ├── handle_exit.c -- ARM VM exit dispatch
│ ├── mmu.c -- Stage-2 page tables
│ ├── sys_regs.c -- System register emulation
│ ├── vgic/ -- Virtual GIC (interrupt controller)
│ └── hyp/ -- EL2 hypervisor code
│
└── arch/riscv/kvm/
├── main.c -- RISC-V KVM core
├── vcpu_exit.c -- Exit handling
├── mmu.c -- G-stage page tables
└── vcpu_sbi.c -- SBI (Supervisor Binary Interface) emulation
10.2 The kvm_vcpu_run() Main Loop
// Heavily simplified from arch/x86/kvm/x86.c. It folds kvm_arch_vcpu_ioctl_run(),
// vcpu_run(), and vcpu_enter_guest() into one function; names vary by kernel version.
int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
{
struct kvm_run *kvm_run = vcpu->run;
fastpath_t exit_fastpath;
int r;
// Check for immediate exit request
if (kvm_run->immediate_exit)
return -EINTR;
vcpu_load(vcpu); // load vCPU state onto this physical CPU
for (;;) {
// Check for pending signals
if (signal_pending(current)) {
r = -EINTR;
break;
}
// Check for pending requests (e.g., TLB flush, clock update)
if (kvm_check_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu))
kvm_vcpu_flush_tlb_guest(vcpu);
if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu))
kvm_guest_time_update(vcpu);
// ... many more request types
// Inject pending events (interrupts, exceptions, NMIs).
// This is an x86.c helper; newer kernels call it kvm_check_and_inject_events().
r = inject_pending_event(vcpu);
if (r < 0)
break;
// --- THE CRITICAL SECTION ---
// Prepare for VM entry
preempt_disable();
kvm_x86_ops.prepare_switch_to_guest(vcpu);
local_irq_disable();
// Actually enter guest mode (VMLAUNCH/VMRESUME or VMRUN)
// This is the assembly code in vmenter.S
exit_fastpath = kvm_x86_ops.vcpu_run(vcpu);
// We're back from VM exit; remaining host state is restored lazily
local_irq_enable();
preempt_enable();
// Handle the VM exit
r = kvm_x86_ops.handle_exit(vcpu, exit_fastpath);
if (r <= 0) {
// Exit to userspace (r == 0) or error (r < 0)
break;
}
// r > 0 means handle internally, re-enter guest
}
vcpu_put(vcpu);
return r;
}
10.3 VMX Entry/Exit Assembly
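; From arch/x86/kvm/vmx/vmenter.S (heavily simplified and shown in Intel syntax;
; the kernel source uses AT&T syntax)
; __vmx_vcpu_run(struct vcpu_vmx *vmx, unsigned long *regs, unsigned int flags)
SYM_FUNC_START(__vmx_vcpu_run)
; Save host callee-saved registers
push rbp
push r15
push r14
push r13
push r12
push rbx
; Keep the regs pointer on the host stack: every GPR is about to hold a guest value
push rsi
; Choose VMLAUNCH or VMRESUME before guest registers are loaded
; (flags arrives in edx; MOV does not modify the resulting ZF)
test edx, VMX_RUN_VMRESUME
; Load guest registers from the regs array
mov rax, [rsi + VCPU_RAX]
mov rbx, [rsi + VCPU_RBX]
mov rcx, [rsi + VCPU_RCX]
mov rdx, [rsi + VCPU_RDX]
mov rbp, [rsi + VCPU_RBP]
mov r8, [rsi + VCPU_R8]
; ... r9-r15, rdi, rsi (rsi loaded last since it holds the pointer)
; Enter guest mode
jnz .Lvmresume
vmlaunch ; First entry
jmp .Lvmfail ; vmlaunch failed (CF or ZF set)
.Lvmresume:
vmresume ; Subsequent entries
jmp .Lvmfail ; vmresume failed
; --- VM EXIT LANDS HERE ---
; (hardware automatically restores host RSP, RIP from VMCS host-state area,
; but general-purpose registers still contain guest values)
.Lvmexit:
; Park guest RAX on the stack, recover the regs pointer, then save guest state
push rax
mov rax, [rsp + 8] ; regs pointer pushed before entry
pop qword [rax + VCPU_RAX]
mov [rax + VCPU_RBX], rbx
mov [rax + VCPU_RCX], rcx
; ... all remaining registers
xor eax, eax ; return 0: the guest ran and exited normally
.Lout:
add rsp, 8 ; discard the saved regs pointer
; Restore host callee-saved registers
pop rbx
pop r12
pop r13
pop r14
pop r15
pop rbp
; Post-exit speculation hardening (RSB stuffing, restoring host SPEC_CTRL)
; happens around here; the L1D flush and VERW for L1TF/MDS run before VM entry
ret
.Lvmfail:
mov eax, 1 ; return 1: VMLAUNCH/VMRESUME itself failed
jmp .Lout
SYM_FUNC_END(__vmx_vcpu_run)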
10.4 VM Exit Handling Dispatch (Intel VMX)
// Abridged from arch/x86/kvm/vmx/vmx.c. Handler names are illustrative: they differ
// between kernel versions, and several exits go to shared kvm_emulate_*() helpers.
static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
[EXIT_REASON_EXCEPTION_NMI] = handle_exception_nmi,
[EXIT_REASON_EXTERNAL_INTERRUPT] = handle_external_interrupt,
[EXIT_REASON_TRIPLE_FAULT] = handle_triple_fault,
[EXIT_REASON_IO_INSTRUCTION] = handle_io,
[EXIT_REASON_CR_ACCESS] = handle_cr,
[EXIT_REASON_DR_ACCESS] = handle_dr,
[EXIT_REASON_CPUID] = handle_cpuid,
[EXIT_REASON_MSR_READ] = handle_rdmsr,
[EXIT_REASON_MSR_WRITE] = handle_wrmsr,
[EXIT_REASON_INTERRUPT_WINDOW] = handle_interrupt_window,
[EXIT_REASON_HLT] = handle_halt,
[EXIT_REASON_INVLPG] = handle_invlpg,
[EXIT_REASON_VMCALL] = handle_vmcall,
[EXIT_REASON_EPT_VIOLATION] = handle_ept_violation,
[EXIT_REASON_EPT_MISCONFIG] = handle_ept_misconfig,
[EXIT_REASON_PAUSE_INSTRUCTION] = handle_pause,
[EXIT_REASON_PREEMPTION_TIMER] = handle_preemption_timer,
[EXIT_REASON_XSETBV] = handle_xsetbv,
[EXIT_REASON_APIC_ACCESS] = handle_apic_access,
[EXIT_REASON_APIC_WRITE] = handle_apic_write,
[EXIT_REASON_EOI_INDUCED] = handle_apic_eoi_induced,
// ... ~50 total handlers
};
static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
u32 exit_reason = vmx->exit_reason.basic;
// Exits already completed by the fastpath (e.g. a TSC-deadline MSR write handled
// with IRQs still disabled) need no further work: return 1 to re-enter the guest
if (exit_fastpath != EXIT_FASTPATH_NONE)
return 1;
// Exit reasons that commonly dominate typical workloads:
// 1. EPT_VIOLATION -- guest accessed unmapped memory
// 2. IO_INSTRUCTION -- guest did IN/OUT
// 3. HLT -- guest is idle
// 4. CPUID -- guest queried CPU features
// 5. MSR_READ/WRITE -- guest accessed MSRs
// 6. EXTERNAL_INTERRUPT -- host interrupt while in guest mode
// 7. EPT_MISCONFIG -- deliberately misconfigured EPT entry that KVM uses to mark MMIO
// 8. PREEMPTION_TIMER -- VMX preemption timer fired
// Unknown or unhandled exit reason: report an internal error to userspace
if (exit_reason >= ARRAY_SIZE(kvm_vmx_exit_handlers) ||
!kvm_vmx_exit_handlers[exit_reason]) {
vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON;
return 0;
}
return kvm_vmx_exit_handlers[exit_reason](vcpu);
}
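The same table-driven pattern is easy to try outside the kernel. Below is a toy sketch of a bounds-checked dispatch table using designated initializers; every name in it (`toy_vcpu`, `toy_dispatch`, the `TOY_EXIT_*` values) is made up for illustration. It follows the convention described above: a positive return means "re-enter the guest", zero means "return to userspace", negative means error.

```c
#include <stddef.h>

enum toy_exit {
    TOY_EXIT_IO    = 0,
    TOY_EXIT_HLT   = 1,
    TOY_EXIT_CPUID = 2,
    TOY_EXIT_MAX   = 8,
};

struct toy_vcpu {
    unsigned long handled;  /* number of exits dispatched so far */
};

typedef int (*toy_handler)(struct toy_vcpu *vcpu);

/* Port I/O needs the userspace device model: return 0. */
static int toy_handle_io(struct toy_vcpu *vcpu)    { vcpu->handled++; return 0; }
/* CPUID is emulated entirely in the "kernel": return 1 to re-enter. */
static int toy_handle_cpuid(struct toy_vcpu *vcpu) { vcpu->handled++; return 1; }
/* HLT is treated here as handled in-kernel as well. */
static int toy_handle_hlt(struct toy_vcpu *vcpu)   { vcpu->handled++; return 1; }

/* Sparse table: indices without an initializer are NULL. */
static const toy_handler toy_handlers[TOY_EXIT_MAX] = {
    [TOY_EXIT_IO]    = toy_handle_io,
    [TOY_EXIT_HLT]   = toy_handle_hlt,
    [TOY_EXIT_CPUID] = toy_handle_cpuid,
};

static int toy_dispatch(struct toy_vcpu *vcpu, unsigned reason)
{
    if (reason >= TOY_EXIT_MAX || toy_handlers[reason] == NULL)
        return -1;          /* unexpected exit reason */
    return toy_handlers[reason](vcpu);
}
```

The bounds and NULL checks matter because the exit reason comes from hardware and the table is sparse; indexing it blindly would call through a NULL pointer for any reason without a handler.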
10.5 Timer Emulation
KVM emulates several timers:
Timer Hierarchy in a KVM Guest:
┌────────────────────────────────────────────────────────────┐
│ LAPIC Timer (per-vCPU) │
│ - Most commonly used by modern guest kernels │
│ - Three modes: │
│ 1. One-shot: fire once after N ticks │
│ 2. Periodic: fire every N ticks │
│ 3. TSC-deadline: fire when TSC reaches value X │
│ - KVM uses host hrtimers to implement │
│ - TSC-deadline mode is most precise (~1ns resolution) │
│ - KVM exposes via CPUID and handles in lapic.c │
├────────────────────────────────────────────────────────────┤
│ PIT (i8254, in-kernel) │
│ - Legacy timer, used during boot (BIOS) │
│ - ~1.19318 MHz clock, channel 0 generates IRQ 0 │
│ - KVM emulates in i8254.c, uses host hrtimers │
│ - Only needed for legacy boot; UEFI guests don't use it │
├────────────────────────────────────────────────────────────┤
│ HPET (High Precision Event Timer) │
│ - Optional, emulated in QEMU (not in-kernel KVM) │
│ - ~10 MHz or higher, MMIO-based (at 0xFED00000) │
│ - Used by some guest OSes as a fallback timer │
├────────────────────────────────────────────────────────────┤
│ RTC/CMOS (MC146818) │
│ - Emulated in QEMU (port I/O at 0x70/0x71, not MMIO) │
│ - Provides date/time and periodic interrupts (IRQ 8) │
│ - 32.768 kHz crystal, 2 Hz-8192 Hz programmable │
└────────────────────────────────────────────────────────────┘
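The PIT and RTC rates quoted above come from integer dividers, which a device model has to reproduce. The sketch below shows the arithmetic; the helper names are hypothetical, 1193182 Hz is the conventional exact value of the ~1.19318 MHz PIT input clock, and the RTC helper only models rate-select values 3 to 15.

```c
#include <stdint.h>

#define PIT_INPUT_HZ 1193182u   /* i8254 input clock, about 1.19318 MHz */

/* Channel 0 reload value that gives approximately the requested IRQ 0 rate.
 * The counter is 16 bits wide; a programmed value of 0 means 65536. */
static uint32_t pit_reload_for_hz(uint32_t hz)
{
    if (hz == 0)
        return 65536;                       /* slowest possible rate */
    uint32_t reload = (PIT_INPUT_HZ + hz / 2) / hz;  /* round to nearest */
    if (reload < 1)
        reload = 1;
    if (reload > 65536)
        reload = 65536;
    return reload;
}

/* Periodic interrupt frequency of an MC146818-style RTC for a given
 * rate-select value. Returns 0 for values this sketch does not model. */
static uint32_t rtc_periodic_hz(unsigned rate_select)
{
    if (rate_select < 3 || rate_select > 15)
        return 0;
    return 32768u >> (rate_select - 1);
}
```

A guest asking for a 100 Hz tick therefore programs a reload value of 11932, and the emulated timer's period in nanoseconds follows as `reload * 1e9 / PIT_INPUT_HZ`, which is what the host hrtimer gets armed with.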
10.6 KVM Tracepoints
KVM provides extensive tracepoints for debugging and performance analysis:
# List all KVM tracepoints
ls /sys/kernel/debug/tracing/events/kvm/
# Key tracepoints:
# kvm_entry -- VM entry (guest mode start)
# kvm_exit -- VM exit (with reason)
# kvm_mmio -- MMIO access
# kvm_pio -- Port I/O access
# kvm_cr -- Control register access
# kvm_msr -- MSR read/write
# kvm_page_fault -- EPT/NPT fault
# kvm_inj_virq -- Virtual interrupt injection
# kvm_apic -- APIC events
# kvm_halt_poll_ns -- Halt-poll timing
# kvm_nested_vmexit -- Nested VM exit
# Enable a tracepoint:
echo 1 > /sys/kernel/debug/tracing/events/kvm/kvm_exit/enable
# Read trace:
cat /sys/kernel/debug/tracing/trace
# Using perf:
perf stat -e 'kvm:kvm_exit' -a sleep 5 # count VM exits
perf record -e 'kvm:kvm_exit' -a sleep 5 # record VM exit events
perf script # decode events
# Using trace-cmd:
trace-cmd record -e kvm -p function_graph sleep 5
trace-cmd report
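# KVM statistics via debugfs (aggregate counters; per-VM directories are named <pid>-<vm fd>):
cat /sys/kernel/debug/kvm/exits
# Or via the binary stats interface: KVM_GET_STATS_FD returns a file descriptor that is read with read()/pread()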
11. Building a Minimal VMM
11.1 Complete Real-Mode VMM in C
This is a complete, compilable program that creates a VM and runs x86 real-mode code:
/* minimal_vmm.c -- A complete KVM-based VMM that runs real-mode code.
*
* The guest code writes "Hello, KVM!\n" to serial port 0x3F8
* character by character, then halts.
*
* Compile: gcc -o minimal_vmm minimal_vmm.c
* Run: sudo ./minimal_vmm
*
* Requirements: /dev/kvm accessible, Intel VT-x or AMD-V enabled in BIOS.
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>
#include <stdint.h>
#include <errno.h>
#define GUEST_MEM_SIZE (1 << 20) /* 1 MB */
/* Guest code (x86 real mode assembly), loaded at guest physical 0x1000.
* OUT with an immediate port operand only reaches ports 0-255, so COM1
* (0x3F8) is addressed through DX.
*
* mov dx, 0x3F8 ; COM1 port
* mov si, msg ; point to message
* loop:
* lodsb ; load next byte into AL, increment SI
* or al, al ; check for null terminator
* jz halt ; if zero, done
* out dx, al ; write character to COM1
* jmp loop
* halt:
* hlt ; halt CPU
* msg:
* db "Hello, KVM!", 0x0A, 0x00
*/
const uint8_t guest_code_v2[] = {
/* Offsets are relative to the load address 0x1000. The code is not
position-independent: SI is loaded with the absolute address of msg. */
0xBA, 0xF8, 0x03, /* 0x00: mov dx, 0x3F8 ; COM1 port */
0xBE, 0x0F, 0x10, /* 0x03: mov si, 0x100F ; address of msg */
/* loop: */
0xAC, /* 0x06: lodsb ; AL = [SI++] */
0x08, 0xC0, /* 0x07: or al, al ; test for null */
0x74, 0x03, /* 0x09: jz halt ; 0x0B + 3 = 0x0E */
0xEE, /* 0x0B: out dx, al ; write to COM1 */
0xEB, 0xF8, /* 0x0C: jmp loop ; 0x0E - 8 = 0x06 */
/* halt: */
0xF4, /* 0x0E: hlt ; halt CPU */
/* msg: (at offset 0x0F from start, so at address 0x100F) */
'H','e','l','l','o',',',' ','K','V','M','!','\n', 0x00,
};
int main(void)
{
int kvm_fd, vm_fd, vcpu_fd;
int vcpu_mmap_size;
struct kvm_run *run;
void *guest_mem;
struct kvm_sregs sregs;
struct kvm_regs regs;
/* 1. Open /dev/kvm */
kvm_fd = open("/dev/kvm", O_RDWR | O_CLOEXEC);
if (kvm_fd < 0) { perror("open /dev/kvm"); return 1; }
/* 2. Check API version */
int api = ioctl(kvm_fd, KVM_GET_API_VERSION, 0);
if (api != 12) { fprintf(stderr, "KVM API version %d != 12\n", api); return 1; }
/* 3. Create VM */
vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);
if (vm_fd < 0) { perror("KVM_CREATE_VM"); return 1; }
/* 4. Set TSS address (required on Intel hosts: KVM reserves three pages of guest-physical space for its real-mode emulation) */
if (ioctl(vm_fd, KVM_SET_TSS_ADDR, 0xFFFBD000) < 0) {
perror("KVM_SET_TSS_ADDR");
/* Non-fatal on AMD, but required on Intel */
}
/* 5. Allocate guest memory */
guest_mem = mmap(NULL, GUEST_MEM_SIZE,
PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
-1, 0);
if (guest_mem == MAP_FAILED) { perror("mmap guest memory"); return 1; }
/* 6. Load guest code at address 0x1000 */
memcpy((uint8_t *)guest_mem + 0x1000, guest_code_v2, sizeof(guest_code_v2));
/* 7. Register guest memory with KVM */
struct kvm_userspace_memory_region region = {
.slot = 0,
.flags = 0,
.guest_phys_addr = 0,
.memory_size = GUEST_MEM_SIZE,
.userspace_addr = (uint64_t)guest_mem,
};
if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region) < 0) {
perror("KVM_SET_USER_MEMORY_REGION"); return 1;
}
/* 8. Create vCPU */
vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);
if (vcpu_fd < 0) { perror("KVM_CREATE_VCPU"); return 1; }
/* 9. mmap the kvm_run structure */
vcpu_mmap_size = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0);
run = mmap(NULL, vcpu_mmap_size, PROT_READ | PROT_WRITE,
MAP_SHARED, vcpu_fd, 0);
if (run == MAP_FAILED) { perror("mmap kvm_run"); return 1; }
/* 10. Set up special registers (real mode) */
if (ioctl(vcpu_fd, KVM_GET_SREGS, &sregs) < 0) {
perror("KVM_GET_SREGS"); return 1;
}
/* Real mode: CS.base = 0, CS.selector = 0 */
sregs.cs.base = 0;
sregs.cs.selector = 0;
if (ioctl(vcpu_fd, KVM_SET_SREGS, &sregs) < 0) {
perror("KVM_SET_SREGS"); return 1;
}
/* 11. Set up general registers */
memset(&regs, 0, sizeof(regs));
regs.rip = 0x1000; /* Start executing at 0x1000 */
regs.rflags = 0x2; /* Bit 1 must be set (reserved, always 1) */
if (ioctl(vcpu_fd, KVM_SET_REGS, &regs) < 0) {
perror("KVM_SET_REGS"); return 1;
}
/* 12. Run the vCPU */
printf("Running guest...\n");
for (;;) {
int ret = ioctl(vcpu_fd, KVM_RUN, 0);
if (ret < 0) {
if (errno == EINTR) continue; /* signal, retry */
perror("KVM_RUN");
break;
}
switch (run->exit_reason) {
case KVM_EXIT_IO:
if (run->io.direction == KVM_EXIT_IO_OUT &&
run->io.port == 0x3F8 &&
run->io.size == 1) {
/* Guest wrote to COM1 -- print the character */
uint8_t *data = (uint8_t *)run + run->io.data_offset;
write(STDOUT_FILENO, data, 1);
}
break;
case KVM_EXIT_HLT:
printf("\nGuest halted.\n");
goto done;
case KVM_EXIT_SHUTDOWN:
printf("Guest shutdown (triple fault).\n");
goto done;
case KVM_EXIT_FAIL_ENTRY:
fprintf(stderr, "FAIL_ENTRY: reason=0x%llx\n",
(unsigned long long)run->fail_entry.hardware_entry_failure_reason);
goto done;
case KVM_EXIT_INTERNAL_ERROR:
fprintf(stderr, "INTERNAL_ERROR: suberror=%d\n",
run->internal.suberror);
goto done;
default:
fprintf(stderr, "Unexpected exit reason: %d\n", run->exit_reason);
goto done;
}
}
done:
/* 13. Cleanup */
munmap(run, vcpu_mmap_size);
close(vcpu_fd);
munmap(guest_mem, GUEST_MEM_SIZE);
close(vm_fd);
close(kvm_fd);
return 0;
}
11.2 Setting Up Protected Mode
To run 32-bit protected mode guest code:
/* After KVM_GET_SREGS, modify sregs for protected mode: */
/* 1. Set up a GDT (Global Descriptor Table) in guest memory */
struct gdt_entry {
uint16_t limit_low;
uint16_t base_low;
uint8_t base_mid;
uint8_t access;
uint8_t flags_limit_high;
uint8_t base_high;
} __attribute__((packed));
/* Place GDT at physical address 0x0 */
struct gdt_entry *gdt = (struct gdt_entry *)guest_mem;
/* Entry 0: null descriptor (required) */
gdt[0] = (struct gdt_entry){0};
/* Entry 1: code segment (CS) -- base=0, limit=4GB, 32-bit, ring 0 */
gdt[1] = (struct gdt_entry){
.limit_low = 0xFFFF,
.base_low = 0, .base_mid = 0, .base_high = 0,
.access = 0x9A, /* present, ring 0, code, readable */
.flags_limit_high = 0xCF, /* 4KB granularity, 32-bit, limit 0xFFFFF */
};
/* Entry 2: data segment (DS/SS/ES) -- base=0, limit=4GB, ring 0 */
gdt[2] = (struct gdt_entry){
.limit_low = 0xFFFF,
.base_low = 0, .base_mid = 0, .base_high = 0,
.access = 0x92, /* present, ring 0, data, writable */
.flags_limit_high = 0xCF,
};
/* 2. Configure SREGS for protected mode */
sregs.gdt.base = 0x0; /* GDT base address */
sregs.gdt.limit = sizeof(struct gdt_entry) * 3 - 1;
/* Code segment: selector 0x08 (GDT entry 1) */
sregs.cs.selector = 0x08;
sregs.cs.base = 0;
sregs.cs.limit = 0xFFFFFFFF;
sregs.cs.type = 0x0B; /* code, execute/read, accessed */
sregs.cs.present = 1;
sregs.cs.dpl = 0;
sregs.cs.db = 1; /* 32-bit */
sregs.cs.s = 1; /* code/data segment */
sregs.cs.l = 0; /* not 64-bit */
sregs.cs.g = 1; /* 4KB granularity */
/* Data segments: selector 0x10 (GDT entry 2) */
sregs.ds = sregs.es = sregs.ss = (struct kvm_segment){
.selector = 0x10,
.base = 0,
.limit = 0xFFFFFFFF,
.type = 0x03, /* data, read/write, accessed */
.present = 1,
.dpl = 0,
.db = 1,
.s = 1,
.g = 1,
};
/* Enable protected mode */
sregs.cr0 |= 1; /* CR0.PE = 1 */
ioctl(vcpu_fd, KVM_SET_SREGS, &sregs);
11.3 Setting Up Long Mode (64-bit)
/* 64-bit mode requires:
* 1. CR0.PE = 1 (protected mode)
* 2. CR0.PG = 1 (paging enabled)
* 3. CR4.PAE = 1 (Physical Address Extension)
* 4. EFER.LME = 1 (Long Mode Enable)
* 5. EFER.LMA = 1 (Long Mode Active -- set by hardware when PG=1 and LME=1)
* 6. CR3 points to a valid PML4 page table
* 7. CS.L = 1 (64-bit code segment)
*/
/* Set up identity-mapped page tables in guest memory */
/* Place at physical address 0x2000 */
#define PML4_ADDR 0x2000
#define PDPT_ADDR 0x3000
#define PD_ADDR 0x4000
/* PML4[0] -> PDPT */
uint64_t *pml4 = (uint64_t *)((uint8_t *)guest_mem + PML4_ADDR);
pml4[0] = PDPT_ADDR | 0x03; /* present, writable */
/* PDPT[0] -> PD */
uint64_t *pdpt = (uint64_t *)((uint8_t *)guest_mem + PDPT_ADDR);
pdpt[0] = PD_ADDR | 0x03; /* present, writable */
/* PD: map first 1GB using 2MB pages (identity mapped) */
uint64_t *pd = (uint64_t *)((uint8_t *)guest_mem + PD_ADDR);
for (int i = 0; i < 512; i++) {
pd[i] = (i * (2ULL << 20)) | 0x83; /* present, writable, 2MB page */
}
/* Configure SREGS for 64-bit mode */
sregs.cr3 = PML4_ADDR;
sregs.cr4 |= (1 << 5); /* CR4.PAE = 1 */
sregs.cr0 |= (1 << 0); /* CR0.PE = 1 */
sregs.cr0 |= (1ULL << 31); /* CR0.PG = 1 */
sregs.efer |= (1 << 8); /* EFER.LME = 1 */
sregs.efer |= (1 << 10); /* EFER.LMA = 1 */
/* 64-bit code segment: L=1, D=0 */
sregs.cs.selector = 0x08;
sregs.cs.base = 0;
sregs.cs.limit = 0xFFFFFFFF;
sregs.cs.type = 0x0B;
sregs.cs.present = 1;
sregs.cs.dpl = 0;
sregs.cs.db = 0; /* Must be 0 in long mode */
sregs.cs.s = 1;
sregs.cs.l = 1; /* 64-bit mode */
sregs.cs.g = 1;
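ioctl(vcpu_fd, KVM_SET_SREGS, &sregs); /* applies CR0/CR3/CR4/EFER and the segments above */
/* KVM_SET_SREGS already loads EFER. KVM_SET_MSRS is only needed for extra EFER
bits such as SCE (SYSCALL enable); keep LME and LMA set so long mode stays active. */
struct {
struct kvm_msrs header;
struct kvm_msr_entry entries[1];
} msrs = {
.header.nmsrs = 1,
.entries[0] = {
.index = 0xC0000080, /* IA32_EFER */
.data = (1 << 8) | (1 << 10) | (1 << 0), /* LME | LMA | SCE */
},
};
ioctl(vcpu_fd, KVM_SET_MSRS, &msrs);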
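A wrong bit in the page tables shows up as a triple fault with little diagnostic information, so it can help to walk the tables in the VMM before the first KVM_RUN. The sketch below rebuilds the same identity map in a plain buffer and translates an address through it; `build_identity_map` and `walk_2m` are hypothetical helpers that assume the 0x2000/0x3000/0x4000 table layout used above and only handle 2MB pages.

```c
#include <stdint.h>
#include <string.h>

#define PTE_PRESENT   0x001ULL
#define PTE_WRITABLE  0x002ULL
#define PTE_PAGE_SIZE 0x080ULL               /* PS bit: 2MB page in a PDE */
#define PTE_ADDR_MASK 0x000FFFFFFFFFF000ULL

static uint64_t read_entry(const uint8_t *mem, uint64_t table, unsigned idx)
{
    uint64_t e;
    memcpy(&e, mem + table + idx * 8ULL, sizeof(e));
    return e;
}

static void write_entry(uint8_t *mem, uint64_t table, unsigned idx, uint64_t e)
{
    memcpy(mem + table + idx * 8ULL, &e, sizeof(e));
}

/* PML4 at 0x2000, PDPT at 0x3000, PD at 0x4000: first 1GB, 2MB pages.
 * mem must be zero-initialized and at least 0x5000 bytes. */
static void build_identity_map(uint8_t *mem)
{
    write_entry(mem, 0x2000, 0, 0x3000 | PTE_PRESENT | PTE_WRITABLE);
    write_entry(mem, 0x3000, 0, 0x4000 | PTE_PRESENT | PTE_WRITABLE);
    for (unsigned i = 0; i < 512; i++)
        write_entry(mem, 0x4000, i,
                    ((uint64_t)i << 21) | PTE_PRESENT | PTE_WRITABLE | PTE_PAGE_SIZE);
}

/* Translate va through 4-level tables; UINT64_MAX if not mapped by a 2MB page. */
static uint64_t walk_2m(const uint8_t *mem, uint64_t cr3, uint64_t va)
{
    uint64_t pml4e = read_entry(mem, cr3 & PTE_ADDR_MASK, (va >> 39) & 0x1FF);
    if (!(pml4e & PTE_PRESENT))
        return UINT64_MAX;
    uint64_t pdpte = read_entry(mem, pml4e & PTE_ADDR_MASK, (va >> 30) & 0x1FF);
    if (!(pdpte & PTE_PRESENT))
        return UINT64_MAX;
    uint64_t pde = read_entry(mem, pdpte & PTE_ADDR_MASK, (va >> 21) & 0x1FF);
    if (!(pde & PTE_PRESENT) || !(pde & PTE_PAGE_SIZE))
        return UINT64_MAX;
    return (pde & 0x000FFFFFFFE00000ULL) | (va & 0x1FFFFF);
}
```

An identity map should translate every address below 1GB to itself and report everything above as unmapped. A successful walk only proves the tables are well-formed; the translated address must also fall inside a registered memory slot to be usable by the guest.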
11.4 Production VMM Architecture
How real VMMs (Firecracker, crosvm, Cloud Hypervisor) are structured:
Production VMM Architecture (Firecracker):
┌──────────────────────────────────────────────────────────────────┐
│ main() │
│ ├── Parse command line / API config │
│ ├── Open /dev/kvm │
│ ├── Create VM (KVM_CREATE_VM) │
│ ├── Configure VM: │
│ │ ├── Set TSS address │
│ │ ├── Create in-kernel IRQCHIP │
│ │ ├── Set up memory regions │
│ │ └── Create IRQFD/IOEVENTFD bindings │
│ ├── Load kernel (bzImage/PE) into guest memory │
│ │ ├── Parse kernel header (boot_params) │
│ │ ├── Set up boot_params (zero page) at 0x7000 │
│ │ ├── Load kernel at 0x100000 (1MB, default load address) │
│ │ ├── Load initrd after kernel │
│ │ └── Set up kernel command line │
│ ├── Create vCPUs: │
│ │ ├── KVM_CREATE_VCPU for each │
│ │ ├── Set CPUID, SREGS, REGS, MSRs │
│ │ ├── For BSP (vCPU 0): set RIP to kernel entry │
│ │ └── For APs: wait for INIT-SIPI-SIPI sequence │
│ ├── Create device manager: │
│ │ ├── Serial (UART 16550) │
│ │ ├── virtio-net (via MMIO transport) │
│ │ ├── virtio-block │
│ │ ├── virtio-vsock │
│ │ └── virtio-balloon │
│ ├── Start vCPU threads: │
│ │ └── Each thread runs: │
│ │ loop { │
│ │ KVM_RUN │
│ │ handle_exit(run) │
│ │ } │
│ └── Main thread: event loop (epoll) │
│ ├── API socket (Unix domain socket for REST API) │
│ ├── IOEVENTFD events (virtio doorbells) │
│ ├── IRQFD events (interrupt injection) │
│ ├── Timer events (rate limiting) │
│ └── Signal handling │
└──────────────────────────────────────────────────────────────────┘
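The main-thread event loop in the diagram can be sketched with epoll and eventfd alone. In this minimal example the eventfds are signalled from user space; in a real VMM they would be the descriptors registered with KVM through KVM_IOEVENTFD, so that a guest doorbell write makes them readable. The function names watch_fd and poll_once and the tag scheme are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Register fd with the epoll instance; tag identifies the event source
 * (for example, an index into the VMM's device table). */
int watch_fd(int epfd, int fd, uint32_t tag)
{
    struct epoll_event ev = { .events = EPOLLIN, .data.u32 = tag };
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Wait for one event, drain the eventfd counter so it stops being
 * readable, and return the tag. Returns -1 on timeout or error. */
int poll_once(int epfd, const int *fds_by_tag, int timeout_ms)
{
    struct epoll_event ev;
    if (epoll_wait(epfd, &ev, 1, timeout_ms) != 1)
        return -1;

    uint64_t count;
    int fd = fds_by_tag[ev.data.u32];
    if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
        return -1;

    /* A real VMM would now dispatch to the device that owns this tag,
     * e.g. process the virtqueue whose doorbell was rung. */
    return (int)ev.data.u32;
}
```

The vCPU threads never touch this loop: they block in KVM_RUN while the kernel signals the eventfd on their behalf, which is what keeps doorbell writes from turning into user-space exits.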
Linux Boot Protocol in a VMM (x86_64):
Guest memory layout for Linux kernel boot:
0x00000000 - 0x00000FFF : Real-mode IVT (not used in 64-bit direct boot)
0x00007000 - 0x00007FFF : boot_params (zero page)
0x00008000 - 0x0000FFFF : Kernel command line (null-terminated)
0x00010000 - 0x0001FFFF : GDT, page tables (set up by VMM)
0x00100000 - 0x???????? : Kernel (bzImage loaded here, 1MB+)
0x???????? - 0x???????? : initrd (loaded after kernel)
0x???????? - 0xBFFFFFFF : Free (guest RAM)
0xC0000000 - 0xFEBFFFFF : PCI MMIO space (not backed by RAM)
0xFEC00000 - 0xFEC00FFF : IOAPIC
0xFEE00000 - 0xFEE00FFF : LAPIC
0xFFFC0000 - 0xFFFFFFFF : BIOS ROM (if needed)
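The layout above leaves the initrd address open because it depends on the kernel size. A minimal sketch of one placement policy follows: put the initrd on the first page boundary after the kernel and reject it if it would run past the end of RAM or into the PCI MMIO gap. The constant and function names are hypothetical, and the policy is an assumption; a real loader should also honor the initrd_addr_max field of the kernel setup header.

```c
#include <assert.h>
#include <stdint.h>

#define KERNEL_LOAD_ADDR 0x00100000ULL  /* 1MB, from the layout above */
#define MMIO_GAP_START   0xC0000000ULL  /* start of PCI MMIO space */
#define PAGE_SIZE_4K     0x1000ULL

/* Round x up to the next 4KB boundary. */
uint64_t page_align_up(uint64_t x)
{
    return (x + PAGE_SIZE_4K - 1) & ~(PAGE_SIZE_4K - 1);
}

/* Returns the guest-physical load address for an initrd placed directly
 * after the kernel, or 0 if it does not fit in RAM below the MMIO gap. */
uint64_t place_initrd(uint64_t kernel_size, uint64_t initrd_size,
                      uint64_t ram_size)
{
    /* RAM above the gap is ignored in this sketch. */
    uint64_t ram_end = ram_size < MMIO_GAP_START ? ram_size : MMIO_GAP_START;

    if (initrd_size == 0 || kernel_size >= ram_end)
        return 0;

    uint64_t start = page_align_up(KERNEL_LOAD_ADDR + kernel_size);
    if (start >= ram_end || initrd_size > ram_end - start)
        return 0;
    return start;
}
```

The VMM would then write the returned address and the initrd size into the ramdisk_image and ramdisk_size fields of the setup header inside boot_params.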
11.5 The rust-vmm Ecosystem
Production Rust VMMs use the rust-vmm crate ecosystem:
rust-vmm Crate Dependency Graph:
┌────────────────────────┐
│ Your VMM │
│ (Firecracker, Cloud │
│ Hypervisor, crosvm) │
└────────┬───────────────┘
│ uses
┌────────┴───────────────────────────────────────────────────┐
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌─────────────────┐ │
│ │ kvm-ioctls │ │ kvm-bindings │ │ vmm-sys-util │ │
│ │ │ │ │ │ │ │
│ │ Safe Rust │ │ Auto-gen'd │ │ EventFd, │ │
│ │ wrappers │ │ KVM structs │ │ Terminal, │ │
│ │ for KVM │ │ from kernel │ │ TempFile, │ │
│ │ ioctls │ │ headers │ │ signal handling │ │
│ └──────────────┘ └──────────────┘ └─────────────────┘ │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌─────────────────┐ │
│ │ vm-memory │ │ vm-virtio │ │ vm-superio │ │
│ │ │ │ │ │ │ │
│ │ Guest mem │ │ Virtqueue │ │ Serial (16550) │ │
│ │ abstraction │ │ impl, │ │ i8042 keyboard │ │
│ │ (GuestMem, │ │ descriptor │ │ RTC (MC146818) │ │
│ │ MmapRegion) │ │ chain iter │ │ │ │
│ └──────────────┘ └──────────────┘ └─────────────────┘ │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌─────────────────┐ │
│ │ vhost │ │ linux-loader │ │ event-manager │ │
│ │ │ │ │ │ │ │
│ │ vhost-user │ │ Load bzImage │ │ epoll-based │ │
│ │ protocol, │ │ /PE/ELF into │ │ event loop │ │
│ │ vhost-kern │ │ guest memory │ │ (MutEventSubsc │ │
│ │ │ │ │ │ riber trait) │ │
│ └──────────────┘ └──────────────┘ └─────────────────┘ │
└────────────────────────────────────────────────────────────┘
Example: Creating a VM with kvm-ioctls (Rust):
use kvm_ioctls::{Kvm, VcpuExit};
use kvm_bindings::{kvm_userspace_memory_region, kvm_regs};
fn main() {
// 1. Open /dev/kvm
let kvm = Kvm::new().expect("Failed to open /dev/kvm");
// 2. Create VM
let vm = kvm.create_vm().expect("Failed to create VM");
// 3. Set TSS address (x86)
vm.set_tss_address(0xFFFB_D000).expect("Failed to set TSS");
// 4. Allocate and register guest memory
let guest_mem = unsafe {
libc::mmap(
std::ptr::null_mut(),
1 << 20, // 1MB
libc::PROT_READ | libc::PROT_WRITE,
libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
-1,
0,
)
};
let mem_region = kvm_userspace_memory_region {
slot: 0,
guest_phys_addr: 0,
memory_size: 1 << 20,
userspace_addr: guest_mem as u64,
flags: 0,
};
unsafe { vm.set_user_memory_region(mem_region).unwrap() };
// 5. Load guest code
let code: &[u8] = &[0xBA, 0xF8, 0x03, /* mov dx, 0x3F8 */
0xB0, 0x41, /* mov al, 'A' */
0xEE, /* out dx, al */
0xF4]; /* hlt */
unsafe {
let dest = (guest_mem as *mut u8).add(0x1000);
std::ptr::copy_nonoverlapping(code.as_ptr(), dest, code.len());
}
// 6. Create vCPU
let mut vcpu = vm.create_vcpu(0).expect("Failed to create vCPU");
// 7. Set registers
let mut sregs = vcpu.get_sregs().unwrap();
sregs.cs.base = 0;
sregs.cs.selector = 0;
vcpu.set_sregs(&sregs).unwrap();
let regs = kvm_regs {
rip: 0x1000,
rflags: 0x2,
..Default::default()
};
vcpu.set_regs(&regs).unwrap();
// 8. Run
loop {
match vcpu.run().expect("KVM_RUN failed") {
VcpuExit::IoOut(port, data) => {
if port == 0x3F8 {
print!("{}", data[0] as char);
}
}
VcpuExit::Hlt => {
println!("\nGuest halted");
break;
}
exit => {
eprintln!("Unexpected exit: {:?}", exit);
break;
}
}
}
}
12. Advanced Topics
12.1 Nested Virtualization
Nested virtualization allows running a hypervisor inside a VM (L1 guest runs L2 guests):
Nested Virtualization Layers:
L0 (bare metal host)
├── KVM (actual hardware VMX/SVM)
│
└── L1 (guest VM running a hypervisor)
├── KVM / Hyper-V / VMware
│
└── L2 (guest-of-guest VM)
└── Application code
How KVM handles nested VMX (Intel):
L1 executes VMLAUNCH to start L2:
1. This causes a VM exit to L0 (VMLAUNCH traps)
2. L0 KVM reads L1's VMCS12 (the VMCS that L1 prepared for L2)
3. L0 merges VMCS12 with its own VMCS01 to create VMCS02:
- VMCS02.guest_state = VMCS12.guest_state (L2's registers)
- VMCS02.host_state = VMCS01.host_state (L0's state; hardware always exits to L0)
- VMCS02.controls = merge of VMCS01 and VMCS12 controls
- VMCS02.EPTP = composed EPT (L2 GPA -> L1 GPA -> L0 HPA)
4. L0 does VMLAUNCH with VMCS02 (L2 runs on real hardware)
L2 triggers a VM exit:
1. Hardware exits to L0 (the real host)
2. L0 KVM checks: should this exit go to L1?
- If L1 asked to intercept this (in VMCS12): reflect to L1
- If not: L0 handles it internally
3. To reflect to L1: copy L2's state and the exit reason from VMCS02 into
VMCS12, switch back to VMCS01, and resume L1 (which sees an L2 VM exit)
VMCS shadowing (hardware optimization, Haswell+):
- L0 maps a "shadow VMCS" that L1 can VMREAD/VMWRITE without VM exits
- Only VMLAUNCH/VMRESUME and certain field changes still exit to L0
- Dramatically reduces nested virtualization overhead
Performance: Nested virtualization typically adds 10-50% overhead depending on workload and hardware support.
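The VMCS merge and the exit-reflection decision described in this section can be illustrated with a toy model. The structs below are hypothetical and vastly simplified: a real VMCS has hundreds of fields accessed through VMREAD/VMWRITE, and KVM's control merging is field-specific rather than a single bitwise OR. The sketch only shows where each part of VMCS02 comes from.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy register state; stands in for the full guest/host state areas. */
struct toy_state {
    uint64_t rip, rsp, cr3;
};

struct toy_vmcs {
    struct toy_state guest;    /* loaded on VM entry */
    struct toy_state host;     /* loaded on VM exit */
    uint64_t exit_intercepts;  /* hypothetical bitmap: bit n = exit reason n */
};

/* Build VMCS02 from VMCS01 (L0 running L1) and VMCS12 (L1's view of L2). */
struct toy_vmcs merge_vmcs02(const struct toy_vmcs *vmcs01,
                             const struct toy_vmcs *vmcs12)
{
    struct toy_vmcs vmcs02;
    vmcs02.guest = vmcs12->guest;  /* L2's registers, as prepared by L1 */
    vmcs02.host = vmcs01->host;    /* exits from L2 land in L0, not L1 */
    /* L0 must see every exit that either it or L1 cares about. */
    vmcs02.exit_intercepts = vmcs01->exit_intercepts | vmcs12->exit_intercepts;
    return vmcs02;
}

/* After an L2 exit arrives in L0: reflect it to L1 only if L1 asked to
 * intercept that reason in VMCS12; otherwise L0 handles it itself. */
bool should_reflect_to_l1(const struct toy_vmcs *vmcs12, unsigned exit_reason)
{
    assert(exit_reason < 64);
    return (vmcs12->exit_intercepts >> exit_reason) & 1;
}
```

The asymmetry is the point of the design: L1 believes it owns the hardware, but every transition passes through L0, which decides per exit whether L1 needs to know about it.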
12.2 vGPU Technologies
| Technology | Approach | Performance | Use Case |
|---|---|---|---|
| NVIDIA vGPU | Time-sliced GPU sharing (mediated passthrough) | ~90% of bare metal | Enterprise VDI, ML inference |
| Intel GVT-g | Mediated passthrough (vfio-mdev) | ~80% of bare metal | Client/embedded virtualization |
| virtio-gpu | Paravirtual, VMM renders using host GPU | Depends on impl | General display, 3D acceleration |
| SR-IOV GPU | Hardware VF partitioning | Near-native | Data center GPU sharing |
vGPU via vfio-mdev:
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ VM 1 │ │ VM 2 │ │ VM 3 │
│ vGPU driver │ │ vGPU driver │ │ vGPU driver │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
┌──────┴─────────────────┴─────────────────┴───────┐
│ vfio-mdev (mediates access) │
│ /sys/class/mdev_bus/... │
├──────────────────────────────────────────────────┤
│ Physical GPU Driver │
│ (nvidia, i915) │
├──────────────────────────────────────────────────┤
│ Physical GPU (single device) │
└──────────────────────────────────────────────────┘
12.3 QEMU TCG vs KVM
QEMU execution modes:
TCG (Tiny Code Generator) -- Software emulation:
- Translates guest instructions to host instructions at runtime
- Can run guest code for ANY architecture on ANY host
- ~10-100x slower than native
- Used for cross-architecture emulation (ARM on x86, etc.)
- Useful for development when KVM is unavailable
KVM acceleration:
- Guest code runs directly on hardware (VMX/SVM non-root mode)
- Near-native speed (typically <5% overhead for compute)
- Only works when guest arch == host arch (or compatible)
- Privileged instructions trap to KVM for emulation
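QEMU does not mix the two within a running VM: under KVM, instructions that
need emulation (such as MMIO accesses) are handled by KVM's in-kernel emulator,
not by TCG. Fallback happens only at startup, e.g. -machine accel=kvm:tcg
uses TCG for the whole VM when KVM is unavailable.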
12.4 KVM Unit Tests
# The KVM unit test framework (kvm-unit-tests) tests KVM functionality
# by running small programs directly on KVM (no OS, no bootloader):
git clone https://gitlab.com/kvm-unit-tests/kvm-unit-tests.git
cd kvm-unit-tests
./configure
make
./run_tests.sh
# Tests cover:
# - APIC, IOAPIC, PIT, RTC emulation
# - EPT/NPT, VPID, posted interrupts
# - MSR handling, CPUID
# - Exception injection
# - Nested virtualization
# - TSC handling
# - SEV, TDX, CCA
# - ARM: GIC, timers, stage-2
# - RISC-V: SBI, timer, interrupts
12.5 eBPF Integration with KVM
eBPF can observe and even modify KVM behavior:
1. fentry/fexit eBPF programs (Linux 5.5+, requires BTF):
- Attach to KVM kernel functions at entry and exit
- Observe VM exits, interrupt injection, memory mapping
- Example: count VM exits by reason per vCPU
2. BPF-based scheduler (sched_ext, Linux 6.12+):
- Custom scheduling policies for vCPU threads
- Pin vCPUs, implement gang scheduling, NUMA-aware placement
- Example: co-schedule all vCPUs of a latency-sensitive VM
3. Tracing:
- perf + eBPF programs attached to KVM tracepoints
- Build custom KVM performance dashboards
4. KVM + eBPF in guest:
- Guest can run eBPF programs normally
- eBPF verifier works in guest kernel
- No special KVM support needed
12.6 Common Pitfalls and Gotchas
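1. Forgetting KVM_SET_TSS_ADDR on Intel: The KVM API requires this ioctl on Intel hosts. It reserves a three-page region of guest-physical address space (below 4GB, not overlapping any memory slot or MMIO range) that KVM uses internally, for example when emulating real mode on CPUs without unrestricted guest support. AMD hosts do not need it, but the call succeeds there too, so issue it unconditionally on x86 and check the return value.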
2. Invalid guest state (KVM_EXIT_FAIL_ENTRY): The most common cause is inconsistent segment registers or control register settings. The hardware_entry_failure_reason field in the kvm_run struct gives the VMCS error code. Common errors:
- Setting CR0.PG without CR0.PE (paging requires protected mode)
- Setting EFER.LMA without CR4.PAE
- CS.L=1 with CS.D=1 (long mode requires D=0)
- Mismatched segment register attributes (type, S, DPL)
3. Memory region alignment:
- guest_phys_addr must be page-aligned (4KB)
- memory_size must be page-aligned and non-zero (or zero to delete)
- userspace_addr must be page-aligned
- Overlapping slots are not allowed
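4. vCPU thread affinity: vCPU ioctls, including KVM_RUN, should be issued from a single dedicated thread per vCPU, and always from the process that created the VM. The KVM API allows a vCPU fd to move to another thread, but the first ioctl after the switch can incur a performance penalty, so avoid bouncing a vCPU between threads.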
5. CPUID filtering: If you pass through the host's CPUID without filtering, the guest may try to use features that KVM doesn't support. Always start with KVM_GET_SUPPORTED_CPUID and filter based on what your VMM actually supports.
6. Signal handling and KVM_RUN:
If a signal arrives while the vCPU is in guest mode, KVM_RUN returns -1 with errno=EINTR. You must handle this (act on whatever the signal requested, then re-enter KVM_RUN). The kvm_run.immediate_exit field closes the race where the signal arrives after your check but before KVM_RUN enters the guest: have the signal handler set it to 1, and the next KVM_RUN returns -1 with errno=EINTR immediately instead of entering guest mode. Clear the field before re-entering KVM_RUN.
7. Dirty log performance: Enabling dirty logging (KVM_MEM_LOG_DIRTY_PAGES) write-protects the slot's EPT entries, so the first write to each page afterwards causes an EPT violation (VM exit). For large VMs (hundreds of GB), this initial storm of EPT violations can cause a significant pause. Use KVM_CLEAR_DIRTY_LOG (requires KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2) with subranges to avoid re-protecting all pages at once.
8. IRQCHIP must be created before vCPUs: KVM_CREATE_IRQCHIP must be called before KVM_CREATE_VCPU, because each vCPU's in-kernel LAPIC is set up when the vCPU is created. On current x86 kernels, calling KVM_CREATE_IRQCHIP after any vCPU exists fails with EINVAL.
9. Memory slot limits: KVM caps the number of memory slots per VM. The limit was 509 on older x86 kernels and is much larger on recent ones; query it with KVM_CHECK_EXTENSION(KVM_CAP_NR_MEMSLOTS) rather than hard-coding it. Running out of slots is a real problem for VMs with many PCI device BARs.
10. Nested virtualization performance: L2 VM exits are extremely expensive (L2 -> L0 -> L1 -> L0 -> L2). Minimize them by ensuring L1's VMCS intercepts are minimal and that VPID/EPT are properly configured for L2.
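The memory-region rules from pitfall 3 are cheap to check in the VMM before calling KVM_SET_USER_MEMORY_REGION, which gives a clearer error than a bare EINVAL from the ioctl. The sketch below is a minimal validator under stated assumptions: struct mem_region is a hypothetical mirror of the relevant fields of struct kvm_userspace_memory_region, and the function names are illustrative rather than part of any KVM API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GUEST_PAGE_SIZE 4096ULL

/* Hypothetical mirror of the fields that the alignment rules cover. */
struct mem_region {
    uint64_t guest_phys_addr;
    uint64_t memory_size;
    uint64_t userspace_addr;
};

bool is_page_aligned(uint64_t v)
{
    return (v & (GUEST_PAGE_SIZE - 1)) == 0;
}

/* Two half-open ranges [start, start + size) overlap if each one starts
 * before the other ends. */
bool regions_overlap(const struct mem_region *a, const struct mem_region *b)
{
    return a->guest_phys_addr < b->guest_phys_addr + b->memory_size &&
           b->guest_phys_addr < a->guest_phys_addr + a->memory_size;
}

/* Validate a new region against the slots already registered.
 * Deleting a slot (memory_size == 0) is not handled by this sketch. */
bool region_is_valid(const struct mem_region *r,
                     const struct mem_region *existing, size_t n)
{
    if (r->memory_size == 0)
        return false;
    if (!is_page_aligned(r->guest_phys_addr) ||
        !is_page_aligned(r->memory_size) ||
        !is_page_aligned(r->userspace_addr))
        return false;
    /* Reject ranges that wrap around the 64-bit address space. */
    if (r->guest_phys_addr + r->memory_size < r->guest_phys_addr)
        return false;
    for (size_t i = 0; i < n; i++)
        if (regions_overlap(r, &existing[i]))
            return false;
    return true;
}
```

A VMM would typically keep the existing array as its own slot table, which also makes it easy to count slots against the limit described in pitfall 9.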
13. Key References
Core Documentation
- KVM API Documentation -- Documentation/virt/kvm/api.rst in the Linux kernel source tree. The definitive reference for all KVM ioctls.
- Intel SDM Volume 3, Chapters 23-34 -- VMX (Virtual Machine Extensions) specification. Intel 64 and IA-32 Architectures Software Developer's Manual.
- AMD APM Volume 2, Chapter 15 -- SVM (Secure Virtual Machine) specification. AMD64 Architecture Programmer's Manual.
- ARM Architecture Reference Manual -- Chapter D1 (AArch64 System Level Architecture), sections on EL2 and Stage-2 translation.
- RISC-V Privileged Specification -- Chapter 8, Hypervisor Extension.
Key Papers
- Kivity, A. et al. "kvm: the Linux Virtual Machine Monitor." Proceedings of the Linux Symposium, 2007. -- The original KVM paper.
- Adams, K. and Agesen, O. "A Comparison of Software and Hardware Techniques for x86 Virtualization." ASPLOS, 2006. -- Compares binary translation with first-generation hardware-assisted virtualization (VMX/SVM) and explains where each approach wins.
- Ben-Yehuda, M. et al. "The Turtles Project: Design and Implementation of Nested Virtualization." OSDI, 2010. -- Foundational work on nested KVM.
- Amit, N. and Wei, M. "The Design and Implementation of Hyperupcalls." ATC, 2018. -- Optimization for VM exit handling.
- Agache, A. et al. "Firecracker: Lightweight Virtualization for Serverless Applications." NSDI, 2020. -- Firecracker design, including KVM usage and <125ms boot times.
- Kaplan, D. "AMD Memory Encryption." AMD Whitepaper, 2016. -- SEV architecture.
- Intel Corporation. "Intel Trust Domain Extensions (TDX) Module Specification." 2023.
- Uhlig, R. et al. "Intel Virtualization Technology." IEEE Computer, 2005. -- Original VT-x design and motivation.
- Abramson, D. et al. "Intel Virtualization Technology for Directed I/O." Intel Technology Journal, 2006. -- VT-d (IOMMU) for device passthrough.
- Dall, C. and Nieh, J. "KVM/ARM: The Design and Implementation of the Linux ARM Hypervisor." ASPLOS, 2014. -- KVM on ARM architecture.
- Gordon, A. et al. "ELI: Bare-Metal Performance for I/O Virtualization." ASPLOS, 2012. -- Exit-less interrupts for device passthrough.
- Waldspurger, C. "Memory Resource Management in VMware ESX Server." OSDI, 2002. -- Ballooning, content-based page sharing (KSM predecessor), idle memory taxation.
Practical Resources
- kvmtool -- A minimal KVM VMM in ~15K lines of C. Excellent for learning: https://github.com/kvmtool/kvmtool
- rust-vmm -- The Rust VMM crate ecosystem: https://github.com/rust-vmm
- kvm-unit-tests -- Unit test framework for KVM: https://gitlab.com/kvm-unit-tests/kvm-unit-tests
- QEMU source -- https://github.com/qemu/qemu -- the reference VMM implementation
- Firecracker source -- https://github.com/firecracker-microvm/firecracker
- Cloud Hypervisor source -- https://github.com/cloud-hypervisor/cloud-hypervisor
- crosvm source -- https://chromium.googlesource.com/crosvm/crosvm
- Linux kernel KVM documentation -- https://www.kernel.org/doc/html/latest/virt/kvm/
Cross-references within this repository
- VFIO Internals -- IOMMU and device passthrough details complementing Section 6.5
- Critical ISA Instructions -- VMX/SVM/EL2/H-ext instruction details (Section 13: Virtualization)
- Expert-Level Linux Syscalls -- KVM-related syscalls and kernel interfaces
- io_uring Internals -- Async I/O relevant to VMM I/O backends