Buffer Management Predictive Translation

Buffer Management: Predictive Translation

Paper: "Predictive Translation: High-Performance Buffer Management Without the Trade-Offs" Zinsmeister, Nguyen, Leis, Neumann — SIGMOD 2026, TU München

Background: Buffer Manager Design Space

Storage-based DBMSs structure data into fixed-size pages (4–64KB) and rely on a buffer manager to cache frequently accessed pages in RAM. The primary operation is translating a logical page ID (PID) into an in-memory buffer frame address.

Traditional: Hash Table

Every page request does: hashtable.lookup(pid) → frame address → page read.

Used by PostgreSQL, MySQL (InnoDB), SQLite, Shore-MT. Simple, robust, portable.

Bottlenecks (analyzed via Shore-MT):

Dependent indirections: PID → bucket → control block → frame header → buffer frame. Each level depends on the result of the prior, serializing cache misses. Conflicts with superscalar execution.
Contended bucket latches: Exclusive latches on hot buckets cause cache-line invalidation and kill multi-core scalability.

Modern Alternatives and Their Problems

Design	Mechanism	Problem
Pointer swizzling (LeanStore)	Embed frame address directly in B-tree nodes	Invasive; requires modifying all data structure code; can't support multiple/cyclic references (no sibling pointers, no graph traversal)
LIPAH	Fat pointer: (PID, hint address); fall back on mismatch	32-bit PIDs only → 17TB max database; still invasive
vmcache	Use OS page table (TLB) for translation	Requires `exmap` kernel module; metadata ∝ SSD size (480GB overhead for 30TB SSD); hard to use sampling-based eviction; can't use huge pages freely

Design Goals (Section 2.2)

High performance — at most ~1 cache miss overhead for in-memory workloads
Low memory overhead — metadata ∝ buffer pool size, not storage size
Portability — no kernel modules, no data structure invasiveness
Multiple page references — cyclic references, sibling pointers, graph structures
Unlimited PIDs — 64-bit page ID space

Traditional hash table satisfies goals 2–5 but not 1. The paper attacks goal 1 directly.

Superscalar Execution Primer

Modern CPUs are superscalar: they can issue multiple independent loads in parallel.

// Dependent: serialized, ~2× latency
T* ptr2 = *ptr1;    // must wait
T data = *ptr2;     // depends on ptr1 result

// Independent: parallel, ~1× latency
T data1 = *ptr1;    // issued together
T data2 = *ptr2;    // no dependency

The key insight: if hash table translation and page read are independent, the CPU can overlap them.

Predictive Translation (Core Idea)

Assign each PID a preferred buffer frame deterministically:

preferred_frame(pid) = hash(pid) % pool_size

This enables an optimistic fast path: speculatively read from the predicted address while the hash table lookup is in flight.

// Traditional (sequential)
BufferFrame* bf = hashtable.lookup(pid);  // cache miss
read(bf);                                  // depends on above

// Predictive Translation (parallel)
BufferFrame* pred = predictFrame(pid);    // cheap arithmetic
BufferFrame* real = hashtable.lookup(pid); // cache miss (concurrent)
if (isInPredictedFrame(real)):
    read(pred)   // overlaps with lookup via superscalar exec
else:
    read(real)   // rare slow path

The CPU's branch predictor takes the fast path. On misprediction, the CPU flushes the speculative work and falls back — no correctness issue, just a performance penalty.

The branch in the code doesn't trigger explicit parallelism: it merely gives the CPU the opportunity to speculate. No intrinsics or special instructions required.

Managing Frame Conflicts (Section 3.2)

Multiple PIDs share the same preferred frame (e.g., PIDs 1, 11, 21 all prefer frame 1 in a 10-frame pool). Page lifecycle manages placement:

Arbitrary        Preferred
Position         Position
   |                 |
   |-- 1. Load ----> |  (if preferred frame is free)
   |                 |
   |<-- 2. Promote --| (on 2nd+ access, probabilistically)
   |                 |
   |-- 3. Demote ----| (if hotter page wants the frame)
   |                 |
   4. Evict      4. Evict

Key observations:

Cold pages are often one-hit wonders — don't waste preferred frames on them
Hot pages get promoted because they're accessed frequently, increasing promotion probability naturally
Promotion probability: 1/32 (no demotion needed), 1/512 (demotion required)
Cost is amortized: empirically < 1 byte copied per page access

Demotion: if a hotter page P wants the preferred frame currently held by P', the system allocates a new free frame for P', copies P' there, updates the translation layer, then moves P into the preferred slot.

Hash Table Design (Section 4)

Requirements for predictive translation:

Lightweight — must not overwhelm the page read it's being interleaved with
Lock-free — otherwise translation blocks concurrent page access

Chaining with Inlined First Slot

Use chaining (not Cuckoo/linear probing) — open-addressing relocates entries, requiring holding multiple latches simultaneously under contention
Size table at 2× pool capacity → expected fill factor ~50%, 77% of chains have length 1
Inline the first chain slot directly in the bucket array → first probe is a single array lookup, no pointer chase
Inline frame header into the hash table entry → eliminates one more indirection

Chain length distribution at fill factor 0.75: 92% of chains have length ≤ 2. At fill factor 1.0: 97% have length ≤ 3.

Optimistic Latching

No mutex or reader-writer latch on buckets (both require a memory write per operation → cache invalidation)
Use a version counter: incremented when the latch transitions out of exclusive mode
Read path: (1) read version, (2) traverse chain, (3) validate version — if changed, retry
Enables fully lock-free reads → superscalar overlap of hash lookup and page access

Page state (shared/exclusive latch + version counter) packed into a single std::atomic<uint64_t>.

Implementation: PrediCache (Section 5)

Full pseudocode for the optimistic read:

readOptimistic(pid, callback):
  pred = predictFrame(pid)
  loop:
    [fh, latchState] = ht.find(pid)

    // Fast path: 3 conditions must hold
    if fh != null && isUnlatched(latchState) && fh.isInPredictedFrame:
        callback(pred, latchState)  // CPU overlaps this with ht.find above
        if fh.latchState == latchState: return
        else: continue  // version changed, retry

    // Miss: not in cache yet
    if fh == null:
        loadFromStorage(pid); continue

    // Slow path: use actual frame pointer
    if isUnlatched(latchState):
        callback(fh.framePtr, latchState)
        if fh.latchState == latchState:
            maybePromote(pid)  // probabilistic
            return
        else: continue

The critical point: the callback(pred, ...) call and ht.find(pid) are not explicitly parallelized anywhere. The branch on line 3 is what lets the CPU speculate — the hardware does the rest.

Frame vs. Header Separation

Headers stored in hash table; frames in a separate memory pool. Reason: frames must be 512-byte aligned for direct I/O. Inlining headers before frames (LeanStore style) causes massive padding waste. Separation avoids this while still achieving low-indirection access.

Evaluation Summary (Section 6)

Setup: AMD EPYC 9645P (96 cores/192 HT), 384GB RAM, Kioxia PCIe 5.0 NVMe SSD, Linux 6.8.

In-Memory Results

Workload	PrediCache vs vmcache	PrediCache vs LeanStore
TPC-C (192T)	+34%	+19%
Random Read	~equal	~equal
Skewed YCSB	slightly better	better

vmcache lag reasons:

TPC-C page allocation triggers exmap syscalls
vmcache page state array: 8 states per cache line → cache-line ping-pong on adjacent updates ("neworder problem")
vmcache can't use huge pages freely

LeanStore lag reasons: can't use sibling pointers (no multiple refs) → less efficient B-tree layout

Out-of-Memory Results

All three modern designs converge when I/O dominates. Traditional hash table also competitive here — OOM hides its in-memory overhead.

Ablation (random read, single-thread in-memory)

Optimization	Throughput change
Optimistic latching	+7.4× at 192T (scalability)
Inlined first slot	+24.8% (cache locality)
Predictive translation	+20.6% (superscalar hiding)

Microarchitectural Analysis (IPC counters)

System	Cycles	Instructions	IPC	Frontend stalls
PrediCache	3510	1915	0.55	131
vmcache	3939	1315	0.33	150
Traditional	5090	1574	0.31	155

PrediCache executes more instructions than vmcache but achieves 67% better IPC, resulting in fewer total cycles. The extra instructions (hash table work) are hidden by superscalar execution.

Memory Overhead

Design	Overhead at 100× OOM
vmcache	~39% of buffer pool size
PrediCache	~2–3% (proportional to pool, not storage)
LeanStore	~2–3%

vmcache must allocate frame headers for every page on the SSD, not just cached pages.

Preferred Frame Hit Rate

Dataset size	Pages in preferred pos.	Accesses hitting predicted frame
0.5× pool	75%	96%
3× pool	3%	91%

Even at 3× OOM, 91% of accesses use prediction (hot B-tree inner nodes stay in preferred frames).

Design Comparison Table

	impl.	#pages	metadata	perf	graph refs
Traditional	easy	unlimited	low	low	yes
Ptr. swizzling	hard	unlimited	low	high	no
LIPAH	medium	32-bit only	low	high	yes
vmcache	medium†	limited by metadata	high	high	yes
Predictive Translation	easy	unlimited	low	high	yes

†requires OS modification; hard to support sampling-based eviction

Key Papers

Zinsmeister et al., "Predictive Translation: High-Performance Buffer Management Without the Trade-Offs", SIGMOD 2026
Leis et al., "Virtual-Memory Assisted Buffer Management" (vmcache), SIGMOD 2023
Leis et al., "LeanStore: In-Memory Data Management beyond Main Memory" (pointer swizzling), ICDE 2018
Chang et al., "Resource-Adaptive Query Execution with Paged Memory Management" (LIPAH), CIDR 2025
Johnson et al., "Shore-MT: a scalable storage manager for the multicore era", EDBT 2009
Leis et al., "Optimistic Lock Coupling", IEEE Data Eng. Bull. 2019
Vöhringer & Leis, "Write-Aware Timestamp Tracking" (WATT eviction), VLDB 2023

Limitations and Future Work

No native variable-size page support (orthogonal to transaction processing; tiered pool like Umbra/Bf-Tree is a starting point)
Cloud-native cost model not yet explored (re-evaluating the 5-minute rule for cloud)
Only single-node; disaggregated storage has different buffer management requirements