Home X Github About

Buffer Management Predictive Translation

Buffer Management: Predictive Translation

Paper: "Predictive Translation: High-Performance Buffer Management Without the Trade-Offs" Zinsmeister, Nguyen, Leis, Neumann — SIGMOD 2026, TU München


Background: Buffer Manager Design Space

Storage-based DBMSs structure data into fixed-size pages (4–64KB) and rely on a buffer manager to cache frequently accessed pages in RAM. The primary operation is translating a logical page ID (PID) into an in-memory buffer frame address.

Traditional: Hash Table

Every page request does: hashtable.lookup(pid) → frame address → page read.

Used by PostgreSQL, MySQL (InnoDB), SQLite, Shore-MT. Simple, robust, portable.

Bottlenecks (analyzed via Shore-MT):

  1. Dependent indirections: PID → bucket → control block → frame header → buffer frame. Each level depends on the result of the prior, serializing cache misses. Conflicts with superscalar execution.
  2. Contended bucket latches: Exclusive latches on hot buckets cause cache-line invalidation and kill multi-core scalability.

Modern Alternatives and Their Problems

DesignMechanismProblem
Pointer swizzling (LeanStore)Embed frame address directly in B-tree nodesInvasive; requires modifying all data structure code; can't support multiple/cyclic references (no sibling pointers, no graph traversal)
LIPAHFat pointer: (PID, hint address); fall back on mismatch32-bit PIDs only → 17TB max database; still invasive
vmcacheUse OS page table (TLB) for translationRequires exmap kernel module; metadata ∝ SSD size (480GB overhead for 30TB SSD); hard to use sampling-based eviction; can't use huge pages freely

Design Goals (Section 2.2)

  1. High performance — at most ~1 cache miss overhead for in-memory workloads
  2. Low memory overhead — metadata ∝ buffer pool size, not storage size
  3. Portability — no kernel modules, no data structure invasiveness
  4. Multiple page references — cyclic references, sibling pointers, graph structures
  5. Unlimited PIDs — 64-bit page ID space

Traditional hash table satisfies goals 2–5 but not 1. The paper attacks goal 1 directly.


Superscalar Execution Primer

Modern CPUs are superscalar: they can issue multiple independent loads in parallel.

// Dependent: serialized, ~2× latency
T* ptr2 = *ptr1;    // must wait
T data = *ptr2;     // depends on ptr1 result

// Independent: parallel, ~1× latency
T data1 = *ptr1;    // issued together
T data2 = *ptr2;    // no dependency

The key insight: if hash table translation and page read are independent, the CPU can overlap them.


Predictive Translation (Core Idea)

Assign each PID a preferred buffer frame deterministically:

preferred_frame(pid) = hash(pid) % pool_size

This enables an optimistic fast path: speculatively read from the predicted address while the hash table lookup is in flight.

// Traditional (sequential)
BufferFrame* bf = hashtable.lookup(pid);  // cache miss
read(bf);                                  // depends on above

// Predictive Translation (parallel)
BufferFrame* pred = predictFrame(pid);    // cheap arithmetic
BufferFrame* real = hashtable.lookup(pid); // cache miss (concurrent)
if (isInPredictedFrame(real)):
    read(pred)   // overlaps with lookup via superscalar exec
else:
    read(real)   // rare slow path

The CPU's branch predictor takes the fast path. On misprediction, the CPU flushes the speculative work and falls back — no correctness issue, just a performance penalty.

The branch in the code doesn't trigger explicit parallelism: it merely gives the CPU the opportunity to speculate. No intrinsics or special instructions required.


Managing Frame Conflicts (Section 3.2)

Multiple PIDs share the same preferred frame (e.g., PIDs 1, 11, 21 all prefer frame 1 in a 10-frame pool). Page lifecycle manages placement:

Arbitrary        Preferred
Position         Position
   |                 |
   |-- 1. Load ----> |  (if preferred frame is free)
   |                 |
   |<-- 2. Promote --| (on 2nd+ access, probabilistically)
   |                 |
   |-- 3. Demote ----| (if hotter page wants the frame)
   |                 |
   4. Evict      4. Evict

Key observations:

  • Cold pages are often one-hit wonders — don't waste preferred frames on them
  • Hot pages get promoted because they're accessed frequently, increasing promotion probability naturally
  • Promotion probability: 1/32 (no demotion needed), 1/512 (demotion required)
  • Cost is amortized: empirically < 1 byte copied per page access

Demotion: if a hotter page P wants the preferred frame currently held by P', the system allocates a new free frame for P', copies P' there, updates the translation layer, then moves P into the preferred slot.


Hash Table Design (Section 4)

Requirements for predictive translation:

  1. Lightweight — must not overwhelm the page read it's being interleaved with
  2. Lock-free — otherwise translation blocks concurrent page access

Chaining with Inlined First Slot

  • Use chaining (not Cuckoo/linear probing) — open-addressing relocates entries, requiring holding multiple latches simultaneously under contention
  • Size table at 2× pool capacity → expected fill factor ~50%, 77% of chains have length 1
  • Inline the first chain slot directly in the bucket array → first probe is a single array lookup, no pointer chase
  • Inline frame header into the hash table entry → eliminates one more indirection

Chain length distribution at fill factor 0.75: 92% of chains have length ≤ 2. At fill factor 1.0: 97% have length ≤ 3.

Optimistic Latching

  • No mutex or reader-writer latch on buckets (both require a memory write per operation → cache invalidation)
  • Use a version counter: incremented when the latch transitions out of exclusive mode
  • Read path: (1) read version, (2) traverse chain, (3) validate version — if changed, retry
  • Enables fully lock-free reads → superscalar overlap of hash lookup and page access

Page state (shared/exclusive latch + version counter) packed into a single std::atomic<uint64_t>.


Implementation: PrediCache (Section 5)

Full pseudocode for the optimistic read:

readOptimistic(pid, callback):
  pred = predictFrame(pid)
  loop:
    [fh, latchState] = ht.find(pid)

    // Fast path: 3 conditions must hold
    if fh != null && isUnlatched(latchState) && fh.isInPredictedFrame:
        callback(pred, latchState)  // CPU overlaps this with ht.find above
        if fh.latchState == latchState: return
        else: continue  // version changed, retry

    // Miss: not in cache yet
    if fh == null:
        loadFromStorage(pid); continue

    // Slow path: use actual frame pointer
    if isUnlatched(latchState):
        callback(fh.framePtr, latchState)
        if fh.latchState == latchState:
            maybePromote(pid)  // probabilistic
            return
        else: continue

The critical point: the callback(pred, ...) call and ht.find(pid) are not explicitly parallelized anywhere. The branch on line 3 is what lets the CPU speculate — the hardware does the rest.

Frame vs. Header Separation

Headers stored in hash table; frames in a separate memory pool. Reason: frames must be 512-byte aligned for direct I/O. Inlining headers before frames (LeanStore style) causes massive padding waste. Separation avoids this while still achieving low-indirection access.


Evaluation Summary (Section 6)

Setup: AMD EPYC 9645P (96 cores/192 HT), 384GB RAM, Kioxia PCIe 5.0 NVMe SSD, Linux 6.8.

In-Memory Results

WorkloadPrediCache vs vmcachePrediCache vs LeanStore
TPC-C (192T)+34%+19%
Random Read~equal~equal
Skewed YCSBslightly betterbetter

vmcache lag reasons:

  • TPC-C page allocation triggers exmap syscalls
  • vmcache page state array: 8 states per cache line → cache-line ping-pong on adjacent updates ("neworder problem")
  • vmcache can't use huge pages freely

LeanStore lag reasons: can't use sibling pointers (no multiple refs) → less efficient B-tree layout

Out-of-Memory Results

All three modern designs converge when I/O dominates. Traditional hash table also competitive here — OOM hides its in-memory overhead.

Ablation (random read, single-thread in-memory)

OptimizationThroughput change
Optimistic latching+7.4× at 192T (scalability)
Inlined first slot+24.8% (cache locality)
Predictive translation+20.6% (superscalar hiding)

Microarchitectural Analysis (IPC counters)

SystemCyclesInstructionsIPCFrontend stalls
PrediCache351019150.55131
vmcache393913150.33150
Traditional509015740.31155

PrediCache executes more instructions than vmcache but achieves 67% better IPC, resulting in fewer total cycles. The extra instructions (hash table work) are hidden by superscalar execution.

Memory Overhead

DesignOverhead at 100× OOM
vmcache~39% of buffer pool size
PrediCache~2–3% (proportional to pool, not storage)
LeanStore~2–3%

vmcache must allocate frame headers for every page on the SSD, not just cached pages.

Preferred Frame Hit Rate

Dataset sizePages in preferred pos.Accesses hitting predicted frame
0.5× pool75%96%
3× pool3%91%

Even at 3× OOM, 91% of accesses use prediction (hot B-tree inner nodes stay in preferred frames).


Design Comparison Table

impl.#pagesmetadataperfgraph refs
Traditionaleasyunlimitedlowlowyes
Ptr. swizzlinghardunlimitedlowhighno
LIPAHmedium32-bit onlylowhighyes
vmcachemedium†limited by metadatahighhighyes
Predictive Translationeasyunlimitedlowhighyes

†requires OS modification; hard to support sampling-based eviction


Key Papers

  • Zinsmeister et al., "Predictive Translation: High-Performance Buffer Management Without the Trade-Offs", SIGMOD 2026
  • Leis et al., "Virtual-Memory Assisted Buffer Management" (vmcache), SIGMOD 2023
  • Leis et al., "LeanStore: In-Memory Data Management beyond Main Memory" (pointer swizzling), ICDE 2018
  • Chang et al., "Resource-Adaptive Query Execution with Paged Memory Management" (LIPAH), CIDR 2025
  • Johnson et al., "Shore-MT: a scalable storage manager for the multicore era", EDBT 2009
  • Leis et al., "Optimistic Lock Coupling", IEEE Data Eng. Bull. 2019
  • Vöhringer & Leis, "Write-Aware Timestamp Tracking" (WATT eviction), VLDB 2023

Limitations and Future Work

  • No native variable-size page support (orthogonal to transaction processing; tiered pool like Umbra/Bf-Tree is a starting point)
  • Cloud-native cost model not yet explored (re-evaluating the 5-minute rule for cloud)
  • Only single-node; disaggregated storage has different buffer management requirements

See Also