Buffer Management Predictive Translation
Buffer Management: Predictive Translation
Paper: "Predictive Translation: High-Performance Buffer Management Without the Trade-Offs" Zinsmeister, Nguyen, Leis, Neumann — SIGMOD 2026, TU München
Background: Buffer Manager Design Space
Storage-based DBMSs structure data into fixed-size pages (4–64KB) and rely on a buffer manager to cache frequently accessed pages in RAM. The primary operation is translating a logical page ID (PID) into an in-memory buffer frame address.
Traditional: Hash Table
Every page request does: hashtable.lookup(pid) → frame address → page read.
Used by PostgreSQL, MySQL (InnoDB), SQLite, Shore-MT. Simple, robust, portable.
Bottlenecks (analyzed via Shore-MT):
- Dependent indirections: PID → bucket → control block → frame header → buffer frame. Each level depends on the result of the prior, serializing cache misses. Conflicts with superscalar execution.
- Contended bucket latches: Exclusive latches on hot buckets cause cache-line invalidation and kill multi-core scalability.
Modern Alternatives and Their Problems
| Design | Mechanism | Problem |
|---|---|---|
| Pointer swizzling (LeanStore) | Embed frame address directly in B-tree nodes | Invasive; requires modifying all data structure code; can't support multiple/cyclic references (no sibling pointers, no graph traversal) |
| LIPAH | Fat pointer: (PID, hint address); fall back on mismatch | 32-bit PIDs only → 17TB max database; still invasive |
| vmcache | Use OS page table (TLB) for translation | Requires exmap kernel module; metadata ∝ SSD size (480GB overhead for 30TB SSD); hard to use sampling-based eviction; can't use huge pages freely |
Design Goals (Section 2.2)
- High performance — at most ~1 cache miss overhead for in-memory workloads
- Low memory overhead — metadata ∝ buffer pool size, not storage size
- Portability — no kernel modules, no data structure invasiveness
- Multiple page references — cyclic references, sibling pointers, graph structures
- Unlimited PIDs — 64-bit page ID space
Traditional hash table satisfies goals 2–5 but not 1. The paper attacks goal 1 directly.
Superscalar Execution Primer
Modern CPUs are superscalar: they can issue multiple independent loads in parallel.
// Dependent: serialized, ~2× latency
T* ptr2 = *ptr1; // must wait
T data = *ptr2; // depends on ptr1 result
// Independent: parallel, ~1× latency
T data1 = *ptr1; // issued together
T data2 = *ptr2; // no dependency
The key insight: if hash table translation and page read are independent, the CPU can overlap them.
Predictive Translation (Core Idea)
Assign each PID a preferred buffer frame deterministically:
preferred_frame(pid) = hash(pid) % pool_size
This enables an optimistic fast path: speculatively read from the predicted address while the hash table lookup is in flight.
// Traditional (sequential)
BufferFrame* bf = hashtable.lookup(pid); // cache miss
read(bf); // depends on above
// Predictive Translation (parallel)
BufferFrame* pred = predictFrame(pid); // cheap arithmetic
BufferFrame* real = hashtable.lookup(pid); // cache miss (concurrent)
if (isInPredictedFrame(real)):
read(pred) // overlaps with lookup via superscalar exec
else:
read(real) // rare slow path
The CPU's branch predictor takes the fast path. On misprediction, the CPU flushes the speculative work and falls back — no correctness issue, just a performance penalty.
The branch in the code doesn't trigger explicit parallelism: it merely gives the CPU the opportunity to speculate. No intrinsics or special instructions required.
Managing Frame Conflicts (Section 3.2)
Multiple PIDs share the same preferred frame (e.g., PIDs 1, 11, 21 all prefer frame 1 in a 10-frame pool). Page lifecycle manages placement:
Arbitrary Preferred
Position Position
| |
|-- 1. Load ----> | (if preferred frame is free)
| |
|<-- 2. Promote --| (on 2nd+ access, probabilistically)
| |
|-- 3. Demote ----| (if hotter page wants the frame)
| |
4. Evict 4. Evict
Key observations:
- Cold pages are often one-hit wonders — don't waste preferred frames on them
- Hot pages get promoted because they're accessed frequently, increasing promotion probability naturally
- Promotion probability: 1/32 (no demotion needed), 1/512 (demotion required)
- Cost is amortized: empirically < 1 byte copied per page access
Demotion: if a hotter page P wants the preferred frame currently held by P', the system allocates a new free frame for P', copies P' there, updates the translation layer, then moves P into the preferred slot.
Hash Table Design (Section 4)
Requirements for predictive translation:
- Lightweight — must not overwhelm the page read it's being interleaved with
- Lock-free — otherwise translation blocks concurrent page access
Chaining with Inlined First Slot
- Use chaining (not Cuckoo/linear probing) — open-addressing relocates entries, requiring holding multiple latches simultaneously under contention
- Size table at 2× pool capacity → expected fill factor ~50%, 77% of chains have length 1
- Inline the first chain slot directly in the bucket array → first probe is a single array lookup, no pointer chase
- Inline frame header into the hash table entry → eliminates one more indirection
Chain length distribution at fill factor 0.75: 92% of chains have length ≤ 2. At fill factor 1.0: 97% have length ≤ 3.
Optimistic Latching
- No mutex or reader-writer latch on buckets (both require a memory write per operation → cache invalidation)
- Use a version counter: incremented when the latch transitions out of exclusive mode
- Read path: (1) read version, (2) traverse chain, (3) validate version — if changed, retry
- Enables fully lock-free reads → superscalar overlap of hash lookup and page access
Page state (shared/exclusive latch + version counter) packed into a single std::atomic<uint64_t>.
Implementation: PrediCache (Section 5)
Full pseudocode for the optimistic read:
readOptimistic(pid, callback):
pred = predictFrame(pid)
loop:
[fh, latchState] = ht.find(pid)
// Fast path: 3 conditions must hold
if fh != null && isUnlatched(latchState) && fh.isInPredictedFrame:
callback(pred, latchState) // CPU overlaps this with ht.find above
if fh.latchState == latchState: return
else: continue // version changed, retry
// Miss: not in cache yet
if fh == null:
loadFromStorage(pid); continue
// Slow path: use actual frame pointer
if isUnlatched(latchState):
callback(fh.framePtr, latchState)
if fh.latchState == latchState:
maybePromote(pid) // probabilistic
return
else: continue
The critical point: the callback(pred, ...) call and ht.find(pid) are not explicitly parallelized anywhere. The branch on line 3 is what lets the CPU speculate — the hardware does the rest.
Frame vs. Header Separation
Headers stored in hash table; frames in a separate memory pool. Reason: frames must be 512-byte aligned for direct I/O. Inlining headers before frames (LeanStore style) causes massive padding waste. Separation avoids this while still achieving low-indirection access.
Evaluation Summary (Section 6)
Setup: AMD EPYC 9645P (96 cores/192 HT), 384GB RAM, Kioxia PCIe 5.0 NVMe SSD, Linux 6.8.
In-Memory Results
| Workload | PrediCache vs vmcache | PrediCache vs LeanStore |
|---|---|---|
| TPC-C (192T) | +34% | +19% |
| Random Read | ~equal | ~equal |
| Skewed YCSB | slightly better | better |
vmcache lag reasons:
- TPC-C page allocation triggers
exmapsyscalls - vmcache page state array: 8 states per cache line → cache-line ping-pong on adjacent updates ("neworder problem")
- vmcache can't use huge pages freely
LeanStore lag reasons: can't use sibling pointers (no multiple refs) → less efficient B-tree layout
Out-of-Memory Results
All three modern designs converge when I/O dominates. Traditional hash table also competitive here — OOM hides its in-memory overhead.
Ablation (random read, single-thread in-memory)
| Optimization | Throughput change |
|---|---|
| Optimistic latching | +7.4× at 192T (scalability) |
| Inlined first slot | +24.8% (cache locality) |
| Predictive translation | +20.6% (superscalar hiding) |
Microarchitectural Analysis (IPC counters)
| System | Cycles | Instructions | IPC | Frontend stalls |
|---|---|---|---|---|
| PrediCache | 3510 | 1915 | 0.55 | 131 |
| vmcache | 3939 | 1315 | 0.33 | 150 |
| Traditional | 5090 | 1574 | 0.31 | 155 |
PrediCache executes more instructions than vmcache but achieves 67% better IPC, resulting in fewer total cycles. The extra instructions (hash table work) are hidden by superscalar execution.
Memory Overhead
| Design | Overhead at 100× OOM |
|---|---|
| vmcache | ~39% of buffer pool size |
| PrediCache | ~2–3% (proportional to pool, not storage) |
| LeanStore | ~2–3% |
vmcache must allocate frame headers for every page on the SSD, not just cached pages.
Preferred Frame Hit Rate
| Dataset size | Pages in preferred pos. | Accesses hitting predicted frame |
|---|---|---|
| 0.5× pool | 75% | 96% |
| 3× pool | 3% | 91% |
Even at 3× OOM, 91% of accesses use prediction (hot B-tree inner nodes stay in preferred frames).
Design Comparison Table
| impl. | #pages | metadata | perf | graph refs | |
|---|---|---|---|---|---|
| Traditional | easy | unlimited | low | low | yes |
| Ptr. swizzling | hard | unlimited | low | high | no |
| LIPAH | medium | 32-bit only | low | high | yes |
| vmcache | medium† | limited by metadata | high | high | yes |
| Predictive Translation | easy | unlimited | low | high | yes |
†requires OS modification; hard to support sampling-based eviction
Key Papers
- Zinsmeister et al., "Predictive Translation: High-Performance Buffer Management Without the Trade-Offs", SIGMOD 2026
- Leis et al., "Virtual-Memory Assisted Buffer Management" (vmcache), SIGMOD 2023
- Leis et al., "LeanStore: In-Memory Data Management beyond Main Memory" (pointer swizzling), ICDE 2018
- Chang et al., "Resource-Adaptive Query Execution with Paged Memory Management" (LIPAH), CIDR 2025
- Johnson et al., "Shore-MT: a scalable storage manager for the multicore era", EDBT 2009
- Leis et al., "Optimistic Lock Coupling", IEEE Data Eng. Bull. 2019
- Vöhringer & Leis, "Write-Aware Timestamp Tracking" (WATT eviction), VLDB 2023
Limitations and Future Work
- No native variable-size page support (orthogonal to transaction processing; tiered pool like Umbra/Bf-Tree is a starting point)
- Cloud-native cost model not yet explored (re-evaluating the 5-minute rule for cloud)
- Only single-node; disaggregated storage has different buffer management requirements
See Also
- HyPer/Umbra/CedarDB — Umbra's vmcache and pointer swizzling that this paper directly builds upon
- Disaggregated Storage — Cloud-native buffer management challenges mentioned as future work
- MongoDB/WiredTiger Internals — WiredTiger's cache and eviction policies are a production buffer management system
- ISA Critical Instructions — Memory barriers and cache control instructions relevant to superscalar-aware buffer design