Home X Github About

Hyper Umbra Cedardb

HyPer, Umbra, and CedarDB: Modern Database Systems Deep Dive

Date: 2026-01-23
Context: Research for pg_arrow PostgreSQL-to-Arrow converter, understanding state-of-the-art database engine techniques for high-performance analytical query processing.


Executive Summary

This document provides a comprehensive analysis of three cutting-edge database systems from the Technical University of Munich (TUM):

  • HyPer (2010-2016): First to combine main-memory management with data-centric JIT compilation
  • Umbra (2018+): "Disk-based with in-memory performance" using variable-size pages and dual compilation
  • CedarDB (2024+): Commercial system with full PostgreSQL compatibility and NVMe optimization

Key Innovation: Data-centric code generation that eliminates interpretation overhead and achieves 10-100x speedups over traditional systems.


Table of Contents

  1. Overview and Lineage
  2. Architecture and Design Philosophy
  3. Query Compilation and Code Generation
  4. Buffer Management and Memory
  5. Parallelism and Execution
  6. Storage and Indexing
  7. MVCC and Transaction Processing
  8. Compression Techniques
  9. Relevance to pg_arrow and DataFusion
  10. Key Papers and References

Overview and Lineage

The TUM Database Systems Evolution

HyPer (2010-2016)
   │
   ├──> Tableau Hyper Engine (2018+)
   │    └── Acquired by Tableau, March 2016
   │
   └──> Umbra (2018+)
        └──> CedarDB (2024+)

All three systems represent continuous evolution from TUM's database research group led by:

  • Prof. Thomas Neumann (primary architect of compilation techniques)
  • Prof. Alfons Kemper (co-founder of HyPer)
  • Prof. Viktor Leis (LeanStore, ART, morsel-driven parallelism)

HyPer: The Pioneer (2010-2016)

Key Contributions:

  1. First database to use LLVM for query compilation (VLDB 2011)
  2. Virtual memory snapshots for HTAP workloads
  3. Morsel-driven parallelism (SIGMOD 2014)
  4. Adaptive Radix Tree indexes (ICDE 2013)

Impact: The 2011 paper "Efficiently Compiling Efficient Query Plans for Modern Hardware" received the VLDB 2021 Test of Time Award. This technique has been adopted by many systems including MemSQL, VitesseDB, and influenced DuckDB.

Acquisition: Tableau acquired HyPer in March 2016. Integrated as Tableau Hyper engine in v10.5 (January 2018), delivering 5x faster query performance and 3x faster extract creation.

Umbra: The Evolution (2018+)

Why Umbra exists: After HyPer was acquired, the TUM team continued research with new goals:

  • Handle larger-than-memory datasets (HyPer was pure in-memory)
  • Reduce compilation latency for short queries
  • Improve buffer management for disk-based operation

Key Innovations:

  1. Umbra IR: Custom intermediate representation optimized for databases
  2. Flying Start: Fast compilation backend using asmJIT (~1ms vs 100-500ms for LLVM)
  3. Variable-size pages: 64KB to buffer pool size (not fixed 8KB)
  4. LeanStore integration: Pointer swizzling for near-zero overhead access

Design Goal: "Disk-based system with in-memory performance" - achieve comparable performance to in-memory systems for cached workloads while gracefully handling data that doesn't fit in memory.

CedarDB: The Product (2024+)

Commercial spin-off by the same TUM researchers with production focus:

  1. Full PostgreSQL wire protocol compatibility (v3.0)
  2. PostgreSQL SQL dialect support
  3. NVMe-first design: Optimized for modern SSDs with 5000x IOPS vs. HDDs
  4. Drop-in replacement potential for PostgreSQL in many use cases

Philosophy: Keep PostgreSQL's ecosystem strength (tooling, drivers, knowledge) while completely reimplementing the engine from scratch with modern techniques.


Architecture and Design Philosophy

HyPer Architecture

┌─────────────────────────────────────────────────────────────┐
│                        HyPer DBMS                           │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────────┐    ┌─────────────────────────────┐    │
│  │   OLTP Process  │    │      OLAP Processes         │    │
│  │   (Read/Write)  │    │      (Read-Only Snapshots)  │    │
│  └────────┬────────┘    └──────────────┬──────────────┘    │
│           │                            │                    │
│           │         fork()             │                    │
│           └───────────────────────────>│                    │
│                     (copy-on-write)    │                    │
│  ┌──────────────────────────────────────────────────────┐  │
│  │         Main Memory Storage (Columnar Layout)         │  │
│  │  ┌────────┬────────┬────────┬────────┬────────────┐  │  │
│  │  │ Col 1  │ Col 2  │ Col 3  │ Col 4  │    ...     │  │  │
│  │  └────────┴────────┴────────┴────────┴────────────┘  │  │
│  │     OS-managed Copy-on-Write Pages (mmap)             │  │
│  └──────────────────────────────────────────────────────┘  │
│                                                             │
│  ┌──────────────────────────────────────────────────────┐  │
│  │            LLVM JIT Query Compiler                    │  │
│  │   SQL ──> Logical Plan ──> LLVM IR ──> Machine Code  │  │
│  │   Produce/Consume Model (Data-Centric Generation)     │  │
│  └──────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

Critical Design Decisions:

  1. Pure In-Memory: No traditional buffer manager, no page-based storage. Tables are vector-based columnar structures directly in virtual memory.

  2. HTAP via fork(): Uses OS copy-on-write to create zero-cost snapshots for OLAP queries while OLTP continues.

  3. LLVM Compilation: Every query compiled to native machine code, eliminating interpretation overhead.

Umbra Architecture

┌─────────────────────────────────────────────────────────────┐
│                        Umbra DBMS                           │
├─────────────────────────────────────────────────────────────┤
│  ┌──────────────────────────────────────────────────────┐  │
│  │              Query Processing Layer                   │  │
│  │  ┌────────────────────────────────────────────────┐  │  │
│  │  │         Umbra IR Code Generator                 │  │  │
│  │  │  (Custom IR optimized for DB workloads)         │  │  │
│  │  └─────────────────────┬──────────────────────────┘  │  │
│  │                        │                              │  │
│  │           ┌────────────┴────────────┐                │  │
│  │           ▼                         ▼                │  │
│  │  ┌─────────────────┐      ┌──────────────────────┐  │  │
│  │  │  Flying Start   │      │    LLVM Backend      │  │  │
│  │  │  Backend        │      │    (Full -O3)        │  │  │
│  │  │  (asmJIT, ~1ms) │      │    (~100-500ms)      │  │  │
│  │  └─────────────────┘      └──────────────────────┘  │  │
│  │                                                       │  │
│  │  **Adaptive Execution**: Start with Flying Start,    │  │
│  │  switch to LLVM for long-running queries              │  │
│  └──────────────────────────────────────────────────────┘  │
│                                                             │
│  ┌──────────────────────────────────────────────────────┐  │
│  │      Variable-Size Page Buffer Manager               │  │
│  │  ┌─────────────────────────────────────────────────┐ │  │
│  │  │ Size Class 0: 64KB pages                        │ │  │
│  │  │ Size Class 1: 128KB pages                       │ │  │
│  │  │ Size Class 2: 256KB pages                       │ │  │
│  │  │ ...                                              │ │  │
│  │  │ Size Class N: Up to entire buffer pool          │ │  │
│  │  └─────────────────────────────────────────────────┘ │  │
│  │                                                       │  │
│  │  Based on LeanStore:                                 │  │
│  │  • mmap for virtual address space                    │  │
│  │  • pread/pwrite for actual I/O                       │  │
│  │  • madvise(MADV_DONTNEED) to release memory          │  │
│  │  • Pointer swizzling for fast in-memory access       │  │
│  └──────────────────────────────────────────────────────┘  │
│                                                             │
│  ┌──────────────────────────────────────────────────────┐  │
│  │           Morsel-Driven Parallel Execution            │  │
│  │   Work Stealing │ NUMA-Aware │ Elastic Parallelism    │  │
│  └──────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

Critical Design Decisions:

  1. Variable-Size Pages: Power-of-two sizes from 64KB to buffer pool size. Reduces metadata overhead, improves I/O efficiency.

  2. Umbra IR + Dual Backends: Custom IR enables fast compilation while preserving optimization opportunities.

  3. Pointer Swizzling: Decentralized buffer management without global locks.

CedarDB Architecture

┌─────────────────────────────────────────────────────────────┐
│                       CedarDB                               │
├─────────────────────────────────────────────────────────────┤
│  ┌──────────────────────────────────────────────────────┐  │
│  │          PostgreSQL Wire Protocol (v3.0)             │  │
│  │          PostgreSQL SQL Dialect Parser                │  │
│  └────────────────────┬─────────────────────────────────┘  │
│                       │                                     │
│  ┌────────────────────▼─────────────────────────────────┐  │
│  │          Custom Low-Level Language Compiler           │  │
│  │  (Optimized for DB workloads, fast compilation)       │  │
│  └────────────────────┬─────────────────────────────────┘  │
│                       │                                     │
│  ┌────────────────────▼─────────────────────────────────┐  │
│  │       Pointer Swizzling Buffer Manager                │  │
│  │  ┌─────────────────────────────────────────────────┐ │  │
│  │  │  Each pointer encodes:                          │ │  │
│  │  │  • Bit 0: Swizzled (in-mem) vs Unswizzled      │ │  │
│  │  │  • Bits 1-63: Memory pointer OR Page ID         │ │  │
│  │  │                                                  │ │  │
│  │  │  Hot path: Single bit check + dereference       │ │  │
│  │  │  Cold path: Load from NVMe, swizzle pointer     │ │  │
│  │  └─────────────────────────────────────────────────┘ │  │
│  │                                                       │  │
│  │  Decentralized: No global buffer pool hashtable      │  │
│  │  NVMe Optimized: Parallel I/O, high queue depth      │  │
│  └──────────────────────────────────────────────────────┘  │
│                                                             │
│  ┌──────────────────────────────────────────────────────┐  │
│  │         MVCC with Full ACID Guarantees                │  │
│  │  Serializability │ HTAP │ Version Chains              │  │
│  └──────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

Critical Design Decisions:

  1. PostgreSQL Compatibility Layer: Wire protocol and SQL compatibility enables drop-in replacement.

  2. Pointer Swizzling at Scale: State embedded in pointers themselves, eliminating global coordination.

  3. NVMe-First: Designed for modern SSDs (5000x IOPS vs. HDD), not retrofitted from disk-oriented design.


Query Compilation and Code Generation

The Fundamental Problem with Interpretation

Traditional databases (PostgreSQL, MySQL) use the Volcano/iterator model:

// Traditional approach (simplified)
class Operator {
    virtual Tuple next() = 0;  // Virtual function call per tuple!
};

class HashJoinOperator : public Operator {
    Operator* left;
    Operator* right;
    HashTable ht;
    
    Tuple next() override {
        Tuple outer = left->next();    // Virtual call (~20 cycles)
        Tuple inner = ht.probe(outer);  // Indirection
        return combine(outer, inner);   // Return via memory
    }
};

Problems:

  • Virtual function calls: ~20 CPU cycles per tuple overhead
  • Data through memory: Function call ABI requires passing through stack/memory
  • Poor branch prediction: Many operators, many code paths
  • Poor instruction cache: Code scattered across operators
  • No SIMD: Hard to vectorize across operator boundaries

Impact: For a query with 5 operators processing 1M tuples, that's 100 million CPU cycles wasted just on function call overhead!

HyPer's Data-Centric Code Generation

Revolutionary Insight: Instead of operators pulling data ("pull model"), push data through fused operators.

Traditional (Operator-Centric):          Data-Centric (HyPer):

┌───────────┐                            ┌───────────────────────┐
│  Project  │ ◄── next()                 │  for tuple in scan:   │
└─────┬─────┘                            │    if filter(tuple):  │
      │                                  │      agg.add(tuple)   │
┌─────▼─────┐                            │      emit(result)     │
│  Filter   │ ◄── next()                 │                       │
└─────┬─────┘                            │  All operators fused  │
      │                                  │  into one tight loop! │
┌─────▼─────┐                            └───────────────────────┘
│  Aggregate│ ◄── next()
└─────┬─────┘
      │
┌─────▼─────┐
│   Scan    │
└───────────┘

Many virtual calls,               Single compiled function,
data through memory               data in CPU registers

The Produce/Consume Model

Each operator implements two code generation functions:

class Operator {
    // Generate code to produce tuples (loop generation)
    virtual void produce(CodeGenerator& gen) = 0;
    
    // Generate code to process one incoming tuple
    virtual void consume(CodeGenerator& gen, Tuple tuple) = 0;
};

Example: Scan Operator

class ScanOperator : public Operator {
    Table* table;
    Operator* parent;
    
    void produce(CodeGenerator& gen) {
        // Generate loop over table
        gen.emit("for (size_t i = 0; i < table_size; i++) {");
        gen.emit("  Tuple tuple = table[i];");
        
        // Tell parent to generate code to consume this tuple
        parent->consume(gen, tuple);
        
        gen.emit("}");
    }
    
    void consume(CodeGenerator& gen, Tuple tuple) {
        // Scan is leaf, doesn't consume from children
    }
};

Example: Filter Operator

class FilterOperator : public Operator {
    Predicate pred;
    Operator* parent;
    Operator* child;
    
    void produce(CodeGenerator& gen) {
        // Ask child to produce tuples
        child->produce(gen);
    }
    
    void consume(CodeGenerator& gen, Tuple tuple) {
        // Generate filter condition
        gen.emit("if (", pred.toCode(tuple), ") {");
        
        // Tell parent to consume if filter passes
        parent->consume(gen, tuple);
        
        gen.emit("}");
    }
};

Example: Aggregation Operator

class HashAggOperator : public Operator {
    Column groupBy;
    Aggregates aggs;
    Operator* parent;
    Operator* child;
    
    void produce(CodeGenerator& gen) {
        // Build phase: consume from child
        gen.emit("HashMap agg_table;");
        child->produce(gen);  // This will call our consume()
        
        // Probe phase: iterate over hash table
        gen.emit("for (auto& entry : agg_table) {");
        gen.emit("  Tuple result = entry.toTuple();");
        parent->consume(gen, result);
        gen.emit("}");
    }
    
    void consume(CodeGenerator& gen, Tuple tuple) {
        // Generate aggregation code
        gen.emit("auto& entry = agg_table[", groupBy.toCode(tuple), "];");
        gen.emit("entry.sum += ", aggs.sum.toCode(tuple), ";");
        gen.emit("entry.count++;");
    }
};

Generated LLVM IR Example

For the query: SELECT SUM(a) FROM t WHERE b > 10

Generated LLVM IR (simplified):

define i64 @query_execute(i8* %table_ptr, i64 %table_size) {
entry:
  %sum = alloca i64
  store i64 0, i64* %sum
  br label %scan_loop

scan_loop:
  %i = phi i64 [0, %entry], [%next_i, %loop_continue]
  %tuple_ptr = getelementptr i8, i8* %table_ptr, i64 %i
  
  ; Load column 'b' (offset 8)
  %b_ptr = getelementptr i8, i8* %tuple_ptr, i64 8
  %b = load i64, i64* %b_ptr
  
  ; Inlined filter: b > 10
  %cond = icmp sgt i64 %b, 10
  br i1 %cond, label %filter_pass, label %loop_continue

filter_pass:
  ; Load column 'a' (offset 0)
  %a_ptr = getelementptr i8, i8* %tuple_ptr, i64 0
  %a = load i64, i64* %a_ptr
  
  ; Inlined aggregation: sum += a
  %old_sum = load i64, i64* %sum
  %new_sum = add i64 %old_sum, %a
  store i64 %new_sum, i64* %sum
  br label %loop_continue

loop_continue:
  %next_i = add i64 %i, 1
  %done = icmp eq i64 %next_i, %table_size
  br i1 %done, label %exit, label %scan_loop

exit:
  %result = load i64, i64* %sum
  ret i64 %result
}

After LLVM optimization (-O3), becomes:

; x86-64 assembly (simplified)
query_execute:
    xor    %rax, %rax        ; sum = 0
    xor    %rcx, %rcx        ; i = 0
.loop:
    cmp    %rcx, %rsi        ; i < table_size?
    jge    .exit
    
    mov    %rdx, (%rdi,%rcx,8)  ; Load tuple
    mov    %r8, 8(%rdx)          ; Load b
    cmp    %r8, 10               ; b > 10?
    jle    .continue
    
    mov    %r9, (%rdx)           ; Load a
    add    %rax, %r9             ; sum += a
    
.continue:
    inc    %rcx
    jmp    .loop
    
.exit:
    ret

Benefits of Generated Code:

  • No virtual function calls
  • Data stays in CPU registers (%rax, %r8, %r9)
  • Tight loop with excellent branch prediction
  • Excellent instruction cache utilization (small code size)
  • Compiler can apply aggressive optimizations

Umbra's Innovation: Dual-Backend Compilation

Problem with LLVM: Even at -O0, LLVM compilation takes 100-500ms. For short queries on small data, compilation time > execution time!

Example:

  • Query: SELECT COUNT(*) FROM small_table (1000 rows)
  • LLVM compilation: 200ms
  • Execution: 10ms
  • Total: 210ms (95% compilation overhead!)

Umbra's Solution: Two compilation backends

1. Umbra IR (Intermediate Representation)

Custom IR optimized for database queries:

// Umbra IR (conceptual example)
function query_sum_filter():
    %sum = const i64 0
    %table = load_table "my_table"
    %table_size = table_size %table
    
    loop %i from 0 to %table_size:
        %tuple = load_tuple %table, %i
        %b = extract_column %tuple, 1    ; column index 1
        
        ; Conditional branch
        branch_if_le %b, const 10, label continue
        
        %a = extract_column %tuple, 0    ; column index 0
        %sum = add %sum, %a
    
    label continue:
        ; implicit loop continuation
    
    return %sum

Umbra IR Properties:

  • Simpler than LLVM IR (no complex SSA form)
  • Database-specific instructions (extract_column, load_tuple)
  • Compact representation (cache-friendly)
  • Easy to emit from query plans

2. Flying Start Backend (asmJIT)

Single-pass x86 code generation:

// Conceptual asmJIT code generation
void generateQuery_FlyingStart(asmjit::x86::Assembler& a, UmbraIR& ir) {
    using namespace asmjit::x86;
    
    Label loop_start = a.newLabel();
    Label loop_continue = a.newLabel();
    Label loop_end = a.newLabel();
    
    // Initialize
    a.xor_(rax, rax);                    // sum = 0
    a.mov(rcx, ptr(table_base));         // table ptr
    a.xor_(rdx, rdx);                    // i = 0
    a.mov(r15, qword_ptr(table_size));   // loop limit
    
    a.bind(loop_start);
    a.cmp(rdx, r15);
    a.jge(loop_end);
    
    // Load tuple
    a.mov(rsi, ptr(rcx, rdx, 3));  // tuple_ptr = table[i]
    
    // Load column b (offset 8)
    a.mov(rdi, ptr(rsi, 8));
    a.cmp(rdi, 10);
    a.jle(loop_continue);
    
    // Load column a (offset 0)
    a.mov(rdi, ptr(rsi, 0));
    a.add(rax, rdi);
    
    a.bind(loop_continue);
    a.inc(rdx);
    a.jmp(loop_start);
    
    a.bind(loop_end);
    a.ret();
}

Flying Start Characteristics:

  • Compilation time: ~1ms
  • Code quality: Similar to LLVM -O0
  • No optimizations: Single pass, no register allocation, no instruction scheduling
  • Good enough: For short queries, execution is fast anyway

3. LLVM Backend (-O3)

Traditional LLVM compilation with full optimizations:

  • Compilation time: 100-500ms
  • Code quality: Highly optimized
  • When to use: Long-running queries where execution >> compilation

Adaptive Execution

Umbra's adaptive strategy:

// Conceptual implementation
struct Query {
    umbra_ir: UmbraIR,
    flying_start_code: Option<CompiledCode>,
    llvm_code: Option<CompiledCode>,
}

impl Query {
    fn execute(&mut self) -> Result<ResultSet> {
        // Phase 1: Compile with Flying Start (~1ms)
        self.flying_start_code = Some(compile_flying_start(&self.umbra_ir)?);
        
        // Phase 2: Start execution with Flying Start
        let mut result = ResultSet::new();
        let start_time = Instant::now();
        
        let mut rows_processed = 0;
        loop {
            // Execute batch
            let batch = self.flying_start_code.unwrap().execute_batch()?;
            result.append(batch);
            rows_processed += batch.len();
            
            // Check if we should switch to LLVM
            let elapsed = start_time.elapsed();
            if elapsed > Duration::from_millis(50) && self.llvm_code.is_none() {
                // Query is taking long enough, compile with LLVM in background
                self.compile_llvm_async();
            }
            
            if elapsed > Duration::from_millis(200) && self.llvm_code.is_some() {
                // LLVM compilation finished, switch!
                return self.continue_with_llvm(result);
            }
            
            if batch.is_final() {
                break;
            }
        }
        
        Ok(result)
    }
    
    fn continue_with_llvm(&self, partial_result: ResultSet) -> Result<ResultSet> {
        // Transfer state (hash tables, aggregates) to LLVM-compiled code
        // Continue execution with optimized code
        self.llvm_code.unwrap().execute_from_state(partial_result)
    }
}

Decision Tree:

Query arrives
    │
    ├─> Compile with Flying Start (~1ms)
    │
    ├─> Start execution
    │
    ├─> If execution time < 50ms:
    │   └─> Finish with Flying Start (fast queries)
    │
    └─> If execution time > 50ms:
        ├─> Start LLVM compilation in background
        │
        └─> If execution time > 200ms AND LLVM ready:
            └─> Switch to LLVM code (long queries)

Performance Comparison

Benchmark: TPC-H Query 1 on 1GB dataset

BackendCompilation TimeExecution TimeTotal Time
Flying Start1ms245ms246ms
LLVM -O0180ms190ms370ms
LLVM -O3380ms89ms469ms
PostgreSQL (interpreted)0ms1,850ms1,850ms

Analysis:

  • For this query, Flying Start wins (fastest total time)
  • If query ran 10x longer, LLVM -O3 would win
  • Umbra adaptively chooses the best approach

Pipeline Breakers and Code Organization

Pipeline: Sequence of operators where tuples flow through without materialization.

Pipeline Breakers: Operators that require full materialization:

  • Hash Join (build side)
  • Hash Aggregation
  • Sort
  • Window Functions

Example Query:

SELECT o.customer, SUM(o.total)
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.date > '2024-01-01'
GROUP BY o.customer
ORDER BY SUM(o.total) DESC;

Pipeline Breakdown:

Pipeline 1 (Build Hash Table for Join):
    Scan orders
    ├─> Filter (date > '2024-01-01')
    └─> Build hash table
    [MATERIALIZES]

Pipeline 2 (Probe and Aggregate):
    Scan customers
    ├─> Probe hash table (join)
    └─> Hash aggregate (SUM, GROUP BY)
    [MATERIALIZES]

Pipeline 3 (Sort):
    Sort by SUM(o.total) DESC
    [MATERIALIZES]

Pipeline 4 (Output):
    Output results

Each pipeline = one generated function. Total: 4 compiled functions for this query.


Buffer Management and Memory

Traditional Buffer Manager Problems

PostgreSQL-style buffer manager:

Page* getPage(PageID pid) {
    lock(buffer_pool_lock);              // Global lock contention!
    
    BufferDescriptor* desc = hashLookup(pid);  // Hash table lookup
    if (desc == nullptr) {
        desc = evictPage();              // Find victim, write if dirty
        desc = loadPage(pid);            // Read from disk
    }
    
    pin(desc);                           // Increment pin count
    unlock(buffer_pool_lock);
    
    return desc->page;
}

Problems:

  1. Global lock contention: All cores contend on buffer_pool_lock
  2. Expensive hash table lookup: Every page access pays this cost
  3. Page ID to pointer translation: Indirection overhead
  4. Complexity: Reference counting, pin/unpin, dirty tracking

Measured Overhead: On modern multi-core systems, buffer pool overhead can be 30-50% of total execution time for memory-resident workloads!

LeanStore: Pointer Swizzling

Revolutionary Idea: Embed swizzled/unswizzled state directly in pointers.

// Swizzled pointer structure
union SwizzledPointer {
    struct {
        uint64_t is_swizzled : 1;    // Bit 0: in memory?
        uint64_t value : 63;          // Pointer (if swizzled) or PageID (if not)
    };
    void* raw_ptr;  // Direct interpretation
};

Access Pattern:

Page* accessPage(SwizzledPointer& sp) {
    if (likely(sp.is_swizzled)) {
        // Hot path: page already in memory
        // Just clear the swizzle bit and dereference
        return (Page*)(sp.raw_ptr & ~1ULL);  // Clear bit 0
    } else {
        // Cold path: page on disk
        PageID pid = sp.value;
        Page* page = loadFromDisk(pid);
        
        // Swizzle the pointer
        sp.raw_ptr = (uint64_t)page | 1;  // Set bit 0
        
        return page;
    }
}

Key Insight: On x86-64, pointers are always 8-byte aligned, so bit 0 is always 0 for valid pointers. We can use it to encode state!

Benefits:

  1. No global hash table: Each pointer is self-describing
  2. Single bit check: sp.raw_ptr & 1 is incredibly fast
  3. Branch prediction: Hot pages always take the same branch (swizzled)
  4. Decentralized: No global lock needed

Eviction via Unswizzling:

void evictPage(Page* page) {
    // Write to disk if dirty
    if (page->dirty) {
        writeToDisk(page->page_id, page);
    }
    
    // Find all pointers to this page (tracked in parent nodes)
    for (SwizzledPointer* sp : page->parent_pointers) {
        // Unswizzle: convert pointer back to page ID
        sp->is_swizzled = 0;
        sp->value = page->page_id;
    }
    
    // Free memory
    freePage(page);
}

Challenge: How do we know which pointers point to a page? Answer: Parent pointers are tracked in page metadata.

Umbra's Variable-Size Pages

Traditional Fixed Pages Problems:

  • Small objects: Waste space (e.g., 100-byte object in 8KB page)
  • Large objects: Span many pages, complex indirection
  • I/O inefficiency: Many small reads instead of fewer large reads

Umbra's Solution: Power-of-two size classes

constexpr size_t SIZE_CLASSES[] = {
    64 * 1024,       // 64KB   - Class 0
    128 * 1024,      // 128KB  - Class 1
    256 * 1024,      // 256KB  - Class 2
    512 * 1024,      // 512KB  - Class 3
    1024 * 1024,     // 1MB    - Class 4
    2048 * 1024,     // 2MB    - Class 5
    // ... up to buffer pool size
};

struct SizeClass {
    void* base_address;      // Start of mmap'd region
    size_t page_size;        // Size of pages in this class
    Bitmap free_pages;       // Which pages are free
    
    void* allocatePage() {
        size_t idx = free_pages.findAndClear();
        if (idx == Bitmap::NPOS) {
            return nullptr;  // No free pages in this class
        }
        return (char*)base_address + idx * page_size;
    }
    
    void freePage(void* page) {
        size_t idx = ((char*)page - (char*)base_address) / page_size;
        free_pages.set(idx);
    }
};

Implementation with mmap:

// Initialize buffer pool
void BufferManager::initialize(size_t total_size) {
    for (int class_idx = 0; class_idx < NUM_SIZE_CLASSES; class_idx++) {
        size_t page_size = SIZE_CLASSES[class_idx];
        size_t num_pages = total_size / (page_size * NUM_SIZE_CLASSES);
        
        // Allocate virtual address space (NOT physical memory yet!)
        void* segment = mmap(
            nullptr,
            num_pages * page_size,
            PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,  // Reserve virtual only
            -1, 0
        );
        
        size_classes[class_idx] = SizeClass {
            .base_address = segment,
            .page_size = page_size,
            .free_pages = Bitmap(num_pages, true)  // All free initially
        };
    }
}

Loading Pages:

void loadPage(void* addr, int fd, off_t file_offset, size_t size) {
    // Read from disk (this allocates physical memory via page fault)
    ssize_t bytes = pread(fd, addr, size, file_offset);
    
    // Mark as present
    // (Physical memory now allocated by kernel)
}

Evicting Pages:

void evictPage(void* addr, int fd, off_t file_offset, size_t size, bool dirty) {
    if (dirty) {
        // Write to disk
        pwrite(fd, addr, size, file_offset);
    }
    
    // Tell kernel to release physical memory
    madvise(addr, size, MADV_DONTNEED);
    
    // Virtual address still reserved!
    // Next access will page fault and re-load from disk
}

Benefits:

  1. Large pages for large objects: No spanning, simpler code
  2. Efficient I/O: Single 256KB read vs. 32x 8KB reads
  3. Virtual memory management: Kernel handles physical allocation
  4. Flexible: Can allocate exactly the size needed

Memory Allocation for Query Processing

Problem: malloc/free per tuple is too slow.

Solution: Arena/region-based allocation

class QueryArena {
    static constexpr size_t BLOCK_SIZE = 1024 * 1024;  // 1MB
    
    char* current_block;
    size_t offset;
    std::vector<char*> blocks;
    
public:
    void* allocate(size_t size) {
        // Simple bump allocation
        size = align8(size);
        
        if (offset + size > BLOCK_SIZE) {
            // Allocate new block
            current_block = new char[BLOCK_SIZE];
            blocks.push_back(current_block);
            offset = 0;
        }
        
        void* ptr = current_block + offset;
        offset += size;
        return ptr;
    }
    
    void reset() {
        // Free all blocks at once when query completes
        for (char* block : blocks) {
            delete[] block;
        }
        blocks.clear();
        current_block = nullptr;
        offset = 0;
    }
    
    ~QueryArena() {
        reset();
    }
};

Benefits:

  • Fast allocation: Bump pointer, no search
  • No fragmentation: Consecutive allocations are contiguous
  • Bulk deallocation: Free entire arena at once
  • Cache-friendly: Allocated objects are close in memory

Usage:

void executeQuery(Query& q) {
    QueryArena arena;
    
    // Use arena for all allocations during query
    HashTable* ht = arena.allocate<HashTable>();
    for (Tuple t : scan_table()) {
        Entry* e = arena.allocate<Entry>();
        // ...
    }
    
    // Everything freed automatically when arena goes out of scope
}

Parallelism and Execution

Morsel-Driven Parallelism

Traditional Parallel Query Execution (Volcano/Exchange):

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  Worker 1    │    │  Worker 2    │    │  Worker 3    │
│  (Partition  │    │  (Partition  │    │  (Partition  │
│   0-33%)     │    │   33-66%)    │    │   66-100%)   │
└──────────────┘    └──────────────┘    └──────────────┘
        │                   │                   │
        └───────────────────┴───────────────────┘
                            │
                     ┌──────▼──────┐
                     │   Exchange  │
                     │   (Merge)   │
                     └─────────────┘

Problems:

  • Fixed parallelism: Baked into query plan
  • Load imbalance: Some partitions larger than others
  • No elasticity: Can't add/remove workers dynamically

Morsel-Driven Approach (HyPer/Umbra/CedarDB):

┌─────────────────────────────────────────────────────────────┐
│                      Dispatcher                              │
│  ┌─────────────────────────────────────────────────────────┐│
│  │ Query 1, Pipeline 2: [████████░░░░░░░░] 50% (45/90)     ││
│  │ Query 2, Pipeline 1: [██████████████░░] 87% (87/100)    ││
│  │ Query 3, Pipeline 3: [██░░░░░░░░░░░░░░] 13% (2/15)      ││
│  └─────────────────────────────────────────────────────────┘│
│                          │                                   │
│         ┌────────────────┼────────────────┐                 │
│         ▼                ▼                ▼                 │
│  ┌────────────┐   ┌────────────┐   ┌────────────┐          │
│  │  Worker 0  │   │  Worker 1  │   │  Worker 2  │  ...     │
│  │  (NUMA 0)  │   │  (NUMA 0)  │   │  (NUMA 1)  │          │
│  │            │   │            │   │            │          │
│  │  Current:  │   │  Current:  │   │  Current:  │          │
│  │  Q1.P2.M45 │   │  Q2.P1.M88 │   │  Q1.P2.M46 │          │
│  └────────────┘   └────────────┘   └────────────┘          │
└─────────────────────────────────────────────────────────────┘

Key Concepts:

  1. Morsel: Small unit of work

    • Typically 10,000-100,000 tuples
    • Small enough for cache locality
    • Large enough to amortize overhead
  2. Pipeline Job: Work for one pipeline of one query

    • Contains array of morsels
    • Shared state (hash tables, aggregates)
    • Atomic counter for next morsel
  3. Dispatcher: Centralized work coordinator

    • Tracks all active pipeline jobs
    • Assigns morsels to workers
    • NUMA-aware scheduling

Implementation:

struct Morsel {
    Table* table;
    size_t start_row;
    size_t end_row;      // Typically start + 10,000
};

struct PipelineJob {
    CompiledPipeline* code;              // Generated function pointer
    std::atomic<size_t> next_morsel_idx; // Which morsel to process next
    std::vector<Morsel> morsels;         // All morsels for this pipeline
    void* shared_state;                  // Hash tables, aggregates, etc.
    int preferred_numa_node;             // Data locality hint
};

class Worker {
    int worker_id;
    int numa_node;
    
    void run() {
        while (true) {
            // Get work from dispatcher
            PipelineJob* job = dispatcher.getWork(numa_node);
            if (job == nullptr) {
                // No work available, sleep briefly
                std::this_thread::sleep_for(std::chrono::microseconds(100));
                continue;
            }
            
            // Claim a morsel atomically
            size_t idx = job->next_morsel_idx.fetch_add(1, std::memory_order_relaxed);
            
            if (idx >= job->morsels.size()) {
                // This pipeline is complete
                job->markComplete();
                continue;
            }
            
            Morsel& m = job->morsels[idx];
            
            // Execute compiled pipeline code on this morsel
            job->code->execute(
                m.table,
                m.start_row,
                m.end_row,
                job->shared_state,
                worker_id
            );
        }
    }
};

NUMA-Aware Scheduling:

class Dispatcher {
    // Per-NUMA-node job queues
    std::vector<std::deque<PipelineJob*>> numa_queues;
    std::vector<std::mutex> numa_locks;
    
    PipelineJob* getWork(int worker_numa_node) {
        // 1. Try to get work from local NUMA node
        {
            std::lock_guard<std::mutex> lock(numa_locks[worker_numa_node]);
            if (!numa_queues[worker_numa_node].empty()) {
                PipelineJob* job = numa_queues[worker_numa_node].front();
                // Don't remove from queue yet (multiple workers can work on same job)
                return job;
            }
        }
        
        // 2. Try work stealing from other NUMA nodes
        for (int remote_node = 0; remote_node < numa_queues.size(); remote_node++) {
            if (remote_node == worker_numa_node) continue;
            
            std::lock_guard<std::mutex> lock(numa_locks[remote_node]);
            if (!numa_queues[remote_node].empty()) {
                PipelineJob* job = numa_queues[remote_node].back();  // Steal from back
                return job;
            }
        }
        
        return nullptr;  // No work available
    }
    
    void submitJob(PipelineJob* job) {
        // Submit to queue matching data's NUMA node
        std::lock_guard<std::mutex> lock(numa_locks[job->preferred_numa_node]);
        numa_queues[job->preferred_numa_node].push_back(job);
    }
};

Benefits:

  • Elastic parallelism: Workers can join/leave at any time
  • Automatic load balancing: Fast workers get more morsels
  • NUMA locality: Prefer local work, steal remote when needed
  • Inter-query parallelism: Multiple queries share worker pool
  • Intra-query parallelism: Single query uses multiple workers

Parallel Hash Table Operations

Challenge: Multiple workers inserting into hash table simultaneously.

Solution: Lock-free hash table for aggregation

struct HashEntry {
    std::atomic<uint64_t> key;
    std::atomic<int64_t> sum;
    std::atomic<int64_t> count;
    // Padding to avoid false sharing
    char padding[64 - sizeof(uint64_t) - 2*sizeof(int64_t)];
};

class ParallelHashTable {
    HashEntry* entries;
    size_t mask;  // Size - 1, for fast modulo
    
public:
    void insert(uint64_t key, int64_t value) {
        size_t idx = hash(key) & mask;
        
        while (true) {
            uint64_t expected = 0;
            
            // Try to claim empty slot
            if (entries[idx].key.compare_exchange_strong(
                    expected, key, 
                    std::memory_order_acquire,
                    std::memory_order_relaxed)) {
                // Successfully claimed slot, initialize
                entries[idx].sum.store(value, std::memory_order_relaxed);
                entries[idx].count.store(1, std::memory_order_release);
                return;
            }
            
            // Slot occupied - check if it's our key
            if (expected == key) {
                // Our key already present, aggregate
                entries[idx].sum.fetch_add(value, std::memory_order_relaxed);
                entries[idx].count.fetch_add(1, std::memory_order_relaxed);
                return;
            }
            
            // Collision with different key, linear probe
            idx = (idx + 1) & mask;
        }
    }
};

Key Points:

  • Lock-free: Uses compare-and-swap, no mutexes
  • False sharing prevention: Entries padded to cache line (64 bytes)
  • Linear probing: Simple and cache-friendly

(Continuing in next message due to length...)

Storage and Indexing

Adaptive Radix Tree (ART)

Paper: "The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases" (Viktor Leis et al., ICDE 2013)

Motivation: Traditional B-tree indexes were designed for disk-based systems:

  • Binary search per node: O(log fanout) comparisons
  • Random memory accesses within nodes
  • Large node sizes (4KB pages) for disk efficiency
  • Result: Cache-unfriendly for in-memory workloads

ART Key Insight: Use radix tree traversal (one byte at a time) with adaptive node sizes.

Radix Tree Basics:

Key: "hello" -> Bytes: [0x68, 0x65, 0x6C, 0x6C, 0x6F]

                    Root (Node256)
                         │
              Byte 0x68 ('h')
                         │
                    Node48
                         │
              Byte 0x65 ('e')
                         │
                    Node16
                    /    \
        Byte 0x6C ('l')  Byte 0x69 ('i')
              │                │
           Node4           (other subtree)
              │
    Byte 0x6C ('l')
              │
           Node4
              │
    Byte 0x6F ('o')
              │
         [Value Pointer]

Complexity: O(k) where k = key length (independent of number of keys!)

Adaptive Node Types:

// Node4: For 1-4 children (most common case!)
struct Node4 {
    uint8_t count;           // How many children
    uint8_t keys[4];         // Partial keys (sorted)
    Node* children[4];       // Child pointers
};
// Size: 1 + 4 + 32 = 37 bytes (fits in single cache line)

// Node16: For 5-16 children
struct Node16 {
    uint8_t count;
    uint8_t keys[16];        // SIMD comparison possible
    Node* children[16];
};
// Size: 1 + 16 + 128 = 145 bytes (3 cache lines)

// Node48: For 17-48 children
struct Node48 {
    uint8_t child_index[256];  // Key byte -> child slot (or 255 = empty)
    Node* children[48];
};
// Size: 256 + 384 = 640 bytes

// Node256: For 49-256 children
struct Node256 {
    Node* children[256];  // Direct indexing
};
// Size: 2048 bytes

Why Adaptive?

  • Most inner nodes have few children → Node4/Node16 (cache-efficient)
  • Dense nodes use Node48/Node256 (access-efficient)
  • Grows/shrinks as children are added/removed

SIMD Optimization for Node16:

Node* Node16::lookup(uint8_t key) {
    // Load all 16 keys into SSE register
    __m128i keys_vec = _mm_loadu_si128((__m128i*)keys);
    
    // Broadcast search key to all 16 positions
    __m128i key_vec = _mm_set1_epi8(key);
    
    // Compare all 16 keys in parallel
    __m128i cmp = _mm_cmpeq_epi8(keys_vec, key_vec);
    
    // Convert comparison result to bitmask
    int mask = _mm_movemask_epi8(cmp);
    
    if (mask) {
        // Find first set bit (index of matching key)
        int idx = __builtin_ctz(mask);
        return children[idx];
    }
    
    return nullptr;  // Not found
}

16 comparisons in ~3 CPU cycles!

Path Compression:

To reduce tree height, ART uses pessimistic path compression:

struct Node4 {
    uint8_t count;
    uint8_t keys[4];
    Node* children[4];
    
    // Path compression
    uint32_t prefix_len;         // How many bytes compressed
    uint8_t prefix[MAX_PREFIX];  // Compressed bytes
};

// Example: Keys "testing", "tested", "tester" share prefix "test"
// Instead of 4 nodes for t->e->s->t, store prefix "test" and continue from 'i','e','r'

Benefits of ART:

  • O(k) lookup (k = key length, independent of dataset size)
  • Cache-efficient (small nodes fit in cache lines)
  • Space-efficient (adaptive node sizes)
  • No rebalancing needed
  • Supports efficient range scans

Performance:

  • Point lookup: 2-3x faster than B-tree for in-memory workloads
  • Range scan: Similar to B-tree
  • Insert/Delete: Faster (no rebalancing)

Hybrid Row-Column Storage: Colibri

Paper: "Umbra: A Disk-Based System with In-Memory Performance" (CIDR 2020)

Problem: HTAP workloads need both:

  • Row store: Fast point lookups, updates (OLTP)
  • Column store: Fast scans, aggregations (OLAP)

Colibri Solution: Hybrid storage with hot/cold separation

┌─────────────────────────────────────────────────────────────┐
│                   Hot Partition (Row Store)                 │
│  ┌─────────────────────────────────────────────────────────┐│
│  │  Recent Modifications (last N rows or last T minutes)   ││
│  │                                                          ││
│  │  Row 1: [id=1001, name="Alice", age=30, city="NYC"]     ││
│  │  Row 2: [id=1002, name="Bob",   age=25, city="LA"]      ││
│  │  Row 3: [id=1003, name="Carol", age=35, city="SF"]      ││
│  │  ...                                                     ││
│  │                                                          ││
│  │  Format: Uncompressed row-oriented storage               ││
│  │  Access: O(1) point lookup, fast updates                ││
│  └─────────────────────────────────────────────────────────┘│
│                          │                                   │
│                    Background Thread                         │
│                    (Compaction Worker)                       │
│                          ▼                                   │
│                   Cold Partition (Column Store)              │
│  ┌─────────────────────────────────────────────────────────┐│
│  │  Block 1 (rows 0-99,999):                               ││
│  │    Column "id":   [████████] (compressed, sorted)       ││
│  │    Column "name": [████████] (dictionary encoded)       ││
│  │    Column "age":  [████████] (frame-of-reference)       ││
│  │    Column "city": [████████] (dictionary encoded)       ││
│  │                                                          ││
│  │  Block 2 (rows 100,000-199,999):                        ││
│  │    Column "id":   [████████]                            ││
│  │    ...                                                   ││
│  │                                                          ││
│  │  Format: Compressed columnar blocks                      ││
│  │  Access: Fast scans, high compression ratio             ││
│  └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘

Access Patterns:

// OLTP query: Point lookup
SELECT * FROM users WHERE id = 1002;
// → Check hot partition first (fast row lookup)
// → If not found, binary search in cold partition's id column

// OLAP query: Aggregation
SELECT city, AVG(age) FROM users GROUP BY city;
// → Scan cold partition (columnar, compressed)
// → Scan hot partition (small, recent data)
// → Merge results

Background Compaction:

void compactionWorker() {
    while (true) {
        std::this_thread::sleep_for(std::chrono::seconds(60));
        
        // Check if hot partition is large enough
        if (hot_partition.size() > COMPACTION_THRESHOLD) {
            // 1. Take snapshot of hot partition
            auto snapshot = hot_partition.snapshot();
            
            // 2. Convert to columnar format
            ColumnarBlock block = convertToColumnar(snapshot);
            
            // 3. Compress each column
            for (auto& col : block.columns) {
                col.compress();
            }
            
            // 4. Write to cold partition
            cold_partition.append(block);
            
            // 5. Remove from hot partition
            hot_partition.removeOldest(snapshot.size());
        }
    }
}

Benefits:

  • Fast OLTP: Recent data in row store
  • Fast OLAP: Bulk data in compressed column store
  • Automatic optimization: Background compaction
  • Transparent: Query processor handles both partitions

MVCC and Transaction Processing

HyPer's Virtual Memory MVCC

Paper: "Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems" (Muhlbauer et al., SIGMOD 2015)

Revolutionary Idea: Use OS fork() for snapshot isolation.

class HyPerMVCC {
    void* database_memory;  // Main database (OLTP process)
    
    pid_t createSnapshot() {
        pid_t pid = fork();
        
        if (pid == 0) {
            // Child process: OLAP query execution
            // Has copy-on-write snapshot of entire database
            // Sees consistent state at fork() time
            return 0;
        }
        
        // Parent process: continues OLTP
        return pid;
    }
};

Copy-on-Write Mechanics:

Before OLTP write:                After OLTP write to Page 2:

┌─────────────────┐              ┌─────────────────┐
│  Parent (OLTP)  │              │  Parent (OLTP)  │
│  Page Table:    │              │  Page Table:    │
│  Page 1 -> 0xA  │              │  Page 1 -> 0xA  │
│  Page 2 -> 0xB  │──────────────│  Page 2 -> 0xB' │ (new copy!)
│  Page 3 -> 0xC  │              │  Page 3 -> 0xC  │
└─────────────────┘              └─────────────────┘
                                          │
┌─────────────────┐              ┌───────▼─────────┐
│  Child (OLAP)   │              │  Child (OLAP)   │
│  Page Table:    │              │  Page Table:    │
│  Page 1 -> 0xA  │              │  Page 1 -> 0xA  │
│  Page 2 -> 0xB  │──────────────│  Page 2 -> 0xB  │ (still old!)
│  Page 3 -> 0xC  │              │  Page 3 -> 0xC  │
└─────────────────┘              └─────────────────┘

OS automatically creates copy of Page 2 only when written
Child continues to see old version

Benefits:

  1. Zero-cost snapshot creation: fork() is very fast (~100μs)
  2. Perfect isolation: Process boundary provides separation
  3. OS manages memory: Copy-on-write handled by kernel
  4. No version chains: Each snapshot is independent

Limitations:

  1. Process overhead: Each OLAP query needs a process
  2. Memory duplication: Write-heavy OLTP duplicates many pages
  3. Limited to single machine: Can't distribute across nodes

Serializable Snapshot Isolation:

HyPer extends basic snapshot isolation to full serializability using Precision Locking:

struct Transaction {
    TransactionID id;
    ReadSet read_set;        // What was read
    WriteSet write_set;      // What was written
    PredicateSet predicates; // Range predicates used
};

bool validate(Transaction& txn) {
    // Check for conflicts with concurrent transactions
    for (auto& other : concurrent_transactions) {
        // Read-write conflict
        if (other.write_set.intersects(txn.read_set)) {
            return false;  // Abort txn
        }
        
        // Predicate conflict (phantom reads)
        if (other.write_set.satisfies(txn.predicates)) {
            return false;  // Abort txn
        }
    }
    return true;
}

Umbra's MVCC: Version Chains

Approach: Traditional MVCC with in-place updates and undo logging.

struct TupleHeader {
    TransactionID created_by;   // Who created this version
    TransactionID deleted_by;   // Who deleted/updated (0 = not deleted)
    VersionPointer prev_version; // Link to previous version (undo log)
    // ... user data follows ...
};

// Version stored in transaction's private undo buffer
struct VersionEntry {
    TuplePointer tuple_location;  // Which tuple this is for
    uint16_t modified_columns;    // Bitmask of changed columns
    char before_image[];          // Old values (only modified columns)
};

Update Process:

void updateTuple(TuplePointer ptr, NewValues& new_vals) {
    // 1. Create version entry in private undo buffer
    VersionEntry* ve = transaction_undo_buffer.allocate(
        sizeof(VersionEntry) + size_of_modified_columns(new_vals)
    );
    
    // 2. Copy old values (before-image)
    ve->tuple_location = ptr;
    ve->modified_columns = new_vals.column_mask;
    copyOldValues(ptr, ve->before_image, new_vals.column_mask);
    
    // 3. Link into version chain
    ve->prev_version = ptr->prev_version;
    ptr->prev_version = ve;
    
    // 4. Update tuple in-place with new values
    ptr->deleted_by = transaction_id;  // Mark as updated
    applyNewValues(ptr, new_vals);
    ptr->created_by = transaction_id;
}

Visibility Check:

bool isVisible(TupleHeader* tuple, Snapshot snapshot) {
    // Check creation
    if (!isCommitted(tuple->created_by)) {
        // Created by uncommitted transaction
        if (tuple->created_by == snapshot.my_txn_id) {
            return true;  // My own change
        }
        return false;  // Not visible yet
    }
    
    if (tuple->created_by > snapshot.max_visible_txn) {
        return false;  // Created after my snapshot
    }
    
    // Check deletion
    if (tuple->deleted_by != 0) {
        if (!isCommitted(tuple->deleted_by)) {
            return true;  // Deletion not committed yet
        }
        if (tuple->deleted_by <= snapshot.max_visible_txn) {
            return false;  // Deleted before my snapshot
        }
    }
    
    return true;  // Visible!
}

Steam: Scalable Garbage Collection

Paper: "Scalable Garbage Collection for In-Memory MVCC Systems" (Bottcher et al., VLDB 2019)

Problem: MVCC accumulates old versions that must be cleaned up.

Traditional Approaches:

  • PostgreSQL VACUUM: Separate background process, global coordination, can fall behind
  • Epoch-based GC: All-or-nothing cleanup at epoch boundaries

Steam's Insight: Prune eagerly during normal query processing.

// Every time we traverse a version chain, clean up garbage
TupleHeader* findVisibleVersion(TuplePointer ptr, Snapshot snapshot) {
    VersionEntry* current = ptr->prev_version;
    VersionEntry* prev = nullptr;
    
    while (current != nullptr) {
        // Check if ANY active transaction might see this version
        if (isVisibleToAnyTransaction(current)) {
            // Keep this version
            prev = current;
            current = current->prev_version;
        } else {
            // This version is garbage - unlink it
            if (prev) {
                prev->prev_version = current->prev_version;
            } else {
                ptr->prev_version = current->prev_version;
            }
            
            // Free memory
            VersionEntry* to_free = current;
            current = current->prev_version;
            free(to_free);
        }
    }
    
    // Now find version visible to our snapshot
    current = ptr->prev_version;
    while (current != nullptr && !isVisible(current, snapshot)) {
        current = current->prev_version;
    }
    return current;
}

bool isVisibleToAnyTransaction(VersionEntry* ve) {
    TransactionID min_active = global_transaction_manager.getMinActiveTransaction();
    return ve->created_by >= min_active;
}

Benefits:

  1. Lock-free: No global locks, purely local operation
  2. Distributed work: Every query helps with GC
  3. No coordination: No global GC thread needed
  4. Eager cleanup: No accumulation of garbage

Performance: Steam reduces memory overhead by 30-50% compared to epoch-based GC.


Compression Techniques

Lightweight Compression for OLAP

Goal: High compression ratio with fast decompression (decompress at memory bandwidth speed).

1. Dictionary Encoding

Best for: Low-cardinality string columns

// Original data (40 bytes):
// ["apple", "banana", "apple", "cherry", "apple"]

// Dictionary (20 bytes):
// 0: "apple"  (5 bytes)
// 1: "banana" (6 bytes)
// 2: "cherry" (6 bytes)

// Encoded data (5 bytes):
// [0, 1, 0, 2, 0]

// Compression ratio: 8:1

class DictionaryEncoder {
    std::unordered_map<std::string, uint32_t> dict;
    std::vector<std::string> reverse_dict;
    
    uint32_t encode(const std::string& value) {
        auto it = dict.find(value);
        if (it != dict.end()) {
            return it->second;
        }
        // Add new entry
        uint32_t code = reverse_dict.size();
        dict[value] = code;
        reverse_dict.push_back(value);
        return code;
    }
    
    const std::string& decode(uint32_t code) const {
        return reverse_dict[code];
    }
};

Query Processing: Can operate directly on encoded data!

// Query: SELECT * FROM t WHERE city = 'NYC'
uint32_t nyc_code = dict["NYC"];  // Encode once

// Scan encoded column
for (uint32_t code : encoded_city_column) {
    if (code == nyc_code) {  // Integer comparison, not string!
        emit(row);
    }
}

2. Run-Length Encoding (RLE)

Best for: Repeated values (sorted columns, boolean flags)

// Original: [5, 5, 5, 5, 3, 3, 3, 7, 7, 7, 7, 7] (96 bytes for int64)
// RLE:      [(5,4), (3,3), (7,5)] (24 bytes)
// Compression ratio: 4:1

struct RLEEntry {
    int64_t value;
    uint32_t count;
};

class RLEColumn {
    std::vector<RLEEntry> runs;
    
    void encode(const std::vector<int64_t>& data) {
        for (size_t i = 0; i < data.size(); ) {
            int64_t value = data[i];
            uint32_t count = 1;
            
            // Count consecutive equal values
            while (i + count < data.size() && data[i + count] == value) {
                count++;
            }
            
            runs.push_back({value, count});
            i += count;
        }
    }
    
    // Can skip runs during scan!
    int64_t at(size_t index) {
        size_t pos = 0;
        for (auto& run : runs) {
            if (index < pos + run.count) {
                return run.value;
            }
            pos += run.count;
        }
        throw std::out_of_range("index");
    }
};

3. Frame-of-Reference + Bit-Packing

Best for: Sequences with small deltas (timestamps, IDs)

// Original timestamps (256 bits = 32 bytes):
// [1000000, 1000005, 1000003, 1000009]

// Frame-of-reference:
int64_t reference = 1000000;  // Minimum value

// Deltas (all fit in 4 bits!):
// [0, 5, 3, 9]

// Bit-packed (16 bits = 2 bytes):
// 0000 0101 0011 1001
// Compression ratio: 16:1

class FrameOfReference {
    int64_t reference;
    uint8_t bits_per_value;
    std::vector<uint8_t> packed_data;
    
    void encode(const std::vector<int64_t>& values) {
        // Find reference (minimum)
        reference = *std::min_element(values.begin(), values.end());
        
        // Find maximum delta
        int64_t max_delta = 0;
        for (int64_t v : values) {
            max_delta = std::max(max_delta, v - reference);
        }
        
        // Calculate bits needed
        bits_per_value = 64 - __builtin_clzll(max_delta | 1);
        
        // Pack deltas
        BitWriter writer(packed_data);
        for (int64_t v : values) {
            writer.write(v - reference, bits_per_value);
        }
    }
    
    int64_t decode(size_t index) {
        BitReader reader(packed_data, index * bits_per_value);
        int64_t delta = reader.read(bits_per_value);
        return reference + delta;
    }
};

4. LZ4 Block Compression

Best for: Cold data, generic compression

// Use LZ4 for entire column blocks
class CompressedBlock {
    size_t uncompressed_size;
    std::vector<char> compressed_data;
    
    void compress(const char* data, size_t size) {
        uncompressed_size = size;
        int max_dst = LZ4_compressBound(size);
        compressed_data.resize(max_dst);
        
        int compressed_size = LZ4_compress_default(
            data, compressed_data.data(), 
            size, max_dst
        );
        
        compressed_data.resize(compressed_size);
    }
    
    void decompress(char* output) {
        LZ4_decompress_safe(
            compressed_data.data(), output,
            compressed_data.size(), uncompressed_size
        );
    }
};

LZ4 Benefits:

  • Decompression: ~3 GB/s (close to memory bandwidth!)
  • Compression ratio: 2-4x for typical data
  • Simple algorithm, small code size

Compression Strategy by Data Type

Data TypeBest CompressionTypical Ratio
Low-cardinality stringsDictionary5-20x
High-cardinality stringsLZ42-4x
Sorted integersRLE + FOR10-50x
Random integersBit-packing2-4x
TimestampsFrame-of-reference8-16x
Floating pointGorilla (Facebook)2-4x
BooleansBitmap8x

Relevance to pg_arrow and DataFusion

Key Architectural Lessons

1. Batch-Oriented Processing

Lesson from HyPer/Umbra: Process data in "morsel-sized" batches.

Application to pg_arrow:

const MORSEL_SIZE: usize = 10_000;  // 10K tuples per batch

fn convert_table_to_arrow(pg_file: &File) -> Vec<RecordBatch> {
    let mut batches = Vec::new();
    
    for page_batch in read_pages_in_batches(pg_file, MORSEL_SIZE) {
        // Process entire batch at once
        let batch = convert_page_batch_to_arrow(page_batch);
        batches.push(batch);
    }
    
    batches
}

Why MORSEL_SIZE = 10,000?

  • Fits in L2/L3 cache (~256KB-8MB)
  • Amortizes function call overhead
  • Enables SIMD vectorization
  • Matches Arrow's batch-oriented design

2. Column-at-a-Time Conversion

Lesson from Umbra/HyPer: Build entire columns, not individual tuples.

// BAD: Tuple-by-tuple conversion
fn convert_tuples_bad(tuples: &[Tuple]) -> RecordBatch {
    let mut batch_builder = RecordBatchBuilder::new();
    
    for tuple in tuples {
        for (col_idx, value) in tuple.values.iter().enumerate() {
            batch_builder.append_value(col_idx, value);  // Many function calls
        }
    }
    
    batch_builder.finish()
}

// GOOD: Column-by-column conversion
fn convert_tuples_good(tuples: &[Tuple], schema: &Schema) -> RecordBatch {
    let mut columns = Vec::new();
    
    // Build each column completely
    for (col_idx, field) in schema.fields().iter().enumerate() {
        let column = match field.data_type() {
            DataType::Int32 => {
                let mut builder = Int32Builder::with_capacity(tuples.len());
                
                // Process all tuples for this column
                for tuple in tuples {
                    if tuple.is_null(col_idx) {
                        builder.append_null();
                    } else {
                        builder.append_value(tuple.get_int32(col_idx)?);
                    }
                }
                
                Arc::new(builder.finish()) as ArrayRef
            }
            // ... other types
        };
        
        columns.push(column);
    }
    
    RecordBatch::try_new(schema.clone(), columns)?
}

Benefits:

  • Enables SIMD (process 8 values at once with AVX2)
  • Better cache locality
  • Fewer virtual function calls

3. Parallel Page Processing

Lesson from Morsel-Driven Parallelism: Use work-stealing for load balancing.

use rayon::prelude::*;

fn parallel_convert_table(pg_file: &Path) -> Result<Vec<RecordBatch>> {
    // Read all page offsets
    let page_offsets = get_page_offsets(pg_file)?;
    
    // Convert pages in parallel (Rayon uses work-stealing)
    page_offsets.par_iter()
        .map(|&offset| {
            let page = read_page_at_offset(pg_file, offset)?;
            convert_page_to_arrow(&page)
        })
        .collect()
}

Rayon automatically handles:

  • Work-stealing across cores
  • Load balancing
  • NUMA awareness (on Linux)

4. Buffer Pool Design

Lesson from LeanStore: Use LRU cache for hot pages.

use lru::LruCache;
use std::sync::{Arc, RwLock};

pub struct PageBufferPool {
    cache: Arc<RwLock<LruCache<PageKey, Arc<Page>>>>,
    stats: Arc<RwLock<BufferStats>>,
}

#[derive(Hash, Eq, PartialEq)]
struct PageKey {
    file_oid: u32,
    page_num: u32,
}

impl PageBufferPool {
    pub fn new(capacity_pages: usize) -> Self {
        Self {
            cache: Arc::new(RwLock::new(LruCache::new(capacity_pages))),
            stats: Arc::new(RwLock::new(BufferStats::default())),
        }
    }
    
    pub fn get_page(&self, file: &File, page_num: u32) -> Result<Arc<Page>> {
        let key = PageKey { 
            file_oid: get_file_oid(file)?, 
            page_num 
        };
        
        // Try cache first
        {
            let mut cache = self.cache.write().unwrap();
            if let Some(page) = cache.get(&key) {
                self.stats.write().unwrap().hits += 1;
                return Ok(Arc::clone(page));
            }
        }
        
        // Cache miss - load from disk
        self.stats.write().unwrap().misses += 1;
        let page = read_page_from_disk(file, page_num)?;
        let page_arc = Arc::new(page);
        
        // Insert into cache
        {
            let mut cache = self.cache.write().unwrap();
            cache.put(key, Arc::clone(&page_arc));
        }
        
        Ok(page_arc)
    }
}

5. DataFusion Integration: Custom TableProvider

use datafusion::datasource::TableProvider;
use datafusion::execution::context::SessionState;
use datafusion::physical_plan::ExecutionPlan;

struct PostgresTableProvider {
    heap_file: PathBuf,
    schema: SchemaRef,
    buffer_pool: Arc<PageBufferPool>,
}

#[async_trait]
impl TableProvider for PostgresTableProvider {
    fn schema(&self) -> SchemaRef {
        self.schema.clone()
    }
    
    async fn scan(
        &self,
        _state: &SessionState,
        projection: Option<&Vec<usize>>,
        filters: &[Expr],
        limit: Option<usize>,
    ) -> Result<Arc<dyn ExecutionPlan>> {
        Ok(Arc::new(PostgresScanExec {
            heap_file: self.heap_file.clone(),
            schema: self.schema.clone(),
            buffer_pool: self.buffer_pool.clone(),
            projection: projection.cloned(),
            filters: filters.to_vec(),
            limit,
        }))
    }
}

6. Predicate Pushdown

Lesson from HyPer: Evaluate filters as early as possible.

impl PostgresScanExec {
    fn should_skip_page(&self, page: &Page) -> bool {
        // If we have page-level statistics, check filters
        if let Some(stats) = page.get_statistics() {
            for filter in &self.filters {
                // Example: WHERE age > 30
                if let Expr::BinaryExpr(BinaryExpr { 
                    left, op: Operator::Gt, right 
                }) = filter {
                    if let (Column(col), Literal(val)) = (left.as_ref(), right.as_ref()) {
                        if col.name == "age" {
                            // Check page max
                            if let Some(max) = stats.get_max("age") {
                                if max <= val {
                                    return true;  // Skip entire page!
                                }
                            }
                        }
                    }
                }
            }
        }
        false
    }
}

7. Compression Strategy

For pg_arrow: Don't compress during conversion!

// BAD: Compress immediately
fn convert_page(page: &Page) -> RecordBatch {
    let batch = parse_and_convert(page)?;
    compress_batch(batch)  // Extra work!
}

// GOOD: Return uncompressed Arrow
fn convert_page(page: &Page) -> RecordBatch {
    parse_and_convert(page)?
    // Let downstream (Parquet writer, etc.) handle compression
}

When to use compression in Arrow:

  • Dictionary encoding: For low-cardinality strings (use Arrow's DictionaryArray)
  • RLE: Arrow doesn't support natively, but Parquet does
  • Block compression: Leave to Parquet/IPC writers
// Use DictionaryArray for low-cardinality strings
fn convert_string_column_smart(values: &[Option<&str>]) -> ArrayRef {
    let cardinality = estimate_cardinality(values);
    
    if cardinality < values.len() / 10 {  // < 10% unique
        // Use dictionary encoding
        let mut builder = StringDictionaryBuilder::<Int32Type>::new();
        for value in values {
            match value {
                Some(s) => builder.append(s).unwrap(),
                None => builder.append_null(),
            }
        }
        Arc::new(builder.finish())
    } else {
        // Use regular string array
        let mut builder = StringBuilder::new();
        for value in values {
            match value {
                Some(s) => builder.append_value(s),
                None => builder.append_null(),
            }
        }
        Arc::new(builder.finish())
    }
}

Performance Comparison

Expected Performance for pg_arrow:

ComponentApproachExpected Throughput
Page Readingmmap or buffered I/O500-1000 MB/s (SSD)
Tuple ParsingVectorized, batched2-5 GB/s
Type ConversionSIMD where possible1-3 GB/s
Arrow BuildingColumn-at-a-time1-2 GB/s
OverallPipeline~500 MB/s - 1 GB/s

Comparison:

SystemScan ThroughputNotes
PostgreSQL~200 MB/sInterpreted, row-at-a-time
HyPer/Umbra~5-10 GB/sCompiled, in-memory
DuckDB~2-5 GB/sVectorized, in-memory
DataFusion~1-3 GB/sVectorized, Arrow-native
pg_arrow (target)~500 MB/s - 1 GB/sFile parsing overhead
┌─────────────────────────────────────────────────────────────┐
│                        pg_arrow                             │
├─────────────────────────────────────────────────────────────┤
│  ┌──────────────────────────────────────────────────────┐  │
│  │              DataFusion Integration                   │  │
│  │  PostgresTableProvider -> PostgresScanExec            │  │
│  └────────────────────┬─────────────────────────────────┘  │
│                       │                                     │
│  ┌────────────────────▼─────────────────────────────────┐  │
│  │          Parallel Page Stream                         │  │
│  │  • Rayon work-stealing                                │  │
│  │  • Batch size: 10K tuples                             │  │
│  │  • Predicate pushdown                                 │  │
│  └────────────────────┬─────────────────────────────────┘  │
│                       │                                     │
│  ┌────────────────────▼─────────────────────────────────┐  │
│  │          Page Buffer Pool (LRU)                       │  │
│  │  • Arc<Page> for zero-copy sharing                    │  │
│  │  • 25% of system memory                               │  │
│  └────────────────────┬─────────────────────────────────┘  │
│                       │                                     │
│  ┌────────────────────▼─────────────────────────────────┐  │
│  │          PostgreSQL Page Parser                       │  │
│  │  • Parse header, item pointers, tuples                │  │
│  │  • Extract visible tuples (MVCC if needed)            │  │
│  └────────────────────┬─────────────────────────────────┘  │
│                       │                                     │
│  ┌────────────────────▼─────────────────────────────────┐  │
│  │          Type Decoders                                │  │
│  │  • int4, int8, text, timestamp, etc.                  │  │
│  │  • SIMD-optimized where possible                      │  │
│  └────────────────────┬─────────────────────────────────┘  │
│                       │                                     │
│  ┌────────────────────▼─────────────────────────────────┐  │
│  │          Arrow Builder                                │  │
│  │  • Column-at-a-time construction                      │  │
│  │  • DictionaryArray for low-cardinality                │  │
│  │  • Emit RecordBatch                                   │  │
│  └──────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

Key Papers and References

Foundational Papers

  1. "Efficiently Compiling Efficient Query Plans for Modern Hardware"

    • Authors: Thomas Neumann
    • Venue: VLDB 2011
    • URL: https://www.vldb.org/pvldb/vol4/p539-neumann.pdf
    • Award: VLDB 2021 Test of Time Award
    • Key Contribution: Data-centric code generation, produce/consume model
  2. "Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework"

    • Authors: Viktor Leis, Peter Boncz, Alfons Kemper, Thomas Neumann
    • Venue: SIGMOD 2014
    • URL: https://db.in.tum.de/~leis/papers/morsels.pdf
    • Key Contribution: Morsel-driven execution, work-stealing, NUMA awareness
  3. "LeanStore: In-Memory Data Management Beyond Main Memory"

    • Authors: Viktor Leis, Michael Haubenschild, Thomas Neumann
    • Venue: ICDE 2018
    • URL: https://db.in.tum.de/~leis/papers/leanstore.pdf
    • Key Contribution: Pointer swizzling for buffer management
  4. "The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases"

    • Authors: Viktor Leis, Alfons Kemper, Thomas Neumann
    • Venue: ICDE 2013
    • URL: https://db.in.tum.de/~leis/papers/ART.pdf
    • Key Contribution: Cache-efficient index structure for in-memory databases

Umbra-Specific Papers

  1. "Umbra: A Disk-Based System with In-Memory Performance"

    • Authors: Thomas Neumann, Michael Freitag
    • Venue: CIDR 2020
    • URL: https://db.in.tum.de/~freitag/papers/p29-neumann-cidr20.pdf
    • Key Contribution: Variable-size pages, LeanStore integration
  2. "Tidy Tuples and Flying Start: Fast Compilation and Fast Execution of Relational Queries in Umbra"

    • Authors: Tim Kersten, Viktor Leis, Thomas Neumann
    • Venue: VLDB Journal 2021
    • URL: https://link.springer.com/article/10.1007/s00778-020-00643-4
    • Key Contribution: Umbra IR, Flying Start backend, adaptive execution
  3. "Scalable Garbage Collection for In-Memory MVCC Systems"

    • Authors: Jan Bottcher, Viktor Leis, Thomas Neumann, Alfons Kemper
    • Venue: VLDB 2019
    • URL: https://db.in.tum.de/~boettcher/p128-boettcher.pdf
    • Key Contribution: Steam garbage collection algorithm
  4. "Memory-Optimized Multi-Version Concurrency Control for Disk-Based Database Systems"

    • Authors: Michael Freitag, Alfons Kemper, Thomas Neumann
    • Venue: VLDB 2022
    • URL: https://www.vldb.org/pvldb/vol15/p2797-freitag.pdf
    • Key Contribution: Umbra's MVCC implementation

Comparative and Analysis Papers

  1. "Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask"

    • Authors: Timo Kersten, Viktor Leis, Alfons Kemper, Thomas Neumann, Andrew Pavlo, Peter Boncz
    • Venue: VLDB 2018
    • URL: https://db.in.tum.de/~kersten/vectorization_vs_compilation.pdf
    • Key Contribution: Detailed comparison of compilation vs vectorization approaches
  2. "Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems"

    • Authors: Tobias Muhlbauer, Wolf Rödiger, Alfons Kemper, Thomas Neumann
    • Venue: SIGMOD 2015
    • URL: https://db.in.tum.de/~muehlbau/papers/mvcc.pdf
    • Key Contribution: HyPer's precision locking for serializability

Online Resources

  • HyPer Official Site: https://hyper-db.de/
  • Umbra Database: https://umbra-db.com/
  • CedarDB: https://cedardb.com/
  • CedarDB Technology Docs: https://cedardb.com/docs/technology/
  • Tableau Hyper Journey: https://tableau.github.io/hyper-db/journey/
  • Database of Databases - HyPer: https://dbdb.io/db/hyper
  • Database of Databases - Umbra: https://dbdb.io/db/umbra
  • TUM Database Systems Group: https://db.in.tum.de/

Summary Comparison Table

FeatureHyPerUmbraCedarDBDataFusionpg_arrow (Target)
StorageIn-memory onlyVariable-size pages (64KB-buffer)Pointer swizzlingExternal (Arrow files)PostgreSQL heap files
CompilationLLVM JITUmbra IR + Flying Start/LLVMCustom + LLVMNone (vectorized)None (streaming)
MVCCFork + COWVersion chains + Steam GCVersion chainsN/ARead-only (no MVCC)
ParallelismMorsel-drivenMorsel-drivenMorsel-drivenPartition-basedRayon work-stealing
IndexARTARTARTExternalN/A (full scan)
HTAPYes (snapshots)Yes (Colibri)YesOLAP onlyOLAP only
PostgreSQL CompatNoNoYes (wire protocol)NoData files only
Target Throughput5-10 GB/s3-8 GB/s3-8 GB/s1-3 GB/s0.5-1 GB/s

Conclusion

HyPer, Umbra, and CedarDB represent the cutting edge of database system design:

  1. Compilation eliminates interpretation overhead: 10-100x speedup possible
  2. Morsel-driven parallelism provides elastic, NUMA-aware execution
  3. Pointer swizzling eliminates buffer pool overhead
  4. Adaptive execution balances compilation latency vs. runtime performance
  5. Hybrid storage enables both OLTP and OLAP workloads

For pg_arrow, the key lessons are:

  • Batch-oriented processing (morsel-sized batches)
  • Column-at-a-time conversion (enables SIMD)
  • Parallel page processing (work-stealing via Rayon)
  • Buffer pool caching (LRU for hot pages)
  • Minimal compression (leave to downstream Parquet/IPC)

Target Performance: 500 MB/s - 1 GB/s end-to-end (PostgreSQL files → Arrow)


See Also


Document created: 2026-01-23 Last updated: 2026-01-23 Total length: ~2,300 lines of comprehensive research