Clickhouse Internals
ClickHouse Internals
Overview
ClickHouse is a column-oriented OLAP DBMS built around the MergeTree family of
storage engines. Its design is an aggressive application of three ideas: (1) columnar
on-disk storage with per-column compression codecs, (2) a sparse primary index that
indexes one row per ~8192-row granule rather than every row, and (3) an LSM-like
"merge" discipline in which every INSERT produces an immutable sorted part and a
background process merge-sorts parts together. Reads are vectorized over blocks of
columns (typically 65536 rows) through a pull-based processor pipeline.
The architecture is closest in spirit to C-Store / Vertica (Stonebraker et al., VLDB 2005) and the column-store analysis in Abadi, Madden, Hachem, "Column-Stores vs. Row-Stores: How Different Are They Really?" (SIGMOD 2008), with lightweight codecs in the lineage of Zukowski et al., "Super-Scalar RAM-CPU Cache Compression" (ICDE 2006). Replication uses a replicated log over ZooKeeper or the Raft-based ClickHouse Keeper.
INSERT ──► [in-memory block sorted by PK] ──► write immutable PART to disk
│
parts: all_1_1_0 all_2_2_0 all_3_3_0 ... ▼
SimpleMergeSelector picks a run
│
merge-sort by PK ──► all_1_3_1 (new part)
│
old parts marked outdated, GC'd
SELECT ─► parse ─► AST ─► QueryPlan ─► QueryPipeline (IProcessor DAG) ─► PipelineExecutor
│
MergeTree reader: PK index → granule ranges → marks → decompress .bin
Source layout references: src/Storages/MergeTree/, src/Processors/, src/Columns/,
src/Compression/, src/Coordination/.
1. MergeTree Part Structure (On-Disk Layout)
A part is an immutable directory. Its name encodes its lineage:
{partition_id}_{min_block}_{max_block}_{level}[_{mutation_version}]
all _ 1 _ 3 _ 1
partition_id: fromPARTITION BY(e.g.202605fortoYYYYMM(date)), or literalallwhen no partition key.min_block/max_block: range of monotonically increasing block numbers (from a ZooKeeper/Keeper counter for replicated tables, local counter otherwise). A fresh insert getsN_N; a merge of1_1and2_2yields1_2.level: number of merges that produced it (raw inserts are level 0).mutation_version: appended after anALTER ... UPDATE/DELETEmutation.
Files in a part
| File | Purpose |
|---|---|
{column}.bin | Compressed column data (one per column, or per stream for complex types) |
{column}.cmrk3 / .mrk3 | Marks: granule → byte offsets into .bin |
primary.idx (or primary.cidx) | Sparse PK index: PK tuple at each granule boundary |
minmax_{col}.idx | Min/max of each partition-key column for the whole part (partition pruning) |
partition.dat | Serialized partition key value for the part |
skp_idx_{name}.idx / .mrk3 | Data-skipping secondary indexes |
checksums.txt | Size + hash (CityHash128) of every file; integrity + replication |
count.txt | Exact row count of the part (avoids reading data for count(*)) |
columns.txt | Ordered list of columns and their types |
serialization.json | Per-column serialization kind (Default vs Sparse) |
default_compression_codec.txt | Codec to apply to columns without explicit codec |
metadata_version.txt | Schema version the part was written under |
A part can be Wide (one .bin per column, the default for large parts) or Compact
(all columns in a single data.bin + data.mrk3, chosen for tiny parts under
min_bytes_for_wide_part / min_rows_for_wide_part, default 10 MB / 1M rows). In-memory
parts (InMemory) existed in older versions but are deprecated.
.bin data file format
A .bin file is a concatenation of independently-decompressable compressed blocks.
Each block header is 25 bytes (not 16 — the checksum is 128-bit):
| Field | Size | Encoding |
|---|---|---|
| checksum | 16 B | CityHash128 of (method byte + sizes + compressed payload) |
| method | 1 B | codec id: 0x82=LZ4, 0x90=ZSTD, 0x02=none |
| compressed_size | 4 B | little-endian; includes the 9-byte (method+sizes) header |
| uncompressed_size | 4 B | little-endian, size after decompression |
| compressed_data | var | the codec payload |
A block holds roughly min_compress_block_size (65536) to max_compress_block_size
(1 MiB) bytes of uncompressed column data, and a block boundary is always forced at a
granule boundary so that any granule can be decompressed without crossing into an
unrelated block. The 16-byte checksum lets ClickHouse verify each block on read
independently of any part-level checksum.
Mark files (.mrk3 / .cmrk3)
Marks map a granule index to where its data begins. With adaptive granularity each mark is 24 bytes:
| Field | Size | Meaning |
|---|---|---|
| offset_in_compressed_file | 8 B | byte offset of the compressed block in .bin |
| offset_in_decompressed_block | 8 B | byte offset of the granule inside that block |
| rows_in_granule | 8 B | how many rows this granule contains |
Older non-adaptive marks (.mrk) were 16 bytes (no rows_in_granule, fixed 8192).
.cmrk3 is the compressed mark file (marks themselves are LZ4-compressed) used by
default since v22.x. To read granule g of a column: take marks[g], seek to
offset_in_compressed_file, decompress that block, skip offset_in_decompressed_block
bytes, read rows_in_granule values.
Granules
The default granule is 8192 rows (index_granularity). With adaptive granularity
(index_granularity_bytes, default 10 MiB) ClickHouse caps a granule so its
uncompressed footprint stays under ~10 MiB; for very wide rows a granule may hold far
fewer than 8192 rows. This keeps the I/O cost of "read one granule" roughly constant
regardless of row width — important because the index addresses granules, not rows.
2. Sparse Primary Index Deep Dive
The PK index is sparse: primary.idx stores the PK tuple of the first row of each
granule only. For a part of R rows and granule size g, the index has ceil(R/g)
entries — for 1 billion rows and g=8192 that is ~122k entries, small enough to keep
in memory (loaded once, cached in the mark/index cache).
primary.idx (PK = (toDate(ts), user_id)):
entry 0 → granule 0 first row → (2026-05-01, 42)
entry 1 → granule 1 first row → (2026-05-01, 991)
entry 2 → granule 2 first row → (2026-05-02, 17)
...
WHERE user_id = 991 AND date = '2026-05-01'
binary search entries → matching granule range [1, 2)
read marks[1] → seek .bin → decompress → scan 8192 rows
Lookup algorithm
function selectGranules(condition, index, marks):
# index entries are sorted ascending by the composite PK (lexicographic)
ranges = []
# KeyCondition turns the WHERE clause into a hyperrectangle test in PK space
for each contiguous run of granules g where
KeyCondition.mayBeTrueInRange(index[g], index[g+1]) is true:
ranges.append( MarkRange(g, g_end) )
# binary search is used when the condition is a prefix-bounded range
return ranges # then read only marks/.bin for these ranges
KeyCondition (src/Storages/MergeTree/KeyCondition.h) evaluates whether a granule's
[low, high) PK interval can satisfy the predicate. Because the index is sparse it is
lossy: a granule is read whenever it might contain a match, then rows are filtered
exactly during scan. This is the same "zone-map" idea as PostgreSQL BRIN, applied at
granule granularity.
Composite keys
Multiple PK columns form one tuple sorted lexicographically. Pruning is only effective
on a prefix: with ORDER BY (a, b, c), a predicate on a prunes well, on b alone
prunes poorly (every a-group repeats the full b range). Rule of thumb: put
low-cardinality, frequently-filtered columns first. ClickHouse blog "How ClickHouse Keeps
Data in Order to Enable Fast Queries" covers this in detail.
Cost: index scan vs full scan
- Full scan: read every granule of every column referenced → bounded by total compressed column bytes.
- Index scan: read only granules in the selected
MarkRanges. For a selective prefix predicate this can be orders of magnitude less I/O. The minimum unit read is one granule (~8192 rows), so a single-row point lookup still decompresses a whole granule — ClickHouse is not a point-lookup store.
ORDER BY and PRIMARY KEY may differ: PRIMARY KEY can be a prefix of ORDER BY,
letting the on-disk sort be finer than the indexed prefix (smaller index, same sort).
3. Data-Skipping Indexes
Secondary indexes summarize blocks of GRANULARITY N granules and let the planner skip
reading them. They are stored as skp_idx_{name}.idx + .mrk3 inside each part and are
consulted after PK pruning, before reading column data.
| Type | Stores per block | Helps when |
|---|---|---|
minmax | (min, max) of the expression | column correlated with PK / monotone-ish |
set(max_rows) | up to max_rows distinct values (drops to "all" if exceeded) | low-cardinality column, equality/IN |
bloom_filter(p) | classic Bloom filter over values, false-positive rate p | high-cardinality equality / IN |
ngrambf_v1(n, size, hashes, seed) | Bloom filter over n-grams | substring LIKE '%x%', hasToken-style |
tokenbf_v1(size, hashes, seed) | Bloom filter over whitespace/punct tokens | word search in text |
DDL: INDEX idx_url url TYPE bloom_filter(0.01) GRANULARITY 4
│
each index granule summarizes 4 data granules (4×8192 = 32768 rows).
WHERE url = 'x' → test bloom filter per index-granule → skip blocks that
definitely lack 'x'; read the rest and filter exactly.
When they help: the indexed column must be physically clustered enough that whole
blocks can be excluded. A minmax on a column uncorrelated with the sort order is
useless (every block spans the full range). Bloom filters help equality on
high-cardinality columns but add write + storage cost and only ever exclude blocks —
they never accelerate a scan that must read the block anyway. EXPLAIN indexes = 1
reports granules dropped per index, the only reliable way to confirm an index earns its
keep.
4. Compression and Codec Internals
ClickHouse separates codecs (per-column, semantic, applied first) from the generic
compressor (LZ4 or ZSTD, applied last). The pipeline is: raw values → codec chain →
generic compression → compressed block in .bin. Codecs are declared per column:
CODEC(Delta, ZSTD(3)).
| Codec | Mechanism | Best for |
|---|---|---|
LZ4 (default) | byte-level LZ77 with 64 KB window; very fast | general columns; default whole-file codec |
ZSTD(level) | entropy-coded LZ with larger window | cold/archival data, better ratio, slower |
Delta(n) | store v[i]-v[i-1] over width-n ints | timestamps, monotone counters |
DoubleDelta | second-order delta + varint, Gorilla-style | near-constant-stride sequences |
Gorilla | XOR consecutive floats, encode leading/trailing zero runs | slowly-changing gauges/metrics |
T64 | transpose 64 values into bit-planes, strip unused high bits | low-range / low-cardinality ints |
FPC, ZSTD_QAT, GCD | specialized (floats / hardware-offload / common divisor) | niche |
Delta and DoubleDelta
Delta(n): first value stored raw, subsequent stored as v[i]-v[i-1] in the same
width. Turns a monotone counter into a stream of small (often constant) values that LZ4
crushes. DoubleDelta stores the delta-of-deltas; for a fixed-stride series (timestamps
every 15 s) the second-order differences are zero and encode to a single bit each. The
varint framing follows the Gorilla paper.
DoubleDelta encode (per Pelkonen et al., VLDB 2015 timestamp scheme):
write v[0] raw; write d1 = v[1]-v[0] raw
for i >= 2:
dod = (v[i]-v[i-1]) - (v[i-1]-v[i-2])
if dod == 0: emit '0'
elif dod in [-63,64]: emit '10' + 7 bits
elif dod in [-255,256]: emit '110' + 9 bits
elif dod in [-2047,2048]: emit '1110'+ 12 bits
else: emit '1111'+ 32/64 bits
Gorilla (float XOR)
From Pelkonen et al., "Gorilla: A Fast, Scalable, In-Memory Time Series Database"
(VLDB 2015). XOR each float with the previous; if equal, emit one 0 bit. Otherwise
emit 1 plus a control bit choosing whether to reuse the previous (leading-zeros,
trailing-zeros) window or store a fresh one, then the meaningful XOR bits between.
Excellent for gauges that change slowly; poor for high-entropy floats.
Gorilla value encode:
xor = bits(v[i]) ^ bits(v[i-1])
if xor == 0: emit '0'
else:
lz = clz(xor); tz = ctz(xor)
if can reuse prev (lz,tz) window: emit '10' + meaningful_bits
else: emit '11' + lz(5b) + len(6b) + meaningful_bits
T64
Takes 64 consecutive integers, computes the active value range, subtracts the min, then
transposes the 64 values × b significant bits into b 64-bit bit-planes — storing
only as many planes as the range needs. A column of values in [0,15] needs 4 planes
instead of 64 full ints. The transposed bit-planes are then LZ4/ZSTD-compressed. This is
ClickHouse's analog of the bit-packing in Zukowski et al. (ICDE 2006).
Codec chaining & ZSTD dictionaries
CODEC(Delta, ZSTD(3)) runs Delta then ZSTD; CODEC(DoubleDelta, LZ4) is common for
timestamps. Order matters: the semantic codec must precede the generic compressor.
LowCardinality(T) is not a codec but a type wrapper — it dictionary-encodes the
column (a shared dictionary of distinct values + an index column), which is the most
impactful "compression" for low-cardinality strings and composes with the codecs above.
5. Merge Process
SimpleMergeSelector
Background merges keep part count bounded (too many parts slows reads and risks the
"too many parts" error). SimpleMergeSelector (src/Storages/MergeTree/) scans the
sorted list of active parts and scores every contiguous range of adjacent parts,
preferring ranges that are balanced in size, not too large in total, and weighting in
part age so small parts left behind get merged eventually. The high-level scoring:
for each contiguous range [i..j] of active parts:
sum = Σ size(parts)
max = max size(parts); cnt = j-i+1
if sum > max_total_size_to_merge: skip # don't build giant parts
# prefer ranges where no single part dominates and that aren't too few/many parts
score = sum / (max + ε) # ~cnt when balanced
# heuristic boost for ranges containing old small parts (base_for_age)
score *= ageBoost(min_age_in_range)
pick range with best score; if best score good enough, schedule a merge
The effect approximates a tiered LSM compaction: many small parts repeatedly fold into
fewer larger parts, with merged-part size bounded by max_bytes_to_merge_at_max_space_in_pool.
Merge steps
1. select range of parts (SimpleMergeSelector)
2. open a sorted reader over each part (each is internally PK-sorted)
3. k-way merge by PK using a heap (MergingSortedAlgorithm)
- engine-specific "merge algorithm" transforms the stream (see below)
4. stream merged rows through column writers → new part directory (temp name)
5. write primary.idx, marks, skip indexes, checksums.txt, count.txt
6. atomically rename temp → final part; mark source parts Outdated
7. source parts removed after old_parts_lifetime once no query references them
Merges run on the background pool (background_pool_size, default 16) with a separate
pool for large merges; selection respects available pool space so a giant merge cannot
starve small ones.
Engine-specific merge transforms
| Engine | What the merge does |
|---|---|
MergeTree | plain k-way sort merge, no row collapsing |
ReplacingMergeTree | among rows with equal sort key, keep the one with max version (or last) |
AggregatingMergeTree | combine AggregateFunction states of equal-key rows via the function's merge |
SummingMergeTree | sum numeric columns for equal-key rows |
CollapsingMergeTree | cancel a Sign=-1 row against a preceding Sign=+1 row |
VersionedCollapsingMergeTree | same, but uses an explicit version column to order pairs |
Crucially these guarantees are eventual: collapsing/replacing only happens during a
merge, so queries must use FINAL, argMax, or -If/-Merge combinators to get fully
reconciled results before merges complete.
TTL merges
TTL expressions (TTL ts + INTERVAL 30 DAY) are evaluated during merges: expired rows
are dropped, or moved (TO DISK/TO VOLUME), or aggregated (GROUP BY ... SET). Column
TTLs reset a column to its default past the TTL. A dedicated TTL-merge schedule ensures
expiry happens even without size-driven merges.
Aggregate state binary format
AggregatingMergeTree stores AggregateFunction(func, types) columns as the function's
serialized intermediate state (e.g. uniqExact → a hash set; avg → (sum, count);
quantileTDigest → a t-digest). The merge calls the function's merge(state_a, state_b)
and serialize; readers must use -Merge combinators (avgMerge, uniqMerge) to
finalize. This is what powers incrementally-maintained rollups via materialized views.
6. Vectorized Execution (Column Block Model)
ClickHouse processes data in Chunks: a Chunk is a vector of IColumn pointers plus
a row count (src/Processors/Chunk.h). Default block size is 65536 rows
(max_block_size). Operating on whole columns amortizes virtual dispatch and exposes
SIMD — the core argument of the column-store literature.
IColumn hierarchy
IColumn
├─ ColumnVector<T> contiguous PODArray<T> (UInt8..Float64, dates)
├─ ColumnString chars: PODArray<UInt8> + offsets: PODArray<UInt64>
├─ ColumnFixedString fixed-width contiguous bytes
├─ ColumnArray data: IColumn + offsets: PODArray<UInt64>
├─ ColumnNullable nested: IColumn + null_map: ColumnVector<UInt8>
├─ ColumnLowCardinality dictionary: IColumn + indexes: IColumn
├─ ColumnConst single value + size (avoids materializing constants)
└─ ColumnTuple / ColumnMap / ColumnSparse ...
ColumnSparse stores only non-default values plus their positions — chosen per part via
serialization.json when a column is mostly default (e.g. mostly-zero metrics), and
arithmetic on it stays sparse.
Function dispatch
IFunction::executeImpl(arguments, result_type, input_rows_count) takes input columns
and returns a result column. The hot loop typically does:
ColumnPtr executeImpl(cols, n):
a = cols[0]; b = cols[1]
if isColumnConst(b): return executeConstRHS(a, constval(b), n) # specialize
res = ColumnVector<R>::create(n)
for i in 0..n: # auto-vectorized by the compiler
res[i] = op(a[i], b[i]) # AVX2/AVX-512 on the contiguous arrays
return res
Specialized template instantiations per numeric type give the autovectorizer tight, branch-free loops.
SIMD usage
- Arithmetic / comparison kernels compile to AVX2/AVX-512 over
PODArray. - String search uses the Volnitsky algorithm (a bigram-hash variant of
Boyer-Moore-Horspool) for substring/
LIKE, with SIMD-accelerated scanning of the haystack. position,hasToken,multiSearchAnyhave dedicated SIMD paths.
Expression DAG & short-circuit
ExpressionActions (src/Interpreters/ActionsDAG) compiles a scalar expression into a
DAG of IFunction nodes; common subexpressions are computed once and reused. ClickHouse
also has an LLVM JIT that fuses chains of arithmetic functions into a single compiled
kernel when compile_expressions is enabled, cutting per-column dispatch (data-centric
codegen in the spirit of Neumann, VLDB 2011). Lazy/short-circuit evaluation
(short_circuit_function_evaluation) skips evaluating expensive arguments on rows masked
out by if/and/or, important for nullable and throwIf-style functions.
7. Query Pipeline (IProcessor Model)
Execution is a pull-based dataflow graph of IProcessor nodes connected by ports
(src/Processors/). Each processor declares InputPorts and OutputPorts and a
prepare()/work() protocol: prepare() inspects port states and returns a status
(NeedData, PortFull, Ready, Async, Finished); work() does the actual CPU
work when Ready.
┌──────────┐ ┌────────────────┐ ┌───────────────┐ ┌──────────┐
│ Source │──►│ FilterTransform│──►│ExpressionTrans │──►│ ... Sink │
│(MergeTree│ │ (WHERE) │ │ (SELECT exprs) │ └──────────┘
│ reader) │ └────────────────┘ └───────────────┘
└──────────┘
▲ resize/merge ports fan-in/out for parallel lanes
│
┌────┴─────┬──────────┐
stream0 stream1 stream2 (one lane per thread; granule ranges split)
Node families:
- ISource:
MergeTreeSource(reads granule ranges),RemoteSource(reads from a shard over native protocol),NullSource, generators. - ISink / IOutputFormat:
MergeTreeSink(writes parts), result-format writers. - ITransform:
FilterTransform,ExpressionTransform,AggregatingTransform,MergingAggregatedTransform,SortingTransform,MergeSortingTransform,LimitTransform,JoiningTransform, plusResizeProcessorto repartition lanes.
Execution
QueryPipelineBuilder builds the DAG from the QueryPlan; PipelineExecutor runs it.
The executor treats work()-ready processors as tasks in a graph, dispatching them
across max_threads worker threads, expanding parallel lanes where the plan allows
(e.g. many MergeTreeSources feeding a parallel AggregatingTransform, then a single
MergingAggregatedTransform). It is a push of readiness, pull of data model — no
thread blocks on I/O if other processors are runnable.
Async I/O & prefetch
For object-store / remote disks, ReadBufferFromRemoteFS issues asynchronous
prefetch for the next granules' compressed blocks while the current granule is being
decompressed and filtered, overlapping network latency with CPU. Local reads use the
page cache plus ClickHouse's own mark cache and uncompressed-block cache.
8. Distributed Query Execution
The Distributed table engine is a view over a cluster of shards (each shard a
local MergeTree-family table, possibly replicated). It does no storage of its own.
client
│ SELECT ... FROM distributed_t GROUP BY k
▼
┌─────────┐ initiator rewrites query for shards (push down WHERE,
│initiator│ partial GROUP BY), then fans out the remote query
└─────────┘
│ │ │
▼ ▼ ▼ per-shard: local read + partial aggregation
shard0 shard1 shard2
│ │ │ RemoteBlockInputStream / RemoteSource pulls
└─────┴─────┘ partial blocks back over the native protocol
│
initiator MergingAggregatedTransform → final result
Planning
The initiator splits the query into an initiator part and a remote part. The
remote part (sent verbatim as SQL plus a "stage" hint) runs on each shard up to
WithMergeableState — i.e. it does the local scan, filter, and a partial aggregation,
returning aggregate states rather than final values. The initiator then merges.
Data exchange & two-level aggregation
Inter-shard transport is the native protocol (length-prefixed columnar blocks) over the TCP port. Aggregation uses a two-level hash table when the group count is large: groups are bucketed by a hash prefix (256 buckets) so the merge step on the initiator can proceed bucket-by-bucket in parallel and shards can stream buckets independently. This is the classic partition-then-merge scheme that bounds initiator memory.
Parallel replicas
parallel_replicas lets a single shard's replicas cooperate on one query: the
coordinator hands out disjoint granule (mark) ranges to each replica as a work queue
(allow_experimental_parallel_reading_from_replicas), so replicas scan different parts of
the same data in parallel and stream partial results back. This turns replicas (normally
just for HA) into read-scaling units, task-stealing style, with the coordinator balancing
ranges dynamically rather than statically splitting by parallel_replicas_count.
9. ReplicatedMergeTree and ClickHouse Keeper
ReplicatedMergeTree replicates via a shared replicated log in ZooKeeper or
ClickHouse Keeper. There is no leader for writes in the Raft sense — any replica accepts
inserts and appends log entries; replicas converge by replaying the log.
Keeper paths
/clickhouse/tables/{shard}/{table}/
├─ log/ sequence of replication tasks (log-0000000001, ...)
├─ replicas/{replica}/
│ ├─ log_pointer last log entry this replica has copied to its queue
│ ├─ queue/ this replica's pending tasks
│ ├─ parts/ parts this replica currently has
│ └─ is_active ephemeral node = replica is alive
├─ block_numbers/ per-partition block-number allocator
├─ blocks/ insert dedup hashes (idempotent retries)
├─ mutations/ ALTER UPDATE/DELETE definitions
└─ columns / metadata current schema + version
Log entry types
| Type | Meaning |
|---|---|
GET_PART | fetch a part another replica produced (e.g. a fresh insert) |
MERGE_PARTS | merge a named set of source parts into a result part |
MUTATE_PART | apply a mutation (UPDATE/DELETE/materialize) to a part |
DROP_RANGE / DROP_PART | remove parts in a block range (partition drop, dedup) |
ALTER_METADATA | apply a schema change |
Replica lifecycle
loop:
if /log has entries beyond my log_pointer:
copy them into my queue/ ; advance log_pointer
pick an executable queue task (deps satisfied, parts present):
GET_PART → download part from a replica that has it (interserver HTTP)
MERGE_PARTS→ run the merge locally (deterministic: all replicas get same bytes)
MUTATE_PART→ apply mutation locally
on success: register result in replicas/{me}/parts ; remove queue node
Merges are assigned via the log, not run independently, so every replica produces
byte-identical merged parts (deterministic merge). A replica can instead GET_PART the
merged result if it falls behind. Inserts are deduplicated by a hash stored under
blocks/, making retries idempotent.
Part fetch protocol
GET_PART downloads the part's files (*.bin, *.mrk3, primary.idx, checksums.txt,
...) from a healthy replica over the interserver HTTP port (default 9009), validating
each file against checksums.txt. With zero_copy_replication on shared object storage,
only metadata is fetched and the data blobs are referenced in place.
ClickHouse Keeper
ClickHouse Keeper (src/Coordination/) is a drop-in ZooKeeper replacement built on the
NuRaft Raft library. It speaks the ZooKeeper client protocol and stores the same
znode tree, but uses Raft for linearizable consensus (leader election + replicated log +
snapshots) instead of ZAB. It is typically co-deployed with ClickHouse, lowering
operational overhead and improving write latency/throughput for the coordination
workload (many small create/set/multi ops). Compaction-style log + periodic
snapshot keep its on-disk state bounded.
10. Materialized Views and Projections
Materialized Views
A ClickHouse MV is an insert trigger, not a maintained cache. After a block is
inserted into the source table, ClickHouse runs the MV's SELECT over just that
block and inserts the result into the MV's target table:
INSERT INTO source (block B)
│
├─► write B to source's MergeTree
└─► for each attached MV:
result = MV.SELECT applied to B # block-at-a-time
INSERT result INTO MV.target_table
The target is usually an AggregatingMergeTree / SummingMergeTree holding partial
states, so rollups are maintained incrementally and finalized at read time with -Merge
combinators. The MV never re-reads history — it only ever sees newly-inserted blocks,
which is why backfills require POPULATE or manual inserts and why joins inside MVs only
see the freshly inserted side. The to-table can be shared by multiple MVs.
Projections
A projection is an alternate physical representation stored inside the same part
directory (projections/{name}/), giving each part a second sort order and/or a
pre-aggregation:
part all_1_3_1/
├─ {column}.bin ... (base data, ORDER BY a)
└─ projections/by_b/
├─ {column}.bin ... (same rows re-sorted by b, or pre-aggregated)
└─ primary.idx
Because the projection lives in the part, it is always consistent with base data and is maintained automatically through merges (no separate refresh). Two kinds:
- Normal projection: same rows, different
ORDER BY— accelerates predicates on a non-PK column. - Aggregate projection:
SELECT ... GROUP BYstored as aggregate states — answers rollup queries by reading far fewer rows.
Projection selection
During planning the optimizer checks each projection: it is usable iff it covers all
columns the query needs (or all needed aggregates) and is expected to read less data.
If multiple qualify, the one minimizing estimated granules read wins; otherwise the query
falls back to base data. EXPLAIN projections = 1 shows which projection was chosen and
why. The tradeoff is write amplification and storage: every projection is rebuilt on
every merge.
Key Papers & Sources
- Pelkonen, Franklin, Teller, Cavallaro, Huang, Meza, Veeraraghavan, "Gorilla: A Fast, Scalable, In-Memory Time Series Database," VLDB 2015 — DoubleDelta + Gorilla codecs.
- Zukowski, Heman, Nes, Boncz, "Super-Scalar RAM-CPU Cache Compression," ICDE 2006 — lightweight bit-packing / FOR codecs underlying T64.
- Abadi, Madden, Hachem, "Column-Stores vs. Row-Stores: How Different Are They Really?," SIGMOD 2008 — late materialization, vectorized columnar execution rationale.
- Stonebraker et al., "C-Store: A Column-oriented DBMS," VLDB 2005 — architectural ancestor (read-optimized + writable store + merges).
- Neumann, "Efficiently Compiling Efficient Query Plans for Modern Hardware," VLDB 2011 — data-centric codegen, relevant to ClickHouse's LLVM expression JIT.
- Shvachko, Kuang, Radia, Chansler, "The Hadoop Distributed File System," MSST 2010 — context for distributed columnar storage at scale.
- O'Neil, Cheng, Gawlick, O'Neil, "The Log-Structured Merge-Tree (LSM-Tree)," Acta Informatica 1996 — the merge discipline MergeTree adapts.
- ClickHouse blog, "How ClickHouse Keeps Data in Order to Enable Fast Queries" — sparse primary index walkthrough.
- ClickHouse technical posts by Alexey Milovidov (architecture, codecs, Keeper).
- ClickHouse source:
src/Storages/MergeTree/,src/Processors/,src/Columns/,src/Compression/,src/Coordination/.