Bookmarks Tech Insights
Tech Insights from Bookmarks
Curated technical bookmarks covering engineering case studies, systems techniques, data storage, and programming languages. Extracted from Chrome bookmarks (March 2026).
Progress: 60/200+ articles detailed | Last batch: 2026-05-26
Legend: entries with `> **Key insights:**` blocks have been read and summarized. Run another batch anytime with: "get details for 10 more articles"
Case Studies
Company engineering blogs, postmortems, architecture deep-dives.
Database & Storage Infrastructure
-
How Uber Conquered Database Overload: From Static Rate-Limiting to Intelligent Load Management -- Uber's evolution from static rate limiting to adaptive database load shedding
Key insights:
- Stateless quota-based rate-limiting failed at scale (Redis dependency, can't track thousands of partitions); shifted shedding to storage layer where context is complete
- Concurrency (in-flight ops) chosen over QPS as primary overload signal: by Little's Law, `Concurrency = Throughput × Latency` maps directly to resource usage
- CoDel adapts queue policy: FIFO under normal load, LIFO under pressure ("newer requests still have a chance to succeed"); prevents wasted work on stale requests
- Cinnamon adds priority tiers (t0-t5): user-facing work protected at t1 while background jobs shed first — priority-aware on top of CoDel's priority-agnostic base
- PID controller treats overload as "dimmer switch" not binary reject; smooths recovery vs static thresholds that cause thundering herd
- Unified engine results: +80% throughput (5400 vs 3000 QPS), -70% P99 latency (1.0s vs 3.1s upserts), -93% goroutine count (10K vs 150K peak), -60% heap (1GB vs 5-6GB spikes)
- BYOS framework: pluggable signals (follower lag, write bytes, mem) feed unified decision loop without core rewrite
- Scorecard layer: per-tenant deterministic concurrency limits isolate noisy neighbors independently of system-wide shedding
- Regulators detect "low-fidelity" overload (large write payloads, partition hotspots, mem pressure) missed by concurrency metric alone
Key insights:
- Uber's Docstore/Schemaless handle tens of millions of req/s across 170M+ MAU; minor overloads cascade across microservices
- Phase 1 (failed): quota-based rate limiting with Redis; fundamentally flawed cost model (full table scan = same cost as single row read)
- Phase 2: CoDel (Controlled Delay) queuing with LIFO under pressure + Scorecard engine for per-tenant concurrency limits
- Phase 3 (Cinnamon): priority-aware load shedder with 6 tiers (t0-t5), PID-based controller for dynamic queue timeout/inflight adjustment
- Phase 4: unified "Bring Your Own Signal" (BYOS) engine with pluggable signals (e.g., follower commit lag)
- Key technique: Little's Law — use concurrency (inflight ops) as overload signal, not QPS
- Results vs token bucket: 80% throughput increase (5400 vs 3000 QPS), 70% P99 latency reduction (1.0s vs 3.1s), 93% fewer goroutines (10K vs 150K peak), 60% lower heap (1GB vs 5-6GB)
- Design principle: place control logic in storage layer where system state is authoritative; fail-fast over queuing
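A minimal sketch of two ideas from this entry: Little's Law as the link between throughput, latency, and in-flight work, and a queue that serves oldest-first until it comes under pressure. The names `AdaptiveQueue` and `pressure_threshold` are hypothetical and do not come from Uber's code, and using queue depth as the pressure signal is a simplification.

```python
from collections import deque


def expected_concurrency(throughput_qps: float, latency_s: float) -> float:
    """Little's Law: average in-flight requests = throughput x latency."""
    return throughput_qps * latency_s


class AdaptiveQueue:
    """FIFO under normal load, LIFO once the backlog signals pressure.

    Assumption: queue depth is the pressure signal. A real CoDel-style
    policy looks at how long requests have waited, not how many there are.
    """

    def __init__(self, pressure_threshold: int) -> None:
        self.pressure_threshold = pressure_threshold  # hypothetical tuning knob
        self._items = deque()

    def push(self, request) -> None:
        self._items.append(request)

    def pop(self):
        if len(self._items) > self.pressure_threshold:
            return self._items.pop()      # LIFO: newest request first
        return self._items.popleft()      # FIFO: oldest request first
```

Serving the newest request under pressure reflects the quoted rationale: a request that has just arrived is more likely to still have a waiting caller than one that has sat in the backlog.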
-
One Stone, Three Birds: Finer-Grained Encryption @ Apache Parquet -- Uber's column-level encryption for Parquet data at rest
Key insights:
- Single column-encryption mechanism solves three orthogonal needs: access control, retention/deletion, encryption-at-rest — avoids three separate systems
- Schema-driven: encryption metadata flows through Hive Metastore (HMS) not per-file RPC to tag store — eliminates excessive remote calls
- Per-column independent keys: each column encrypted with own key; access is "do you hold key K?" — permission enforcement at crypto layer, not app code
- Crypto-shredding for retention: deleting the key turns ciphertext into garbage; no need to rewrite petabyte tables to expire one column
- AES-CTR chosen over AES-GCM: 3-4.5× faster in single-thread Java 9; integrity provided by Parquet checksums at row-group level
- Production overhead with 60% columns encrypted: +5.7% write, +3.7% read — small enough to enable by default
- Parquet-1817 plugin factory enables Spark/Hive/Presto/Flink compatibility without per-engine modification
- Auto-onboarding: tag changes propagate to ingestion pipelines; no manual table-by-table onboarding across PB-scale lake
- Mask-on-deny: users without key get null values instead of hard failure — legacy pipelines keep working
Key insights:
- One encryption mechanism solves three problems: column-level ACL (key permissions = access control), data retention (crypto-shredding — delete master key to render data irrecoverable without rewriting files), and encryption-at-rest
- Double-envelope key hierarchy: Data Encryption Keys (DEKs, per file/column) → Key Encryption Keys (KEKs, cached in Spark executors) → Master Encryption Keys (MEKs, in KMS); KMS contacted only once per MEK per executor, not per file
- Schema-driven auto-onboarding: tagging metadata propagated into Parquet schema itself; crypto retriever plugin reads tags at write time — no per-file RPC to tagging service
- Two algorithm modes: AES-GCM (authenticated encryption, 5.7% write / 3.7% read overhead) vs AES-GCM-CTR (metadata-only auth, 3–4.5× faster than full AES-GCM)
- Key rotation modifies only file footer (re-wrap DEKs/KEKs with new MEKs), not data pages — avoids re-encrypting column data
- Encryption transparent to Parquet optimizations: columnar projection, predicate pushdown, encoding, compression all continue to work on encrypted files
- Backfilling petabytes of historical data was hardest operational challenge; built 20× faster encryption tooling for re-encryption
- Access denial enforced at format level across all query engines (Spark, Hive, Presto); optionally null-mask sensitive values instead of hard failure
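The per-column key, crypto-shredding, and mask-on-deny ideas above can be illustrated with a toy model. This is a sketch only: `ColumnStore` is a hypothetical name, the in-memory `keys` dict stands in for a KMS, and the cipher is a stand-in built from SHA-256 so the example needs no third-party packages. It is not AES and is not safe for real data.

```python
import hashlib
import secrets


def _keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy stream cipher (SHA-256 in counter mode). NOT AES; illustration only."""
    out = bytearray()
    for block in range(0, len(data), 32):
        pad = hashlib.sha256(key + block.to_bytes(8, "big")).digest()
        out.extend(b ^ p for b, p in zip(data[block:block + 32], pad))
    return bytes(out)


class ColumnStore:
    """One independent key per column; holding the key is the permission."""

    def __init__(self) -> None:
        self.keys = {}      # column name -> key (assumption: stands in for a KMS)
        self.columns = {}   # column name -> list of ciphertext values

    def write(self, column: str, values: list) -> None:
        key = self.keys.setdefault(column, secrets.token_bytes(32))
        self.columns[column] = [_keystream_xor(key, v.encode()) for v in values]

    def read(self, column: str, granted: set) -> list:
        key = self.keys.get(column)
        if key is None or column not in granted:
            # Mask-on-deny: return nulls instead of raising.
            return [None] * len(self.columns[column])
        return [_keystream_xor(key, c).decode() for c in self.columns[column]]

    def shred(self, column: str) -> None:
        """Crypto-shredding: drop the key, leave the ciphertext untouched."""
        self.keys.pop(column, None)
```

After `shred("email")`, the stored bytes for that column are still present but no reader can recover them, which is why retention can be enforced without rewriting the files.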
-
How Uber Indexes Streaming Data with Pull-Based Ingestion in OpenSearch -- Pull-based streaming data indexing at Uber
Key insights:
- Core idea: replace OpenSearch's push-based translog with native pull from Kafka/Kinesis; cluster focuses on indexing, Kafka owns durability
- Each OpenSearch shard maps 1:1 to a stream partition; StreamPoller + IngestionPlugin interface handles source-specific consumer logic
- Blocking queue decouples consumer and processor for throughput; optional document-ID partitioning parallelizes writes
- IngestionEngine replaces translog with a no-op; stores `_BatchStartPointer` (min offset across active writers) with every Lucene commit for recovery
- Recovery: init → retrieve last `_BatchStartPointer` → rewind consumer → replay; prevents data loss and duplicate indexing on replica promotion
- External versioning supports out-of-order delivery: users set doc version in message; at-least-once processing + versioning = consistent views
- Error policies: Drop (discard + advance) or Block (retry indefinitely)
- Two replication modes: Segment Replication (primary ingests, replicas download via remote store — efficient but slight lag) vs All-Active (every shard ingests independently — zero lag, higher CPU)
- Regional clusters consume from globally replicated Kafka topics; each region holds a full copy for failover
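To see why rewinding to a stored pointer and replaying is safe when combined with external versioning, consider this minimal sketch. The message shape (`id`, `version`, `body`) and the function names are assumptions made for illustration, not OpenSearch APIs.

```python
def apply(index: dict, message: dict) -> None:
    """Apply one message; keep the document with the highest external version."""
    doc_id, version = message["id"], message["version"]
    current = index.get(doc_id)
    if current is None or version > current["version"]:
        index[doc_id] = {"version": version, "body": message["body"]}


def replay(partition: list, batch_start_pointer: int, index: dict) -> dict:
    """Rewind to the pointer stored with the last commit and re-apply from there."""
    for message in partition[batch_start_pointer:]:
        apply(index, message)
    return index
```

Because a message only wins when its version is strictly higher, re-applying offsets that were already indexed before the crash changes nothing, and late-arriving older versions are ignored.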
-
Uforwarder: Uber's Scalable Kafka Consumer Proxy -- Push-based Kafka consumer proxy for event-driven microservices at scale
Key insights:
- Replaces pull-based Kafka client SDKs with gRPC push interface; centralizes offset management so application services need no Kafka library
- Scale: 1000+ downstream consumer services, trillions of messages/day, multiple PB/day of data
- Out-of-order commit tracker prevents head-of-line blocking: stuck message routes to DLQ while the commit pointer advances independently
- Context-aware routing via Kafka headers: infrastructure-level decisions (region, env, isolation) replace app-level filter code
- Auto-rebalancer reacts to CPU/memory/throughput signals continuously, redistributing partitions during traffic spikes without manual intervention
- DelayProcessManager enables per-partition pause/resume — selective backpressure isolates slow consumers without freezing the whole stream
- Eliminates bespoke delay/retry semantics in each service; one proxy implements the patterns once, all consumers inherit them
- Trade-off: extra gRPC hop adds latency vs direct Kafka client; justified by operational simplification at thousand-service scale
- Pattern: Kafka-proxy-as-platform is the natural successor to per-team Kafka client libraries when consumer count crosses ~100
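The out-of-order commit idea can be sketched as a small tracker that records completed offsets and only advances the commit position across a contiguous run. `CommitTracker` and its method names are hypothetical; this is one way to model the behaviour described above, not Uforwarder's implementation.

```python
class CommitTracker:
    """Tracks per-offset completion and exposes the highest safe commit offset."""

    def __init__(self, start_offset: int) -> None:
        self.next_to_commit = start_offset  # lowest offset not yet committed
        self._done = set()                  # offsets finished out of order
        self.dead_letters = []              # offsets handed to the DLQ

    def ack(self, offset: int) -> int:
        """Mark an offset processed and advance past any contiguous run."""
        self._done.add(offset)
        while self.next_to_commit in self._done:
            self._done.remove(self.next_to_commit)
            self.next_to_commit += 1
        return self.next_to_commit

    def dead_letter(self, offset: int) -> int:
        """A stuck message goes to the DLQ and counts as done for commit purposes."""
        self.dead_letters.append(offset)
        return self.ack(offset)
```

In this model a slow message at offset 0 holds the commit position only until it is dead-lettered; offsets 1 and 2, already acknowledged, are then committed in one step.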
-
Automating RDS Postgres to Aurora Postgres Migration (Netflix) -- Netflix's automated large-scale PostgreSQL migration to Aurora
Key insights:
- Fleet of ~400 PostgreSQL clusters; manual migration unscalable — built fully automated self-service workflow requiring zero database credentials and zero application code changes
- Chose Aurora Read Replica approach over snapshot-based: continuous async replication keeps replica in sync, enabling validation while production traffic flows; trades implementation complexity for shorter downtime
- Data Access Layer (DAL) architecture: apps → forward proxy (mTLS) → Data Gateway (Envoy reverse proxy) → database; cutover is config change in proxy layer, not app change
- Quiescence: instruct users to halt app traffic, then enforce at infra layer by detaching RDS security groups + instance reboot — forcibly terminates all connections without needing DB credentials
- Replication lag validation subtlety: OldestReplicationSlotLag never settles at zero — oscillates 0↔64MB every ~5 min due to WAL segment rotation (archive_timeout=300s); 0 moment confirms full catch-up
- Lag formula: `pg_current_wal_lsn() - restart_lsn`; new WAL segment advances current position by one segment (64MB) before Aurora consumes it
- Cutover: promote Aurora read replica to standalone writable cluster, update Envoy Data Gateway routing config; all client connections transparently rerouted
- Full ecosystem parity: parameter groups, read replicas, replication slots all migrated to preserve functional equivalence
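The lag formula above is a subtraction of two log sequence numbers. As a small worked sketch, assuming LSNs arrive in PostgreSQL's textual `high/low` hexadecimal form (the helper names are illustrative):

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def slot_lag_bytes(current_wal_lsn: str, restart_lsn: str) -> int:
    """Lag as described above: pg_current_wal_lsn() - restart_lsn."""
    return lsn_to_bytes(current_wal_lsn) - lsn_to_bytes(restart_lsn)


# 64MB segment size, taken from the summary above.
WAL_SEGMENT_BYTES = 64 * 1024 * 1024
```

With these helpers, a slot whose `restart_lsn` is `2/0` while the server is at `2/4000000` is exactly one 64MB segment behind, which matches the oscillation described in the entry. Inside the database, `pg_wal_lsn_diff()` performs the same subtraction.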
-
Stripe's DocDB: Zero-Downtime Data Movement for Trillion-Dollar Payments -- Stripe's document database powering zero-downtime payment processing
-
Pinterest's CDC-Powered Ingestion Slashes Database Latency from 24 Hours to 15 Minutes -- Pinterest replacing batch ingestion with CDC for near-real-time data pipelines
Key insights:
- Old system: multiple independent batch pipelines with full-table dumps; 24+ hour latency despite only ~5% of rows changing daily; no row-level delete support
- New stack: Debezium/TiCDC → Kafka → Flink → Spark → Iceberg; two table types: CDC tables (append-only ledgers, sub-5-min latency) and Base tables (snapshots via Spark MERGE INTO, 15-min to 1-hour cadence)
- Standardized on Merge-on-Read (MoR) over Copy-on-Write: MoR writes deltas to separate files, resolves at query time — reduces write amplification and storage costs at petabyte scale
- Hash-based primary key bucket partitioning via Iceberg enables parallel upserts; ~100 buckets reduce per-task overhead
- At-least-once delivery with natural deduplication: MERGE INTO is idempotent on primary key (last-writer-wins), no explicit dedup infrastructure needed
- Bootstrap pipeline loads historical data initially; maintenance jobs handle compaction and snapshot expiration
- Config-driven onboarding supports MySQL, TiDB, KVStore; thousands of active pipelines across petabyte-scale data
- Results: latency 24h → 15min, compute costs slashed by processing only changed 5% of rows
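A rough sketch of why at-least-once delivery needs no separate dedup step when the merge is keyed and last-writer-wins, plus the bucket assignment that lets upserts run in parallel. The event fields (`pk`, `seq`, `op`, `row`) are hypothetical, and CRC32 is only a stand-in for whatever hash the table format applies.

```python
import zlib

NUM_BUCKETS = 100  # roughly the bucket count mentioned above


def bucket_for(primary_key: str) -> int:
    """Stable bucket assignment; CRC32 stands in for the table format's hash."""
    return zlib.crc32(primary_key.encode()) % NUM_BUCKETS


def merge_into(base: dict, cdc_events: list) -> dict:
    """Upsert/delete by primary key; the highest sequence number wins."""
    for event in cdc_events:
        key, seq = event["pk"], event["seq"]
        if key in base and base[key]["seq"] >= seq:
            continue  # duplicate or stale delivery: no effect
        base[key] = {"seq": seq, "row": event["row"], "deleted": event["op"] == "delete"}
    return base


def live_rows(base: dict) -> dict:
    """Rows visible to readers: everything not marked deleted."""
    return {k: v["row"] for k, v in base.items() if not v["deleted"]}
```

Applying the same batch twice yields the same visible rows, and all events for one primary key land in the same bucket, so each bucket can be merged by an independent task.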
-
Contributing to Debezium: Fixing Logical Replication at Scale (Zalando) -- Zalando fixing Debezium CDC logical replication under heavy load
Key insights:
- Core conflict: Debezium's offset store and PostgreSQL's replication slot diverge in position tracking; connector fails with "Saved offset is before replication slot's confirmed lsn" forcing full re-syncs
- Root cause: Debezium 2.7.4+ hard-coded `withAutomaticFlush(false)`, disabling JDBC driver's keepalive LSN flush that Zalando depended on to prevent WAL pile-up on low-activity databases
- Contribution 1 (`lsn.flush.mode`, PR #6881): three modes: `manual`, `connector` (default), `connector_and_driver` (both flush, preventing WAL growth on idle tables)
- Contribution 2 (`offset.mismatch_strategy`, PR #6948): four strategies: `no_validation`, `trust_offset`, `trust_slot` (PostgreSQL slot authoritative), `trust_greater_lsn` (bidirectional sync using max LSN)
- Zalando's architecture differs: Patroni + custom Postgres Operator with ephemeral MemoryOffsetBackingStore, trusting slots as source of truth; most users trust persistent Kafka offset store instead
- Scale: 100+ Kubernetes clusters processing hundreds of thousands of events/second; zero detected data loss over nearly two years with billions of events processed
- `trust_greater_lsn` enables self-healing from slot/offset mismatches, reducing manual intervention in production
- Shipped in Debezium 3.4.0.Final (December 2025)
-
ClickPy at 2 Trillion rows: Scaling ingestion -- ClickHouse scaling Python package analytics to 2 trillion rows
Key insights:
- 2.21 trillion rows of Python package downloads from 2011+; pipeline: BigQuery → GCS → ClickPipes → staging DB → production DB
- ClickPipes replaced hand-rolled cron+ClickLoad: built-in retries, backoff, failure handling, and pipeline state tracking vs manual retry logic
- Null Engine + Materialized View pattern: ClickPipes writes to Null engine table (data doesn't persist), single MV handles schema normalization and type conversion before writing to main table
- Hot swap migration: cloned 14 tables + MVs to staging, ran both pipelines in parallel comparing daily row counts, then clean cutover
- Schema optimizations: LowCardinality strings for country/type/installer, Enum8 for CI field, Tuple nesting for file metadata, derived fields via splitByChar+arraySlice
- 13 separate materialized views pre-compute aggregations by different dimensions (daily, by version, by installer, by country)
- Historical data repair via lightweight DELETEs on multi-trillion-row tables; daily-grouped MVs auto-repopulate, non-daily MVs require drop/re-ingest/recreate cycle
- Discovered silent historical discrepancies between BigQuery source and ClickHouse only through systematic comparison
-
A 2.5x faster Postgres parser with Claude Code -- Multigres engineering a faster PostgreSQL parser
Key insights:
- Pure Go implementation replaces pg_query_go's cgo wrapper: eliminates cross-compilation pain, platform-specific builds, cgo runtime overhead
- Ports the real Postgres grammar verbatim, not a simplified variant — avoids perpetual catch-up with PG syntax updates
- AI excels at translation (PG source → Go yacc) but errs on invention (deparsing logic without reference); discipline of using existing artifacts matters
- Project state lived in markdown files (checklists, phase docs, session summaries) — Claude's own memory was insufficient for multi-week project
- Speedup from 1 year → 8 weeks came from expert code review catching systematic AI errors (wrong type signatures, symptom-fixing, missing edge cases) — not autonomous generation
- 71.2% coverage via porting PG's own regression suite (thousands of decade-spanning queries) — validates "Postgres-compatible grammar" claim
- Benchmarks: 2-3× faster per query (1.6µs vs 3.1µs simple SELECT), 2.5× faster full suite (145ms vs 366ms)
- Mechanical work (translation, test code, AST node generation) delegated to AI; architectural work (grammar debug, design) kept with humans
- Lesson: "fast output means nothing if output is wrong" — every grammar rule manually compared to PG source, every test failure investigated
Key insights:
- Pure Go PostgreSQL parser (no cgo) — rejected pg_query_go because cgo creates cross-compilation complexity, platform-specific builds, and per-call overhead on hot-path parsing
- Performance: simple SELECT 1.6μs vs 3.1μs (2×), complex SELECT 3.2μs vs 11.0μs (3.5×), CREATE TABLE 7.7μs vs 26.4μs (3.5×); full regression suite 145ms vs 366ms = 2.5× faster
- 287,786 lines across 304 files ported from PostgreSQL grammar to Go in 8 weeks (1 engineer + Claude); previous MySQL parser (Vitess) took over a year with a team
- Key AI insight: "Claude is much better at translating existing logic than inventing new logic correctly" — grammar translation (has reference) had low error rate; deparsing (no reference) required much more debugging
- Coordination system critical: markdown checklists tracking AST struct ports, grammar rules, test coverage (71.2%); session documents for cross-conversation continuity
- Expertise verification caught recurring Claude mistakes: wrong types "fixed" via unnecessary conversion functions, grammar rules subtly accepting invalid SQL
- Bottleneck shifted from implementation speed to decision quality and verification rigor
- Ported PostgreSQL's own regression tests (thousands of queries) for edge case validation
-
VACUUM FULL Locked Our Database for 14 Hours on Black Friday -- Production incident: Postgres VACUUM FULL during peak traffic
Key insights:
- Trigger: 84% dead tuples in `orders` table; engineer ran `VACUUM FULL` at 2:14 AM on Black Friday; 14-hour lockout → ~$340K lost revenue
- VACUUM FULL takes `ACCESS EXCLUSIVE` lock: blocks all SELECT/INSERT/UPDATE/DELETE; rewrites entire table row-by-row; ~4h on 180GB table
- Key difference from regular VACUUM: regular VACUUM marks dead tuples reusable without blocking reads or writes; VACUUM FULL rewrites, reclaims disk at OS level
- Cannot be cleanly cancelled (leaves partial rewrites); not transactional; `pg_cancel_backend()` ineffective
- Duration is roughly constant regardless of bloat ratio; the test run's 4h estimate was misleading
- Fix 1: tune autovacuum: 5% scale_factor instead of 20%, higher cost limits, naptime=10s
- Fix 2: adopt `pg_repack`: rebuilds tables/indexes without `ACCESS EXCLUSIVE`, online operation
- Fix 3: partition time-series data; drop old partitions instead of deleting rows
- Process: require CTO approval for VACUUM FULL, prohibit during peak, add bloat monitoring
- Design lesson: 10–20% bloat is acceptable; disk is cheaper than downtime
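To make the scale-factor change in Fix 1 concrete, the sketch below applies PostgreSQL's documented autovacuum trigger rule (base threshold plus scale factor times row count). The table size is a hypothetical figure chosen for illustration, not a number from the article.

```python
def autovacuum_trigger(reltuples: int, scale_factor: float, base_threshold: int = 50) -> float:
    """Dead tuples needed before autovacuum processes a table.

    PostgreSQL's documented rule:
    autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples
    """
    return base_threshold + scale_factor * reltuples


ROWS = 100_000_000  # hypothetical table size, not from the article
default_trigger = autovacuum_trigger(ROWS, 0.20)  # stock 20% setting
tuned_trigger = autovacuum_trigger(ROWS, 0.05)    # the 5% setting from Fix 1
```

On a table of this size the stock setting waits for roughly 20 million dead tuples before vacuuming, while the tuned setting starts at roughly 5 million, so bloat is cleaned up in smaller and more frequent passes.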
-
Our Database Had 500 Million Rows, Deleting 100 Million Took 6 Days -- Lessons on bulk delete performance in large production databases
Key insights:
- MVCC overhead: PostgreSQL marks rows deleted (dead tuples) rather than removing them immediately; dead tuples consume disk and degrade scans
- Single DELETE: massive lock contention, WAL flood, all indexes updated per row — killed after 6h with zero rows removed
- Batched DELETE degradation: batch 1 took 2s, batch 100 took 23s, batch 300 took 60+s — subquery re-scans increasingly bloated table
- VACUUM after batches: found 5M dead tuples after partial deletion; regular VACUUM doesn't reclaim OS disk; VACUUM FULL causes outages
- Index maintenance multiplies I/O: `created_at` and other indexes require update per row deleted
- Winning approach: create new table with `PARTITION BY RANGE(created_at)`, insert only retained rows, atomic swap during maintenance window, drop old table; avoids fighting MVCC entirely
- Design lesson: partition by time at schema design time; `pg_partman` for automation; then dropping a partition takes milliseconds vs days of DELETE
-
When an Aurora PostgreSQL Major Upgrade Fails -- Debugging a hidden view blocking Aurora PostgreSQL upgrade
Key insights:
- Aurora PG 15→17 in-place upgrade halted during `pg_restore` with: `ERROR: column reference 'query_id' is ambiguous`
- Root cause: custom monitoring view `pg_stat_activityenric` built on `pg_stat_get_activity()` using explicit PG15 column list; PG16+ expanded the function's output columns, causing `query_id` name collision
- The view existed across multiple databases, so removing it from one DB wasn't enough; `pg_upgrade` hit identical incompatible definitions in others
- Diagnosis: `SELECT * FROM pg_catalog.pg_views WHERE viewname = 'pg_stat_activityenric'` in every database
- Fix: drop the view from every database before upgrade; recreate using PG16+-compatible column references post-upgrade
- Lesson: custom views on internal PostgreSQL system functions (`pg_stat_*`) require compatibility audits before major version upgrades; avoid explicit column lists tied to system function output
-
Unlocking 3x Write Performance: Cloud SQL MySQL Optimizations -- Google Cloud tripling MySQL write throughput
(article unavailable — fetch failed)
-
How We Solved a Critical Race Condition in Banking Systems -- Debugging concurrency bugs in production banking
Platform & Infrastructure
-
Debugging a FUSE deadlock in the Linux kernel (Netflix) -- Kernel-level FUSE deadlock root cause analysis
Key insights:
- Netflix uses FUSE filesystems for container image layers; deadlock caused containers to hang indefinitely on file operations
- FUSE architecture: kernel VFS → FUSE kernel module → userspace daemon; requests queued in kernel, daemon reads /dev/fuse, processes, writes response back
- Deadlock scenario: FUSE daemon itself triggers a VFS operation on the same FUSE filesystem while handling a request — kernel holds inode lock waiting for daemon response, daemon blocks waiting for inode lock
- Debugging methodology: crash dumps, /proc/PID/stack for blocked threads, ftrace to trace kernel lock acquisition chains
- Root cause in specific kernel code path where page cache invalidation during FUSE writeback took inode mutex, then re-entered FUSE for metadata — circular dependency
- Fix required kernel patch to avoid holding inode mutex across FUSE round-trips; contributed upstream to Linux kernel
- Key lesson: userspace filesystem daemons must never re-enter the same filesystem they serve, or kernel must not hold locks across FUSE calls
-
Migrating Millions of Concurrent Websockets to Envoy (Slack) -- Slack's WebSocket infrastructure migration to Envoy proxy
Key insights:
- Old setup: HAProxy across multiple AWS regions; required "hot restarts" on every backend endpoint change, complex lifecycle management
- Why Envoy: dynamically configured clusters/endpoints (no reloads), zone-aware routing, passive health checking, panic routing
- Migration strategy: parallel Envoy stack alongside HAProxy, gradual weighted DNS shift (10% -> 25% -> 50% -> 75% -> 100%) over 6 months
- Config managed via Chef libraries generating Envoy YAML programmatically; intentionally supported only used features initially
- Extracting "important" HAProxy config from accumulated tech debt was hardest part; undocumented behavioral dependencies needed replication
- Subtle issues: broke daily active user metrics temporarily; "load balancer behavior is complex" with no shortcut around debugging
- Lacked pre-migration automated tests; discovered expected behaviors through service owner consultation
- Result: complete HAProxy replacement with zero customer impact; subsequently exceeded previous peak load with no issues
-
How Dropbox Designed ATF: an Async Task Framework -- Dropbox's distributed async task scheduling system
Key insights:
- Six components: Frontend (RPC) → Task Store (Edgestore) → Store Consumer → SQS → Controller → Executor + heartbeat status controller
- At-least-once execution: tasks retry until Success or FatalFailure; pull-based polling (controllers/executors long-poll) reduces coupling vs push
- Scale: 9000 async tasks/sec, 100+ use cases across 28 teams; 95% start within 5 s of schedule time
- Tasks claim exclusive "Claimed" state to prevent overlap; HSC kills executors after 3 failed heartbeats — zombie protection
- Per (lambda, priority) pair gets dedicated SQS queue (95 total); lambda owners control their own worker clusters and capacity
- Idempotence mandatory in user lambdas — framework explicitly does not solve dedup; pushes correctness burden to callback authors
- Exponential backoff for retriable failures; timeouts at enqueue/claim/heartbeat each trigger automatic retry independently
- Isolation via dedicated clusters, queues, and quotas per lambda — prevents resource contention between independent task types
- Edgestore (Dropbox's metadata DB) backs task state; SQS handles work distribution — clean split of state-of-truth vs work queue
Key insights:
- Six components: Frontend (RPC), Task Store (Edgestore metadata), Store Consumer (polling), Queue (AWS SQS), Controller (per-worker polling), Executor, Heartbeat/Status Controller
- Pull-based model: controllers and executors long-poll for work rather than being pushed, reducing coupling
- Scale: 9,000 async tasks/sec, 100+ use cases across 28 engineering teams; 95% of tasks begin within 5 seconds of scheduled time
- At-least-once execution: tasks retry until Success/FatalFailure; requires idempotent lambdas since tasks may execute multiple times
- No concurrent execution: tasks claim exclusive state; HSC kills executors after 3 failed heartbeats to prevent overlap
- Each lambda-priority pair gets dedicated SQS queue (95 total); lambda owners control their worker clusters, deployments, capacity
- Exponential backoff for retriable failures; timeouts at enqueue, claim, and heartbeat stages trigger automatic retries
- Isolation: dedicated clusters, queues, and scheduling quotas per lambda prevent resource contention
-
How Spotify Built Its Data Platform To Understand 1.4 Trillion Data Points -- Spotify's data platform for processing trillions of events
-
How Tailscale works -- Architecture of Tailscale's WireGuard-based mesh VPN
Key insights:
- Separation of concerns: centralized coordination server (control plane: auth, key distribution, ACL, network maps) + full mesh of WireGuard tunnels (data plane: peer-to-peer encrypted UDP)
- Key exchange via Noise IK over X25519; coordination server is shared drop box for WireGuard public keys — never sees plaintext traffic
- DERP (Designated Encrypted Relay for Packets): custom relay over HTTP replacing TURN; relays encrypted WireGuard packets; every connection starts via DERP, upgrades to direct UDP after NAT traversal succeeds
- Custom DISCO protocol for NAT traversal: NaCl box authenticated UDP path probing; achieves >90% direct P2P connection rate, DERP relay rarely needed for sustained data
- End-to-end encryption regardless of path: DERP relays forward opaque ciphertext, never possess decryption keys (Curve25519, ChaCha20-Poly1305)
- ACLs defined centrally (JSON/HuJSON policy language), pushed to each node in network map; nodes enforce locally in WireGuard filter rules — cryptographically enforced (no key = no connection)
- MagicDNS: automatic human-readable hostnames + Let's Encrypt TLS certificates for every device in tailnet without manual cert management
- Hybrid topology: hub-and-spoke control (persistent connections to coordination server) + full mesh data (direct WireGuard tunnels, no central bottleneck)
-
How WebSockets Cost Recall.ai $1M on AWS -- Postmortem on expensive WebSocket architecture on AWS
Key insights:
- Meeting bots used WebSockets over localhost to transport raw video from headless Chromium to encoder — seemed reasonable for IPC but catastrophically inefficient at scale
- WebSocket fragmentation: Chromium fragments messages >131KB into frames; single 1080p raw frame (3.1MB) = 24 fragments with reassembly overhead
- WebSocket masking: spec mandates XOR masking on all client-to-server data — extra pass over every byte at 150MB/s throughput (p99 bot bandwidth)
- CPU profiling revealed dominance of `__memmove_avx_unaligned_erms` and `__memcpy_avx_unaligned_erms`: excessive memory copying throughout transport
- Evaluated alternatives: TCP/IP rejected (1500-byte MTU fragmentation + kernel-space copying); Unix domain sockets rejected (user-to-kernel transitions)
- Solution: custom lock-free multi-producer single-consumer ring buffer in shared memory; three pointers (write, peek, read) enabling zero-copy reads
- Implementation details: atomic operations for thread-safety, named semaphores for signaling, variable-sized frame support, Chromium sandbox-compatible
- Impact: bot CPU 4 cores → 2 cores (50% reduction) = over $1M annual AWS savings; scale context: 1TB video/second across infrastructure
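The fragmentation and masking costs above can be checked with a short calculation. Two assumptions are made here that the summary does not state: the raw frame uses a 12-bits-per-pixel layout such as I420, and the ">131KB" limit is read as 128 KiB. The masking function follows the XOR rule from the WebSocket specification (RFC 6455).

```python
import math


def fragment_count(payload_bytes: int, max_fragment_bytes: int) -> int:
    """Number of frames needed when messages are split at a size limit."""
    return math.ceil(payload_bytes / max_fragment_bytes)


def mask_payload(payload: bytes, masking_key: bytes) -> bytes:
    """RFC 6455 client-to-server masking: XOR each byte with key[i % 4]."""
    return bytes(b ^ masking_key[i % 4] for i, b in enumerate(payload))


# Assumption: raw 1080p frame in a 12-bits-per-pixel format such as I420.
FRAME_BYTES = 1920 * 1080 * 3 // 2
# Assumption: the ">131KB" limit read as 128 KiB.
FRAGMENT_LIMIT = 131_072
```

Under these assumptions one frame is 3,110,400 bytes and splits into 24 fragments, consistent with the figures quoted above. Masking touches every one of those bytes once on send and once more on receive, since unmasking is the same XOR.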
-
How Okta Scaled From 12 to 1,000 Kubernetes Clusters With Argo CD -- Okta's Kubernetes fleet scaling with GitOps
-
Pinterest's Moka: Kubernetes Rewriting Rules of Big Data Processing -- Pinterest migrating big data workloads to Kubernetes
Key insights:
- Moka = Pinterest's EKS-based unified big-data platform replacing Hadoop YARN clusters; runs Spark, Flink, Ray on Kubernetes with single control plane
- YuniKorn scheduler used instead of stock kube-scheduler: hierarchical queues, gang scheduling, fair sharing — restores YARN-like multi-tenancy semantics
- Fluent Bit + OpenTelemetry pipeline replaces YARN log aggregation; per-pod structured logging shipped to central store
- ARM Graviton support adds ~20% cost reduction for batch workloads vs equivalent x86 instances
- Karpenter for autoscaling: bin-packs jobs onto right-sized spot nodes; faster than Cluster Autoscaler's ASG-based provisioning
- Migration approach: dual-write to YARN and Moka, validate parity, cut over per-workload; avoided big-bang switch
- Container image caching critical at scale: pre-warmed Spark images on nodes eliminates pull latency during gang scheduling
- Lesson: Kubernetes as big-data substrate is viable but requires non-default scheduler + dedicated logging/observability stack
-
Reducing Onboarding from 48 Hours to 4: Amazon Key's Event-Driven Platform -- Amazon Key's event-driven architecture redesign
Key insights:
- Migrated from synchronous REST orchestration to event-driven via single EventBridge bus shared across accounts; cross-account event routing replaces direct service calls
- Onboarding time: 48 hours → 4 hours (12× reduction); driven by self-service event subscriptions instead of bespoke integration code per partner
- CDK-based infrastructure automation: each consumer defines event filters declaratively; rules + targets + IAM provisioned in single deployment
- Throughput: ~2000 events/sec sustained, P90 latency ~80ms end-to-end across multi-account hops, 99.99% delivery success
- Schema registry enforces contract evolution; producers can't break consumers via uncoordinated payload changes
- DLQ + replay tooling per consumer enables independent failure recovery without affecting peer subscribers
- Tradeoff: debugging eventual-consistency flows harder than sync request/response; invested in distributed tracing (X-Ray) as compensation
- Pattern reusable: single shared bus + cross-account access + schema registry is the production blueprint for EventBridge at scale
-
How Slack Achieved Operational Excellence for Spark on Amazon EMR -- Slack's Spark operational improvements on EMR
-
We Moved from AWS to Hetzner, Cut Costs 89% -- Real-world cost comparison: AWS to bare metal
Key insights:
- AWS monthly: 6× t3.medium ($1200) + RDS db.t3.large ($850) + LB ($180) + data transfer ($650) + S3 ($120) + CloudWatch ($380) + NAT Gateway ($220) + misc ($600) = $4,200/month
- Hetzner monthly: 6× CAX11 equivalent ($280) + managed PG ($90) + LB ($15) + 1TB bandwidth included + 500GB storage ($25) = $410/month (+ Cloudflare $20) = ~$470/month
- Savings: $45,600/year (89% reduction); Hetzner CAX11 has dedicated CPU + NVMe vs t3.medium's shared CPU
- Zero-downtime migration: week 1 infra setup → week 2 DB migration (export/import + replication) → week 3 gradual DNS shift 10→50→100% → week 4 AWS shutdown
- Problems hit: new Hetzner IPs flagged as spam (SPF/DKIM warmup needed), 100K req/s DDoS attack (required Cloudflare), manual backup scripting, self-managed Grafana+Prometheus
- Lost: managed services (ElastiCache, SQS, Lambda, EventBridge), global regions (limited to DE/FI/US), auto-scaling, built-in DDoS protection, AWS support
- Gained: predictable billing, dedicated CPU, included bandwidth, full control
-
Migrating 40 Lambdas to Containers, AWS Bill Down 73% -- Cost and architecture tradeoffs: Lambda to containers
Networking & Load Balancing
-
Examining Load Balancing Algorithms with Envoy -- Comparison of load balancing strategies (round-robin, least-request, ring hash, Maglev)
(article unavailable — SSL certificate error)
-
High Availability Load Balancers with Maglev (Cloudflare) -- Google's Maglev consistent hashing for L4 load balancing
Key insights:
- Maglev scheduler: consistent hashing on 5-tuple (protocol, src IP, src port, dst IP, dst port) → same backend selected by any LB without shared state
- HA via statelessness: routers use BGP + ECMP hashing to distribute across multiple LB instances; all LBs apply identical Maglev hash → traffic always reaches correct backend even after LB failover
- Graceful maintenance: operator withdraws BGP session, traffic transparently shifts to remaining LBs with zero disruption
- Ungraceful failure: BGP keepalive timeout triggers router to terminate session; BFD could reduce delay but incompatible with L2 aggregation/VXLAN
- Direct Server Return (DSR) via Foo-Over-UDP encapsulation: return traffic bypasses LBs entirely — LBs only process inbound
- IPVS configured with Maglev scheduler at kernel level; stateless by design eliminates connection synchronization between LBs
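A minimal sketch of how a Maglev-style lookup table lets independent load balancers agree on a backend without sharing state. The table size (251), the SHA-256-based hash, and the function names `build_table` and `pick_backend` are illustrative assumptions, not the IPVS implementation.

```python
import hashlib


def _h(key: str, salt: str) -> int:
    # Hypothetical hash choice: any stable hash shared by all LBs would do.
    digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def build_table(backends, size=251):
    """Populate a Maglev-style lookup table. `size` must be prime."""
    if not backends:
        raise ValueError("need at least one backend")
    names = sorted(backends)  # same order on every LB, whatever the config order
    offsets = [_h(b, "offset") % size for b in names]
    skips = [_h(b, "skip") % (size - 1) + 1 for b in names]
    next_idx = [0] * len(names)
    table = [None] * size
    filled = 0
    while filled < size:
        for i, name in enumerate(names):
            # Walk backend i's own permutation until it finds a free slot.
            while True:
                slot = (offsets[i] + next_idx[i] * skips[i]) % size
                next_idx[i] += 1
                if table[slot] is None:
                    break
            table[slot] = name
            filled += 1
            if filled == size:
                break
    return table


def pick_backend(table, five_tuple):
    """Map a flow's 5-tuple to a backend via the shared table."""
    key = "|".join(str(part) for part in five_tuple)
    return table[_h(key, "flow") % len(table)]
```

Because the table depends only on the backend set, two load balancers that build it separately route the same 5-tuple to the same backend, which is what makes failover between them transparent.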
-
Andromeda: Performance, Isolation, and Velocity at Scale (Google, NSDI'18) -- Google's production network virtualization stack
Serverless & Compute
-
Cloud Computing Without Containers (Cloudflare) -- V8 isolate-based serverless as a container alternative
Key insights:
- V8 Isolates replace containers/VMs as isolation boundary: each tenant runs in a lightweight V8 execution context (same sandbox as Chrome tabs), not a full process/container/VM
- Sub-millisecond cold starts (many under 1ms) vs hundreds of ms for containers or seconds for VMs; eliminates cold start as a meaningful concern
- Memory overhead ~1-5 MB per isolate vs ~35+ MB per container; enables thousands of tenants per process — critical for economic viability at 200+ edge PoPs
- Security model: V8's battle-tested sandbox (no cross-isolate memory access, no syscalls, CPU/memory caps) + process-level seccomp + separate isolate groups as defense in depth
- No filesystem, no network sockets, no native code: API surface restricted to Service Workers spec (fetch, crypto, streams, KV bindings) — eliminates path traversal, SSRF, native code exploit classes
- Anycast routing: code runs at nearest PoP (all 200+ locations simultaneously), no region selection; single-digit-ms latency to end users globally
- Per-request billing model enabled by near-zero isolate startup cost — fundamentally different economics vs per-container-hour
- Tradeoff: no long-lived connections or persistent in-memory state; must use external services (Durable Objects, Workers KV, R2) for stateful workloads
- WASM support extends model beyond JavaScript: Rust/C/C++/Go via WASM in same isolate sandbox with same cold-start properties
-
Eliminating Cold Starts 2: Shard and Conquer (Cloudflare) -- Sharding strategy to eliminate serverless cold starts
Key insights:
- Problem: complex Workers with 10MB scripts now have cold starts longer than TLS handshakes (up to 400ms CPU time); direct optimization insufficient
- Solution: consistent hash ring maps script IDs to "home" shard servers; requests routed to the server most likely to have a warm instance
- Optimistic routing: requests sent without pre-approval; if shard server refuses, returns client's own "lazy capability" (Cap'n Proto RPC loopback reference) — stops sending bytes immediately
- Cap'n Proto distributed object model: context stacks (ownership overrides, resource limits, feature flags) serialize for cross-machine transmission; trace data consolidates via capabilities
- Results: 10× reduction in eviction rate globally; Enterprise warm request rate improved from 99.9% to 99.99%; cold starts dropped from 0.1% to 0.01%
- Only 4% of enterprise traffic actually sharded — power-law distribution means targeting low-traffic Workers (most likely to be evicted) yields disproportionate benefit
- Latency overhead sub-1ms for cross-server proxying vs typical cold start duration — net positive tradeoff
- Key insight: accepting minimal per-request proxying overhead nearly eliminates cold starts for tail-latency-sensitive workloads (0.1% → 0.01% of requests)
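The "home shard" idea can be sketched with a small consistent hash ring. The class name `ShardRing`, the virtual-node count, and the server and script names are hypothetical; the article's actual ring parameters are not given in these notes.

```python
import bisect
import hashlib


def _point(key: str) -> int:
    # Position on the ring; SHA-256 is an assumed, illustrative hash choice.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class ShardRing:
    """Maps a script ID to the server most likely to hold a warm instance."""

    def __init__(self, servers, vnodes=64):
        # Each server owns several points so load spreads evenly.
        self._ring = sorted(
            (_point(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def home(self, script_id: str) -> str:
        # First ring point clockwise from the script's hash, wrapping around.
        idx = bisect.bisect_right(self._points, _point(script_id))
        return self._ring[idx % len(self._ring)][1]
```

Every server that builds the ring from the same server list computes the same home for a script, and removing one server only remaps the scripts that lived on it.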
-
R2 SQL: A Deep Dive into Our New Distributed Query Engine (Cloudflare) -- Distributed SQL engine on top of R2 object storage
Key insights:
- Three-layer Iceberg metadata pruning: partition (manifest list) → file (manifest column stats) → row-group (Parquet footer stats) — eliminates data before any read
- Streaming pipeline: planner emits work units as soon as available; executor consumes concurrently — no "plan complete then execute" barrier
- ORDER BY-aware manifest ordering: planner walks files in user's sort order, enabling early termination when top-K heap's threshold exceeds remaining metadata high-water mark
- Row group as primary work unit: 1 multi-GB Parquet file = N parallel partitions, each with own CPU cache locality
- Built on DataFusion (Rust): vectorized execution, filter pushdown, row-group-level parallelization out of the box
- Columnar projection: only referenced columns transferred from R2 → massive reduction in network egress and decompression cost
- Arrow IPC over gRPC for worker→coordinator results; zero-copy on both ends inside the worker
- Serverless: runs on Workers + R2, no provisioned cluster; coordinator selected per query via internal API; Argo Smart Routing handles connectivity
- "Bite-sized pieces" model = power-of-two parallelism that adapts to query selectivity without explicit reshaping
- Two-phase architecture: Query Planner (metadata-driven pruning) + distributed Query Execution across Cloudflare's global network, in a coordinator-worker model
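The ORDER BY-aware early termination can be illustrated for `ORDER BY value DESC LIMIT k`. This is a simplified sketch: each "file" is a hypothetical `(max_value, rows)` pair standing in for a data file plus the maximum taken from its metadata, not R2 SQL's real structures.

```python
import heapq


def top_k_desc(files, k):
    """Return (top-k values in descending order, number of files read).

    files: iterable of (max_value, rows), where max_value comes from
    metadata and rows are the values stored in that file.
    """
    if k <= 0:
        return [], 0
    heap = []  # min-heap holding the k largest values seen so far
    files_read = 0
    # Walk files in the query's sort order: highest metadata maximum first.
    for max_value, rows in sorted(files, key=lambda f: f[0], reverse=True):
        # Once the k-th best value is at least the best this file could
        # offer, no remaining file can change the answer.
        if len(heap) == k and heap[0] >= max_value:
            break
        files_read += 1
        for value in rows:
            if len(heap) < k:
                heapq.heappush(heap, value)
            elif value > heap[0]:
                heapq.heapreplace(heap, value)
    return sorted(heap, reverse=True), files_read
```

For example, with four files and `k=3`, the loop can stop after two files if their rows already beat every remaining file's maximum.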
-
R2 SQL Aggregations (Cloudflare) -- Adding GROUP BY/SUM to R2's distributed SQL engine
Key insights:
- Workers emit partial-aggregate states, not raw rows; "multiple pre-aggregates can be merged" enables horizontal scaling
- Scatter-gather works for simple aggregations (no HAVING/ORDER BY): coordinator receives small partial states, bounded memory regardless of input size
- High-cardinality GROUP BY (IPs, user IDs) breaks scatter-gather → triggers hash-based shuffle on GROUP BY columns; deterministic partitioning needs no central coordinator
- Synchronization barrier: workers buffer outbound shuffle data + await coordinator ACK before next stage — guarantees complete dataset per worker after shuffle
- Post-shuffle workers hold full per-group data → apply HAVING + local ORDER BY independently; coordinator only does final k-way merge
- LIMIT pushdown: coordinator merges streams until top-K found, then halts upstream; back-pressures workers to stop early
- Memory boundedness: pushing HAVING and sort down to workers prevents coordinator from becoming bottleneck even at PB scale
- Cardinality is the design dimension: low-card → scatter-gather (cheap), high-card → shuffle (correct); engine picks at plan time from stats
- Pattern reusable in any object-store SQL engine: Iceberg metadata + DataFusion + Arrow IPC shuffle = scalable analytics without long-lived cluster
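A small sketch of why partial-aggregate states merge cleanly, using `AVG` carried as `(sum, count)`. The function names and the `shuffle_target` helper are illustrative assumptions rather than the engine's API.

```python
import hashlib
from collections import defaultdict


def partial_aggregate(rows):
    """Worker side: fold raw (group, value) rows into {group: (sum, count)}."""
    state = defaultdict(lambda: (0.0, 0))
    for group, value in rows:
        total, count = state[group]
        state[group] = (total + value, count + 1)
    return dict(state)


def merge_partials(partials):
    """Coordinator side: combine small partial states, never raw rows."""
    merged = defaultdict(lambda: (0.0, 0))
    for partial in partials:
        for group, (total, count) in partial.items():
            m_total, m_count = merged[group]
            merged[group] = (m_total + total, m_count + count)
    return dict(merged)


def finalize_avg(merged):
    return {group: total / count for group, (total, count) in merged.items()}


def shuffle_target(group_key: str, n_workers: int) -> int:
    """Deterministic owner of a group for the high-cardinality shuffle path.

    Uses hashlib because Python's built-in hash() of str varies per process.
    """
    digest = hashlib.sha256(group_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_workers
```

Scatter-gather only ever moves one `(sum, count)` pair per group per worker. When groups are too numerous for that, `shuffle_target` shows how every worker can agree on a group's owner without asking a coordinator.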
-
The Principles of Extreme Fault Tolerance (PlanetScale) -- Design principles for highly fault-tolerant database infrastructure
Key insights:
- Three core principles: Isolation (physically/logically independent parts), Redundancy (replicated + isolated copies), Static Stability (last-known-good state on failure)
- Data plane (queries, storage) operates independently from control plane (management); control plane failures don't disrupt queries
- Each cluster: primary + minimum 2 replicas across 3 availability zones; synchronous replication (commit persists on replica before primary ACK)
- Weekly failover testing on every customer database as changes ship; ensures failover mechanisms remain practiced and reliable
- Progressive rollouts: changes ship gradually via feature flags and release channels; limits blast radius of operator errors
- Critical query path has minimal dependencies; external failures (Docker registry, control plane outages) don't impact active queries
- Automated failover handling: instance, zonal, and regional failures trigger failover with query buffering to minimize disruption
-
PlanetScale Postgres Operations Philosophy -- Operational design principles for managed Postgres
Key insights:
- Three-node mandatory minimum (primary + 2 replicas) across AZs; no single-node deployments offered even at lowest tier — fault tolerance baseline non-negotiable
- Synchronous replication via Postgres `synchronous_commit = remote_apply` to at least one replica; commit fence waits for replica apply (not just receive) before client ACK
- 10-second target failover: orchestrator detects primary failure → promotes most-caught-up replica → updates routing → in-flight queries buffered
- Dual connection paths: PgBouncer transaction pooler for high concurrency + direct unpooled for prepared statements / advisory locks / `SET LOCAL`
- No CPU autoscaling: scaling triggers replica swap with larger instance; predictable cost, no thrash, but requires headroom planning
- Vacuum and autovacuum tuning intentionally conservative: prevents wraparound emergencies on long-running multi-TB tenants
- Backups: continuous WAL archiving to S3 + nightly base backups; PITR to any second within retention window
- Philosophical bias: prefer "boring, predictable" operations over "elastic, dynamic" — fewer moving parts = fewer failure modes
-
Aurora DSQL: Serverless, Scalable, Global OLTP (Marc Brooker, CMU) -- Aurora DSQL architecture deep-dive
Postmortems
-
Supabase Incident on February 12, 2026 -- Supabase production incident postmortem
Key insights:
- Root cause: deployment inadvertently enabled AWS VPC Block Public Access in "block-bidirectional" mode regionally — disabled all internet gateways across 20+ subnets in us-east-2
- Total regional outage: all services (DB, Auth, APIs, Edge Functions, Storage, Realtime) down for 3h42m; VPC-peered customers unaffected
- 14-minute detection lag: outage started 21:12 UTC, first alert at 21:26 — blind spot let cascading failures propagate
- Investigation misdirected by elevated Management API errors → team chased AWS provider issue, not network; single CloudTrail `ModifyVpcBlockPublicAccessOptions` line "did not jump out"
- Pre-prod environment lacked us-east-2 → week of test deploys revealed nothing; environment parity gap is the structural fault
- Correlation breakthrough at 3h required matching deployment timestamp (21:12) with outage onset + cross-team infrastructure engagement
- Access control gap: monitoring service deployment had no guardrails preventing account/region-scoped AWS resource modifications
- Comms failures stacked: status page lagged, dashboard banners didn't appear, social channels silent for hours
- Forward fix: non-customer services moved to separate AWS accounts, blocklist for problematic resource types, external connectivity probes, full pre-prod parity across all regions
-
Post-mortem of Shai-Hulud Attack (PostHog) -- PostHog production attack postmortem
-
Railway: Diagnosing System Failure with Logs, Metrics, Traces, and Alerts -- Postmortem-driven approach to observability
Language Adoption
-
WhatsApp Deploys Rust-Based Media Parser to Block Malware on 3B Devices -- WhatsApp replacing C/C++ parsers with Rust at massive scale
Key insights:
- ~160K LOC of C++ media-parsing code replaced by ~90K LOC of Rust (~44% reduction); deployed to all 3B devices via WhatsApp client
- "Kaleidoscope" = Rust-based malware/threat-detection engine running alongside parser; flags malicious media before decode reaches OS codecs
- Memory-safety class of bugs (use-after-free, OOB read, double-free in image/video parsing) — historically the dominant exploit surface in messengers — eliminated by Rust ownership model at compile time
- Binary-size overhead measured at ~200 KB on Android — explicitly judged acceptable for the safety guarantee; APK budget engineering required to stay within tolerance
- Cross-platform: same Rust crate compiled for Android (NDK), iOS, Windows, macOS — reduces parser-divergence bugs across client platforms
- Differential fuzzing harness ran Rust + C++ parsers on same inputs to validate bitwise-identical output before cutover
- Pattern: pick the high-blast-radius security-critical layer (media parsing) as first Rust beachhead in a giant C++ codebase, not greenfield modules
- Consistent with Microsoft/Google data: roughly 70% of the security vulnerabilities they report in C/C++ codebases are memory-safety bugs; Rust-at-parser-boundary is the highest-leverage mitigation
-
Ladybird Adopts Rust -- Ladybird browser project's strategy for incremental Rust adoption
Key insights:
- Phased coexistence, not rewrite: Rust modules live behind well-defined C++ interop boundaries; C++ stays primary language
- LibJS chosen as first target: lexer + parser + AST + bytecode generator — self-contained, huge test coverage (test262), low coupling
- Byte-for-byte compatibility required: 52,898 test262 + 12,461 Ladybird regression tests must produce identical output, zero perf regression
- Translated Rust deliberately non-idiomatic: preserves C++ register-allocation patterns so both compilers emit identical bytecode opcodes
- AI-assisted (Claude Code, Codex) but human-steered: "hundreds of small prompts" + adversarial review, not autonomous generation
- 25,000 lines ported in ~2 weeks vs estimated months — productivity gain comes from AI as smart translator + human as architect/reviewer
- Core team gatekeeps porting: contributors must coordinate before starting to prevent duplicate work and divergent design choices
- Avoids the "rewrite trap": each ported module proves itself via test parity before next is started; never a half-Rust/half-C++ broken state
- Pattern matches WhatsApp's Rust strategy: target security/perf-critical, self-contained modules first; don't try to convert the world
-
Banned C++ in Chromium -- Why Chromium bans large portions of the C++ standard library
-
We Trusted Rust With the 3 Components That Could Not Fail -- Production Rust for mission-critical components
Key insights:
- Three components chosen for Rust: parsing, routing, boundary — selected not for language preference but because these were the parts "we could not afford to be wrong about"
- Under +38% request surge: other components saw CPU plateau and P99 jump from 210ms → 4.8s; Rust components maintained identical latency, unchanged memory, 0.00% error rate
- Key failure modes avoided: queue growth, allocator fragmentation, synchronized retry storms — all emerged in non-Rust components under pressure
- Core insight: "Correct" architecturally ≠ "safe" under stress; Rust's compile-time guarantees caught failure modes that testing couldn't
- Written alongside C++ differential fuzzing for parity validation before transition
-
Apache Iggy's Migration to Thread-per-Core Architecture Powered by io_uring -- Thread-per-core + io_uring migration for high-throughput messaging
Key insights:
- Tokio's work-stealing executor hit a ceiling: task migrations caused cache invalidations, regular file I/O blocked threads despite epoll readiness
- io_uring is completion-based (submit op, kernel drives to completion) vs epoll's readiness-based model; heavily batches syscalls reducing context switches
- Chose compio runtime over monoio/glommio for active maintenance and decoupled driver/executor architecture
- "Work stealing to work steering": one thread per CPU core, no shared state, reduced lock contention
- Pitfall: RefCell borrows across .await points cause runtime panics; solved with ECS-style component splitting (State, Storage)
- Hybrid consistency: shared strongly-consistent resources + sharded eventually-consistent ones via left-right concurrent data structure
- Results: P99 latency -60% (4.52ms to 1.82ms, 32 partitions), P9999 -57%; fsync mode: +18% throughput, -16% P95 latency
- Gap identified: POSIX APIs don't expose io_uring capabilities (request chaining, registered buffers); ecosystem lacks DST-friendly pluggable components
Techniques
Algorithms, performance, OS internals, networking, compilers.
CPU & Performance Optimization
-
Understanding CPU Microarchitecture to Increase Performance -- CPU pipelines, branch prediction, cache hierarchies, perf-aware code
-
Software Optimization Resources (Agner Fog) -- Definitive manuals on C++ and assembly optimization, microarchitecture
-
Optimizing C++ (Agner Fog) -- Comprehensive C++ performance optimization guide
-
Abseil Performance Hints -- Google's Abseil library tips for high-performance C++
-
Optimizations Past Their Prime (Abseil) -- Which classic optimizations no longer help on modern hardware
Key insights:
- Runtime CPU feature dispatch is wasteful once an ISA extension is universal: checking for `popcnt` on every modern x86_64 burns cycles for an always-yes answer
- Inline asm blocks compiler optimization: hand-written `popcnt` asm prevented LLVM from fixing a known false-dependency bug; the "fast" path stayed slow
- `__builtin_popcount` overtook hand-tuned asm once compilers emit `popcnt` directly + constant-fold + inline aggressively
- Redundant null checks (`CHECK_EQ` re-checking `str_ != nullptr`) can't be eliminated by optimizer once the abstraction stack hides the invariant
- Wrapping `std::string*` in `CheckOpString` hid pointer relationships → optimizer lost the ability to reason about control flow
- Debug builds sometimes outperformed release: layers of dead optimization had become counterproductive overhead
- Idiomatic code ages better than clever code: clear portable C++ stays optimizer-friendly as hardware evolves; intrinsics rot
- General rule: an optimization "valuable in 2010" deserves re-benchmarking; the cost-benefit can flip silently as compilers + CPUs improve
- Counter-intuitive corollary: removing old optimizations is itself an optimization worth doing
-
How Michael Abrash Doubled Quake Framerate -- Classic assembly-level optimization from Quake development
-
I/O Is No Longer the Bottleneck -- How NVMe SSDs shifted the bottleneck from I/O to CPU
Key insights:
- Sequential read: 1.6 GB/s cold cache, 12.8 GB/s warm cache on modern NVMe
- Hand-optimized AVX2 word-counting: only 1.45 GB/s (warm) = 11% of sequential disk speed
- Standard C `wc -w`: 245 MB/s (6.5x slower than disk); vectorized C: 330 MB/s (4.8x slower)
- Branches in inner loops prevent compiler auto-vectorization; manual SIMD required
- Hash map cache misses create additional CPU bottlenecks beyond raw throughput
- Key takeaway: single-threaded CPU processing is now the real constraint, not storage I/O
- Implication: system design should optimize for computation efficiency, not just I/O patterns
-
Best Practice Guide: Modern Processors and Accelerators (PRACE) -- NUMA, cache hierarchies, vectorization, and HPC optimization
-
Sub-NUMA Clustering vs Hemisphere/Quadrant Modes -- Intel SNC and NUMA topology modes for memory-performance tuning
-
Performance and Benchmarking (Chapter 1) -- Foundations of performance measurement: metrics, methodology, pitfalls
-
Tech Column: Cache, NoC, Performance Optimization -- Cache design, network-on-chip, hardware-software co-optimization
-
Perf Ninja: Low-Level Performance Analysis Course -- Hands-on CPU microarchitecture performance tuning course
-
Inside High-Frequency Trading Systems: The Race to Zero Latency -- Architecture and latency optimization patterns in HFT
-
I Made Zig Compute 33 Million Satellite Positions in 3 Seconds -- SIMD and cache-friendly optimization in Zig
Key insights:
- Zig's `@Vector(4, f64)` SIMD primitive is portable: LLVM backend picks AVX/NEON/etc.; no per-arch intrinsics in user code
- Branchless hot path uses `@select` masked-selection: compute both branches, pick per-lane; avoids branch-mispredict cost in tight propagation loop
- `comptime` precomputation bakes gravity/polynomial constants into the binary; no runtime init; gave scalar baseline 5.2M propagations/sec start
- Cache-tiling at 64 time-points per satellite batch keeps time data hot in L1/L2 across 13,000 sats; opposite of naïve sat-major iteration
- SoA layout: `ElementsV4` holds each orbital element as its own `@Vector(4, f64)`; "pre-splatting" eliminates broadcast ops inside hot loop
- Custom polynomial atan2 (LLVM has no vectorized atan2): ~1e-7 rad accuracy = ~10mm at LEO, well below SGP4's km-scale error budget
- Final perf: 11-13M propagations/sec native SIMD (2× scalar), 7M/sec via Python bindings, full 13,000-sat catalog in 3.3 s
- Lesson: algorithmic parallelism (lane organization, cache tiling, SoA) dominates raw hardware — same chip, 2× from layout alone
- Zig as systems language: comptime + native SIMD + no FFI overhead makes it competitive with hand-written C/Rust for numerics
Concurrency & Parallelism
- Is Parallel Programming Hard? (Paul McKenney's perfbook) -- Comprehensive reference: parallel programming, memory ordering, RCU, lock-free algorithms
- The ABA Problem in Concurrency -- ABA problem in lock-free data structures and solutions
- Multi-Core By Default (Ryan Fleury) -- Designing software for multi-core from the start
- Memory Management Reference -- Allocators, GC algorithms, and memory management techniques
Hashing & Data Structures
- Looking at Randomness and Performance for Hash Codes -- Empirical hash function quality and performance trade-offs
- wyhash: The Fastest Quality Hash Function -- Extremely fast, high-quality hash function for production
- Sort Research in Rust -- Benchmarking sort algorithms (pdqsort, timsort, etc.) in Rust
- Workshop on Filter Data Structures (SPAA 2023) -- Bloom, cuckoo, quotient filters and modern filter structures
- Undergraduate Upends a 40-Year-Old Data Science Conjecture -- Disproof of Yao's 1985 conjecture on probe complexity in open-addressing hash tables
Linux Kernel & eBPF
-
Interactive Map of Linux Kernel -- Visual map of Linux kernel subsystems
-
Linux Kernel Schedulers -- CFS, SCHED_FIFO, SCHED_DEADLINE overview
-
Sched: Rewrite MM CID Management (Thomas Gleixner) -- Kernel scheduler patch: 15% PostgreSQL improvement
-
Cache and TLB Flushing Under Linux -- Cache/TLB coherence APIs
-
Memory Allocation Guide (Linux Kernel) -- Slab allocator, kmalloc, vmalloc, GFP flags
-
Announcing systing 1.0 -- New Linux kernel tracing/debugging tool
Key insights:
- eBPF-based system tracer by Josef Bacik (btrfs maintainer); output writes directly to DuckDB Parquet for SQL post-analysis instead of bespoke trace formats
- Timeline view: per-task scheduling state (running/runnable/blocked) overlaid with stack traces at sched_switch + sched_wakeup events
- Stuck on networking issue → systing identified syscall-level blocker via kretprobe timing → cut 12 s tail latency to 2 s after fix
- MCP integration: Claude (or any LLM client) can query the DuckDB trace via SQL, ask "which threads were blocked longest and on what" — natural-language perf forensics
- Kretprobe-based regression detection: compares per-function latency distributions across runs; flags 99th-percentile shifts that average masks
- Designed to replace ad-hoc combinations of perf + bpftrace + flamegraph + custom scripts for everyday kernel-side debugging
- DuckDB choice deliberate: columnar Parquet trace files are durable, shareable, and analyzable offline without re-running workload
- Positions DuckDB-backed traces as a general pattern for systems observability — same idea seen in eBPF profiler ecosystems
-
AI Helped Uncover a 50-80x Improvement for Linux io_uring -- Major io_uring performance improvement
-
All My Favorite Tracing Tools: eBPF, QEMU, Perfetto -- Survey of tracing/profiling tools for systems performance
-
eBPF on Hard Mode -- Advanced eBPF usage patterns and pitfalls
Key insights:
- Unprivileged eBPF: limited to 4096 instructions, no subprograms/loops/back edges; only socket filters and cgroup socket buffers
- Full capability requires CAP_BPF + CAP_NET_ADMIN + CAP_PERFMON
- BTF (BPF Type Format) required for advanced features: subprograms and callbacks need explicit type signatures
- Writing without libbpf/LLVM means manually constructing instruction arrays — "bytecode rawdogging"
- String matching via `strncmp` helper needs read-only maps with BPF_F_RDONLY_PROG flags and freezing
- KFunc calls use BTF ID-based invocation, requiring runtime extraction from /sys/kernel/btf/vmlinux
- Verifier transforms dead code into infinite loops (ja -1); ALU constants rewritten as Spectre mitigation
- Verifier output is essential debugging tool: logs reveal register states and instruction processing metrics
- Kernel version sensitivity: verifier gets smarter each release, creating compatibility risks for bytecode-level programs
-
eBPF Ring Buffer vs Perf Buffer -- Comparing eBPF event output mechanisms
-
ePass: Verifier-Cooperative Runtime Enforcement for eBPF -- Novel eBPF safety combining verifier and runtime enforcement
-
Profiling in Production: eBPF Continuous Profiling -- Always-on production profiling with minimal overhead
-
profile-bee: Rust-based eBPF CPU Profiler -- Lightweight eBPF profiler with stack unwinding
-
BPF Instruction Set Specification -- Formal eBPF ISA specification
-
Building eBPF/XDP L2 DSR Load Balancer from Scratch -- Hands-on XDP/eBPF load balancer
-
Building eBPF/XDP IP-in-IP DSR Load Balancer -- IP-in-IP encapsulation variant
Networking
-
How NAT Traversal Works -- STUN, TURN, ICE, and NAT hole-punching techniques
Key insights:
- Stateful firewalls permit inbound UDP only after matching outbound traffic; two peers must send packets simultaneously for hole-punching
- STUN: "what's my endpoint from your point of view?" reveals public IP:port mapping created by NATs
- NAT taxonomy: Endpoint-Independent Mapping (EIM, "easy", consistent ports) vs Endpoint-Dependent Mapping (EDM, "hard", varies by destination)
- Birthday paradox optimization for symmetric NATs: open multiple ports on one side, probe random ports on other — statistically faster than exhaustive scan
- Port mapping protocols (UPnP IGD, NAT-PMP, PCP) allow explicit port forwarding requests, "making one NAT vanish from the data path"
- Tailscale's DERP: simultaneous fallback relay and upgrade helper to peer-to-peer connections
- ICE core algorithm: "try everything at once, and pick the best thing that works"
- Hairpinning: NATs often fail to route between internal devices using external addresses; problematic with CGNAT
- IPv6 eliminates many issues but mixed deployments require NAT64, DNS64, CLAT compatibility layers
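The birthday-paradox trick can be checked with a short probability calculation. The figures used here (256 open ports, a 65,535-port space) are assumptions chosen for illustration, and `hit_probability` is a hypothetical helper.

```python
def hit_probability(open_ports: int, probes: int, port_space: int = 65535) -> float:
    """Chance that at least one of `probes` distinct random ports
    lands on one of `open_ports` ports held open by the other side."""
    miss = 1.0
    for i in range(probes):
        remaining = port_space - i  # ports not yet probed
        if remaining <= open_ports:
            return 1.0  # only open ports are left, a hit is certain
        miss *= 1 - open_ports / remaining
    return 1 - miss


# Compare one open port (plain scan) with 256 open ports (birthday trick).
for probes in (174, 1024):
    single = hit_probability(1, probes)
    many = hit_probability(256, probes)
    print(f"{probes} probes: 1 open port {single:.3f}, 256 open ports {many:.3f}")
```

Under these assumptions, a few hundred probes against many open ports already give a good chance of a hit, whereas the same number of probes against a single port rarely succeeds.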
-
QUIC: A UDP-Based Multiplexed and Secure Transport (RFC 9000) -- QUIC transport protocol specification (HTTP/3 foundation)
-
HyStart++: Modified Slow Start for TCP (RFC 9406) -- Improved TCP slow-start algorithm
-
Stream Control Transmission Protocol (RFC 9260) -- SCTP: multi-streaming, multi-homing transport
-
WebRTC for the Curious: Real-time Networking -- Jitter buffers, congestion control, real-time transport
-
Network Protocols, Sans I/O -- Protocol state machines decoupled from I/O
-
Networking Protocol Sequence Diagrams -- Visual sequence diagrams for TCP, IP, ARP, DHCP
-
TUN/TAP Interface Tutorial -- Virtual network interfaces for tunneling
-
How Container Networking Works: Bridge Network from Scratch -- Linux namespaces, veth pairs, and bridges
Containers & Virtualization
- How Container Filesystem Works: Building a Docker-like Container -- Overlay filesystems and container image internals
- FUSE - Filesystem in Userspace (Linux Kernel docs) -- Kernel-side FUSE architecture and request handling
- virtio specification v1.2 -- OASIS standard for para-virtualized I/O devices
- gVisor: Sandboxed Container Runtime -- Google's user-space kernel for container isolation
- crosvm: Chrome OS Virtual Machine Monitor -- Google's Rust-based VMM for Chrome OS / Android
- Building the Virtualization Stack with rust-vmm -- Reusable Rust crates for custom VMMs (Firecracker, Cloud Hypervisor)
- How Terminals Work -- Terminal emulators, TTY subsystem, and PTY internals
Compilers & Toolchain
-
LLVM Architecture (AOSA Book) -- Chris Lattner on LLVM's modular compiler architecture
Key insights:
- Three-phase design: frontend (parsing/AST) -> optimizer (mid-level transforms) -> backend (codegen); enables N languages x M targets without N*M implementations
- LLVM IR is a "first-class language with well-defined semantics" in 3 forms: textual .ll, in-memory data structures, binary bitcode
- IR is fully self-contained (unlike GCC's GIMPLE): no reference to frontend/backend data structures; enables text-based pipelines and external tools
- Modular pass architecture: independent optimization passes (inlining, constant prop, etc.) can be mixed/reordered; PassManager resolves dependencies
- Library-based design: clients link only needed functionality; "collection of useful compiler technology" not a monolithic compiler
- Target Description Language (.td): declare registers/instructions/constraints once; tblgen auto-generates assemblers, disassemblers, instruction selectors
- Bitcode serialization enables link-time optimization (LTO) and install-time optimization across translation units
- Individual passes testable in isolation via IR load -> run pass -> verify output; BugPoint automates test case reduction
- Separation of concerns: frontend devs need only IR semantics; backend authors work independently; lowers contribution barriers
-
LLVM Documentation -- Official LLVM docs: IR, passes, backends, tooling
-
LLVM Inliner Pass Deep Dive -- LLVM function inlining pass analysis
-
LLVM Machine Code Analyzer on Godbolt (Arm) -- Instruction scheduling and pipeline throughput analysis
-
How Compiler Explorer Works in 2025 (Matt Godbolt) -- Architecture behind godbolt.org
-
Compiler Engineering in Practice -- Part 1 -- Practical compiler engineering series
-
CS 6120: Advanced Compilers (Cornell, Self-Guided) -- SSA, optimization passes, dataflow analysis
-
ACM India Winter School on Compiler Design -- IIT Madras compiler design materials
-
Clang Hardening Cheat Sheet - Ten Years Later -- Clang/LLVM compiler flags for binary hardening
-
Finding and Understanding Bugs in C Compilers (Csmith, PLDI'11) -- Random C program generation for compiler testing
-
Test-Case Reduction for C Compiler Bugs (C-Reduce, PLDI'12) -- Automated test case minimization
-
Reflections on Trusting Trust (Ken Thompson) -- Classic on compiler trust chains
Debuggers & Profiling
- The GDB JIT Interface -- Registering JIT-compiled code with GDB for debugging
- RAD Debugger (Epic Games) -- Native graphical debugger, open source
- Demystifying Debuggers (Ryan Fleury) -- How debuggers work at the OS/CPU level
Distributed Systems Theory
-
Hedging: A Simple Tactic to Tame Tail Latency -- Request hedging patterns for P99 latency reduction
Key insights:
- Hedging sends duplicate requests to alternate backends after a timeout threshold (e.g., 20ms); use whichever responds first
- Requires idempotent operations to prevent side effects from duplicate execution
- Google BigTable: 96% reduction in tail latency with only 2% increase in total requests
- Google MapReduce: disabling backup tasks made the sort benchmark take 44% longer
- Grafana Tempo: 45% reduction in tail latency
- Simulation (20K requests): P99 87.88ms to 19.13ms (-78%), P100 278.62ms to 19.94ms (-93%), mean 12.13ms to 9.71ms (-20%), load overhead only 6.8%
- Most effective when multiple backend instances exist and rare server slowdowns cause tail latency
- Threshold selection is critical: too aggressive wastes resources, too conservative misses the window
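The hedging flow above can be sketched with asyncio. This is a minimal illustration, not the article's code: `hedged_call`, `make_backend`, and the two-backend setup are hypothetical names, and the 20 ms default simply mirrors the example threshold in the notes.

```python
import asyncio


async def hedged_call(backends, request, hedge_after=0.020):
    """Call backends[0]; if it has not answered within `hedge_after` seconds,
    also call backends[1] and return whichever answers first.
    Only safe when the request is idempotent."""
    primary = asyncio.create_task(backends[0](request))
    done, _ = await asyncio.wait({primary}, timeout=hedge_after)
    if done:
        return primary.result()  # fast path: no duplicate request was sent

    backup = asyncio.create_task(backends[1](request))
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop paying for the slower attempt
    winner = primary if primary in done else backup
    return winner.result()


def make_backend(name, delay, calls):
    """Hypothetical backend: records the call, sleeps `delay`, echoes the request."""
    async def backend(request):
        calls.append(name)
        await asyncio.sleep(delay)
        return f"{name}:{request}"
    return backend
```

With a slow primary and a fast backup, `asyncio.run(hedged_call([slow, fast], "get"))` returns the backup's answer; with a fast primary the backup is never contacted, which is why the extra load stays small.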
-
Keeping CALM: When Distributed Consistency is Easy -- CALM theorem: monotonic programs can be eventually consistent
-
Distributed Transactional Systems Cannot Be Fast -- Fundamental lower bounds on distributed transaction latency
-
Shinjuku: Preemptive Scheduling for Microsecond-scale Tail Latency (NSDI'19) -- Microsecond-scale preemptive scheduling for datacenter RPCs
-
uCache: A Customizable Unikernel-based IO Cache (FAST'26) -- Unikernel-based I/O caching layer
-
Cuttlefish: Coordination-free Distributed State Kernel -- Nanosecond-latency distributed state without coordination
-
Distributed System Algorithms Reference -- Curated distributed systems algorithms with explanations
-
On System Design (ACM) -- Classic ACM paper on principles of system design
Misc Techniques
- Write Your Own Virtual Machine (LC-3) -- Step-by-step guide to building an LC-3 VM
- Writing an OS: Baby Steps -- Bare-metal OS development from bootloader to protected mode
- FreeRTOS Context Switch Implementation -- How FreeRTOS implements task context switching
- UTF-8 Everywhere -- Technical argument for UTF-8 as the universal encoding
- Full-Blown Cross-Assembler in a Bash Script -- Multi-target cross-assembler entirely in Bash
- Introduction to IA-32e Hardware Paging -- x86-64 page table internals
- ELF Binaries on Linux: Understanding and Analysis -- ELF format internals
- How to Write Shared Libraries (Ulrich Drepper) -- Definitive guide to ELF shared libraries, PLT/GOT, dynamic linking
- Shared Libraries in Windows and Linux -- Comparing dynamic linking and symbol resolution across OSes
- Dijkstra's in Disguise -- How many algorithms reduce to shortest path problems
Data Storage
Databases, storage engines, file formats, replication, caching.
PostgreSQL
-
The Internals of PostgreSQL (interdb.jp) -- Free book: buffer manager, WAL, MVCC, executor, query processing
Key insights (Ch.9 WAL):
- XLOG records written to WAL buffer in memory, then flushed synchronously to WAL segment files on transaction commit
- LSN (Log Sequence Number) = location where record is written on the transaction log; unique identifier for each XLOG record
- Checkpoint writes a special XLOG record containing the REDO point = "location to write the XLOG record at the moment when checkpoint started"
- Full-page writes (FPW, default on): first modification after checkpoint writes header + entire page as "backup block" — torn page protection
- Recovery replays XLOG records sequentially from REDO point; record replayed only if record LSN > page LSN, otherwise skipped
- PostgreSQL XLOG = REDO log only; no UNDO log support (unlike Oracle/MySQL InnoDB)
- Backup blocks can restore pages corrupted during background writer operations (torn writes)
- Checkpoint processing and database recovery are tightly coupled and inseparable
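The replay rule above (start at the REDO point, apply a record only if its LSN is newer than the page's LSN) can be sketched as a toy model. The `Page` and `XLogRecord` classes and the key/value page layout are illustrative assumptions, not PostgreSQL structures; the sketch also assumes a backup block is restored unconditionally and already contains the change it accompanies.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Page:
    lsn: int = 0                      # LSN of the last record applied to this page
    data: dict = field(default_factory=dict)


@dataclass
class XLogRecord:
    lsn: int
    page_id: int
    key: str
    value: str
    full_page: Optional[dict] = None  # backup block (full-page image), if any


def replay(pages, wal, redo_point):
    """Replay `wal` (assumed sorted by LSN) from `redo_point`; return applied LSNs."""
    applied = []
    for rec in wal:
        if rec.lsn < redo_point:
            continue                  # already covered by the checkpoint
        page = pages.setdefault(rec.page_id, Page())
        if rec.full_page is not None:
            # Assumption: a backup block overwrites the page whatever its LSN,
            # which is what repairs a torn page.
            page.data = dict(rec.full_page)
        elif rec.lsn > page.lsn:
            page.data[rec.key] = rec.value
        else:
            continue                  # page already reflects this record
        page.lsn = rec.lsn
        applied.append(rec.lsn)
    return applied
```

Skipping records whose LSN is not newer than the page LSN is what makes replay idempotent: running recovery twice leaves the pages unchanged the second time.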
-
Learning PostgreSQL Internals (Paul Ramsey) -- Curated list of PostgreSQL internals resources
-
PostgreSQL Hacking Workshop -- Hands-on PostgreSQL source code workshop
-
PostgreSQL Internals - Indexes, WAL, MVCC, Locks and Queries -- Concise reference on core Postgres internals
-
PostgreSQL Recovery Internals -- WAL replay, crash recovery, timeline handling
-
PostgreSQL High-Availability Architectures -- Streaming replication, Patroni, PgBouncer patterns
-
PostgreSQL Performance: Latency in Cloud and On Premise -- Benchmarking latency across deployment environments
-
Unlocking High-Performance PostgreSQL: Key Memory Optimizations -- shared_buffers, work_mem, OS page cache tuning
Key insights:
- PG never reads directly from disk to client: data page → shared_buffers → caller; the buffer cache is the central perf knob
- Default shared_buffers = 128MB is inadequate; production dedicated boxes want 20-25% of RAM, ceiling ~40% before OS page-cache competition hurts
- work_mem is per-operation not per-session: 5 parallel workers × work_mem = 5× allocation; the dominant OOM trigger when tuned naively
- pg_stat_database cache-hit-ratio + EXPLAIN (ANALYZE, BUFFERS) together pinpoint which queries spill; measure before tuning
- Small system (< 64 GB) work_mem formula: ≈ 0.25% of RAM (~3 MB / GB), aggressive enough to suppress sort spills
- Large system (≥ 64 GB) safer formula: max(162MB, 0.125% RAM + 80MB), which prevents exponential growth under parallelism
- shared_buffers requires restart; work_mem can be set per session/role/transaction, so fine-grained tuning needs no downtime
- Over-sizing shared_buffers competes with OS page cache and increases dirty-page flush volume per checkpoint, causing write spikes
- Tuning order: measure cache hit ratio → fix shared_buffers → measure per-query spills → tune work_mem at session/role level, never globally aggressive
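The sizing rules above reduce to a few lines of arithmetic. The helper names are hypothetical and the percentages are taken straight from the notes (using 0.25% rather than the rounded "~3 MB / GB"); treat the output as a starting point to measure against, not a recommendation.

```python
def suggested_work_mem_mb(ram_gb):
    """work_mem starting point in MB, using the two formulas from the notes."""
    ram_mb = ram_gb * 1024
    if ram_gb < 64:
        return 0.0025 * ram_mb                 # small system: 0.25% of RAM
    return max(162.0, 0.00125 * ram_mb + 80.0)  # large system: 0.125% of RAM + 80 MB


def worst_case_mb(work_mem_mb, operations):
    """work_mem is per operation, so concurrent sorts/hashes multiply it."""
    return work_mem_mb * operations


def shared_buffers_range_mb(ram_gb):
    """20-25% of RAM for a dedicated box, per the notes."""
    ram_mb = ram_gb * 1024
    return 0.20 * ram_mb, 0.25 * ram_mb
```

For a 16 GB box this gives about 41 MB of work_mem, and a query running 5 parallel sort operations could then claim roughly 205 MB, which is the multiplication the OOM warning is about.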
-
Importance of Tuning Checkpoint in PostgreSQL -- Checkpoint tuning for write-heavy workloads
Key insights:
- Checkpoints guarantee heap + index files reflect all writes before that LSN — establish the REDO recovery point
- Full-page images (FPI) on first modification after checkpoint create predictable I/O spike — protects against torn pages but hurts steady-state perf
- Benchmark: 5-min → 60-min checkpoint_timeout cut WAL volume from 12 GB → 2 GB (6×) and FPI writes from 1.47M → 161K (9×)
- Production rule: checkpoint_timeout ≥ 30 min; default 5 min is far too aggressive for write-heavy workloads
- max_wal_size too small undoes timeout setting: it triggers WAL-volume-driven checkpoints early, restoring the FPI cascade
- checkpoint_completion_target = 0.9 spreads dirty-page writes across 90% of interval, eliminating the synchronous I/O cliff at boundary
- Recovery-speed misconception: PG replays WAL at ≥64 MB/s; even hour-long checkpoints recover in minutes, not hours, so long intervals are safe
- Bgwriter complements checkpointer: continuously trickles dirty pages so checkpoints have less to flush
- Trade-off: longer intervals = more WAL retained for recovery + larger replay window vs much lower steady-state write amplification
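The recovery-time argument is simple division, sketched below. The 64 MB/s replay rate comes from the notes; the 20 GB backlog is a made-up figure for illustration, not a number from the article.

```python
def replay_seconds(wal_mb, replay_mb_per_s=64.0):
    """Rough crash-recovery time: WAL written since the last checkpoint
    divided by the replay rate (64 MB/s is the floor quoted in the notes)."""
    return wal_mb / replay_mb_per_s


# Hypothetical backlog: 20 GB of WAL accumulated since the last checkpoint.
backlog_mb = 20 * 1024
minutes = replay_seconds(backlog_mb) / 60
```

Under these assumptions the replay takes 320 seconds, a little over 5 minutes, which is the sense in which long checkpoint intervals still recover "in minutes, not hours".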
-
Upgrading 200GB Postgres Within 10 Minutes in Heroku -- Fast major-version PostgreSQL upgrades
-
Mastering Logical Replication in PostgreSQL -- Comprehensive logical replication guide
-
Listen to Database Changes through the Postgres WAL -- WAL-based change data capture
-
PostgreSQL Materialized Views -- When and how to use materialized views
-
You Don't Need Elasticsearch: BM25 Is Now in Postgres -- Full-text search with BM25 ranking in Postgres
-
10 Elasticsearch Production Issues and How Postgres Avoids Them -- Elasticsearch pain points vs PostgreSQL alternatives
-
Postgres 18 Features I Will Actually Use in Production -- PostgreSQL 18 most impactful new features
-
PostgreSQL Developer Options: debug_io_direct -- Direct I/O developer option bypassing OS page cache
-
PostgreSQL Inval Reliability for Inplace Updates -- Cache invalidation correctness for inplace tuple updates
-
Scale PostgreSQL Horizontally with PgDog -- PostgreSQL proxy for horizontal sharding
-
Go + Postgres with sqlc: The Zero-ORM Stack -- Type-safe SQL in Go as used at Cloudflare
-
Explain Plan Visualizer by Datadog -- Interactive tool for visualizing PostgreSQL EXPLAIN output
MySQL & InnoDB
- The Basics of InnoDB Undo Logging and History System -- InnoDB MVCC undo log chain and purge system
- InnoDB Architecture (MySQL 8.1) -- Buffer pool, redo/undo, tablespaces, doublewrite
Storage Engines & Key-Value Stores
-
Log-Structured Merge Trees (Interactive) -- Visual explanation of LSM tree internals
-
Build Your Own KV Storage Engine -- Deletes, Tombstones, Compaction -- Hands-on KV engine with LSM-style compaction
-
CockroachDB Pebble: Binary Fuse Filters -- Binary fuse filters (faster than Bloom) in CockroachDB's LSM engine
Key insights:
- Xor-based structure: fingerprints satisfy f[h1(k)] XOR f[h2(k)] XOR f[h3(k)] = fingerprint(k) using 3 independent hash functions across consecutive segments
- Construction via hypergraph "peeling" algorithm: find positions with degree 1, solve iteratively until all keys processed
- ~24 bits per key during construction (12-24MB for typical L6 sstables with 500K-1M keys)
- Superior false positive rates: 8-bit binary fuse achieves ~1/256 FP vs 1/88 for traditional 10-bits-per-key Bloom
- Supports custom bitpacking: 4, 8, 12, or 16-bit fingerprint variants
- Query accesses 3 segments (potentially >1 cache line), but CPU parallelizes independent lookups; cold-cache only 1-2% slower than Bloom on M1
- Construction 2-3x slower than Bloom for short keys; gap reduces with longer keys (faster XXH3 hashing)
- Memory-conscious pooling: sync.Pool reuse for small/medium filters, limited concurrency for large, no reuse for very large
- PR adds full implementation without enabling anywhere yet; staged rollout planned
- TPCC benchmarks: Bloom queries = 0.2% CPU; binary fuse substitution estimated "about a wash" including construction overhead
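The peeling construction can be sketched with a plain xor filter, a simpler relative of the binary fuse filter. This is not Pebble's implementation: it uses three disjoint blocks rather than the fused consecutive-segment layout, 8-bit fingerprints, BLAKE2b in place of XXH3, and the sizing constant `1.23 * n + 32` is an assumption borrowed from the usual xor-filter setup. All function names are hypothetical.

```python
import hashlib


def _hashes(key, seed, block):
    """Return (8-bit fingerprint, [slot in block 0, slot in block 1, slot in block 2])."""
    d = hashlib.blake2b(key, digest_size=16, salt=seed.to_bytes(8, "little")).digest()
    slots = [
        i * block + int.from_bytes(d[4 + 4 * i: 8 + 4 * i], "little") % block
        for i in range(3)
    ]
    return d[0], slots


def build(keys, max_tries=100):
    """Build (seed, block, table) so that for every key
    table[s0] ^ table[s1] ^ table[s2] == fingerprint(key)."""
    keys = list(set(keys))
    block = (int(1.23 * len(keys)) + 32) // 3 + 1
    for seed in range(max_tries):
        info = {k: _hashes(k, seed, block) for k in keys}
        slot_keys = [set() for _ in range(3 * block)]
        for k, (_, slots) in info.items():
            for s in slots:
                slot_keys[s].add(k)
        # Peeling: repeatedly take a slot touched by exactly one remaining key.
        queue = [s for s, ks in enumerate(slot_keys) if len(ks) == 1]
        order = []
        while queue:
            s = queue.pop()
            if len(slot_keys[s]) != 1:
                continue
            k = next(iter(slot_keys[s]))
            order.append((k, s))
            for t in info[k][1]:
                slot_keys[t].discard(k)
                if len(slot_keys[t]) == 1:
                    queue.append(t)
        if len(order) != len(keys):
            continue  # peeling got stuck; retry with a different seed
        # Assign in reverse peel order so earlier assignments are never disturbed.
        table = [0] * (3 * block)
        for k, s in reversed(order):
            fp, slots = info[k]
            a, b = [t for t in slots if t != s]
            table[s] = fp ^ table[a] ^ table[b]
        return seed, block, table
    raise RuntimeError("peeling failed for every seed tried")


def contains(filt, key):
    """True for every inserted key; true for other keys with probability ~1/256."""
    seed, block, table = filt
    fp, (a, b, c) = _hashes(key, seed, block)
    return (table[a] ^ table[b] ^ table[c]) == fp
```

The three table reads in `contains` are independent of each other, which is the property the notes credit for the small cold-cache penalty relative to Bloom filters.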
-
bf-tree: Concurrent Larger-than-Memory Range Index (Microsoft Research) -- Modern concurrent B-tree variant in Rust
-
From Building Houses to Storage Engines (TidesDB) -- Lessons from building a storage engine from scratch
-
What Does a Database for SSDs Look Like? (Marc Brooker) -- SSD-optimized database storage engine design
Key insights:
- Challenges WAL-centric durability: replication across machines provides superior durability; local WAL unnecessary
- SSD transfer sweet spot: 32kB — below wastes throughput (IOPS-limited), above doesn't improve (throughput-limited); random access now viable
- Large pages (1MB+) optimized for spinning disks create false sharing on SSDs with poor spatial locality
- Updated five-minute rule: cache pages expected to be accessed within ~30 seconds (not 1986's economics)
- "Commit transactions to a distributed log" across AZs rather than local system durability
- Cross-AZ latency only at commit boundaries; batch coordination to leverage modern datacenter bandwidth
- Use strong hardware clocks for consistent reads across replicas without coordination overhead
- Default to SNAPSHOT isolation (not serializable) to avoid per-write coordination
- Preserve core relational model, SQL, atomicity, strong consistency — the abstractions remain valuable
-
The Quest for One Million IOPS at LanceDB -- Storage I/O benchmarking and optimization
-
HelixDB: Graph-Vector Database in Rust -- Combined graph + vector database in Rust
-
I Built Google Bigtable in Go -- Simplified Bigtable showing core SSTable/memtable concepts
Apache Arrow & Parquet
-
Apache Arrow C++ Cookbook -- Practical Arrow array/table examples in C++
-
A Practical Dive Into Late Materialization in arrow-rs Parquet Reads -- Late materialization to skip unnecessary I/O
Key insights:
- Late materialization: defer data column decoding until after predicates filter rows, minimizing I/O and CPU
- "LM-pipelined" strategy: sequentially evaluate predicates, build sparse row masks, then decode only surviving rows
- RowSelection abstraction: RLE for large skips, bitmasks for tiny gaps; adaptive switching based on avg run length (threshold: 32)
- RowSelection::and_then combines successive filters via linear-time zipper algorithm, no data copies
- Page pruning: skip entire Parquet pages when metadata confirms no selected rows, eliminating decompression
- Dual-layer caching (shared global + local pinned) prevents double-decoding when columns serve both filter and projection
- Zero-copy conversions for fixed-width types: decoded vectors handed directly to Arrow buffers
- Fuzz testing validates coordinate transformations between relative/absolute row offsets across batch boundaries
- Transforms Parquet reader into "mini query engine" with selective I/O efficiency
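The zipper behind `RowSelection::and_then` can be sketched in a few lines. This is a Python model, not the arrow-rs code: a selection is a list of `(selected, count)` runs, `and_then` and `prefer_bitmask` are hypothetical helper names, and the strict `< 32` comparison is an assumption about how the threshold is applied.

```python
def _push(runs, selected, count):
    """Append a run, merging it with the previous run when the flag matches."""
    if count == 0:
        return
    if runs and runs[-1][0] == selected:
        runs[-1] = (selected, runs[-1][1] + count)
    else:
        runs.append((selected, count))


def and_then(outer, inner):
    """Combine two selections in one linear pass.

    `outer` covers all rows; `inner` covers only the rows `outer` selected
    (its counts must sum to that number). The result covers all rows."""
    out = []
    inner_runs = iter(inner)
    cur_sel, cur_left = False, 0
    for selected, count in outer:
        if not selected:
            _push(out, False, count)      # rows already dropped stay dropped
            continue
        while count > 0:
            if cur_left == 0:
                cur_sel, cur_left = next(inner_runs)
            take = min(count, cur_left)
            _push(out, cur_sel, take)
            count -= take
            cur_left -= take
    return out


def prefer_bitmask(selection, threshold=32):
    """Assumed heuristic: short average runs favour a bitmask over run lengths."""
    total = sum(count for _, count in selection)
    return bool(selection) and total / len(selection) < threshold
```

For example, if the first predicate keeps rows 0-2 and 7-11 of 12, and the second predicate keeps 4 of those 8 survivors, the combined selection is expressed directly over the original 12 rows without touching any column data.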
-
parquet-linter: A Better Parquet Is Parquet Itself -- Validating and optimizing Parquet file layout
-
Hardwood: Minimal Dependency Parquet Implementation -- Clean Parquet implementation for learning
Query Engines & OLAP
-
Building Index-Backed Query Plans in DataFusion -- Adding index support to DataFusion's query planner
-
Optimizing SQL CASE Expression Evaluation (DataFusion) -- CASE expression optimization
-
Optimizing Repartitions in DataFusion -- Eliminating redundant repartitions
-
Extending SQL in DataFusion: from ->> to TABLESAMPLE -- DataFusion SQL extensibility
-
Apache DataFusion Comet Overview -- Native vectorized Spark execution on DataFusion/Arrow
-
Efficient String Compression for Modern Database Systems (CedarDB) -- String compression in analytical workloads
Key insights:
- Three-tier approach: Uncompressed, Single Value, Dictionary compression, plus FSST (Fast Static Symbol Table)
- FSST replaces frequently occurring substrings with fixed-size 1-byte codes; up to 255 symbols, with code 255 reserved as the escape marker
- Symbol selection: greedy, based on frequency x symbol_size compression gain; symbol table fits in L1 cache (~1ns access)
- Two-phase: build symbol table from sampled data, then tokenize full dataset
- ClickBench: 20% total data reduction, 35% string-specific; TPC-H: 40% total, ~60% string reduction
- Cold runs: up to 40% speedup for I/O-bound queries; hot runs: up to 2.8x slowdown for decompression-heavy queries
- Penalty threshold: 40% compression bonus required to justify FSST over dictionary encoding alone
- Combined FSST + dictionary: efficient predicate evaluation on keys while achieving better compression than dictionaries alone
- Compressed data treated as immutable, eliminating costly dictionary reordering
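The two-phase idea (build a symbol table from a sample, then tokenize everything) can be sketched as follows. This is a heavily simplified stand-in for FSST, not CedarDB's code: the real algorithm refines its table iteratively, while `build_symbols` here makes a single pass that ranks substrings by frequency × length. All names are hypothetical.

```python
from collections import Counter

ESCAPE = 255  # code 255 marks "next byte is a literal"


def build_symbols(sample, max_symbols=255, max_len=8):
    """Phase 1: pick substrings with the highest frequency * length from a sample."""
    gain = Counter()
    for s in sample:
        for n in range(2, max_len + 1):
            for i in range(len(s) - n + 1):
                gain[s[i:i + n]] += n  # each occurrence contributes its length
    return [sym for sym, _ in gain.most_common(max_symbols)]


def encode(data, symbols):
    """Phase 2: greedy longest-match tokenization into 1-byte codes."""
    by_len = sorted(range(len(symbols)), key=lambda c: -len(symbols[c]))
    out = bytearray()
    i = 0
    while i < len(data):
        for code in by_len:
            if data.startswith(symbols[code], i):
                out.append(code)
                i += len(symbols[code])
                break
        else:
            out += bytes([ESCAPE, data[i]])  # no symbol matched: escape one byte
            i += 1
    return bytes(out)


def decode(encoded, symbols):
    """Decoding is a table lookup per code, which is why it stays cheap."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        if encoded[i] == ESCAPE:
            out.append(encoded[i + 1])
            i += 2
        else:
            out += symbols[encoded[i]]
            i += 1
    return bytes(out)
```

Strings that share prefixes with the sample (URLs, for instance) shrink to a handful of codes, while unseen bytes cost two bytes each through the escape path, which is the trade-off behind requiring a compression bonus before choosing FSST.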
-
How ClickHouse Makes Top-N Queries Faster with Granule-Level Data Skipping -- Granule-level skipping for Top-N acceleration
Key insights:
- Granule = smallest processing unit (~8192 rows); min/max metadata from data-skipping indexes used to eliminate granules before reading
- Static Top-N: skip granules upfront using metadata; Dynamic Top-N: threshold filtering as execution progresses
- Converts Top-N into metadata-driven pruning problem: compare current Top-N threshold against granule boundaries
- Static gains: 5x faster (0.044s to 0.009s), 610x less data (100M rows to 164K), I/O from 1.2GB to 4.95MB
- Dynamic gains: 10x faster (0.325s to 0.033s), 7.7% of data read, I/O from 9.42GB to 520MB
- 50-billion-row tables: Top-N in under 0.2 seconds
- Composable with streaming execution, read-in-order, and lazy materialization
- Especially powerful for object storage / disaggregated compute where avoiding I/O saves network bandwidth
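Dynamic threshold pruning can be sketched with a min-heap and per-granule max metadata. This is a toy model of the idea for `ORDER BY value DESC LIMIT n`, not ClickHouse code; `Granule` and `top_n_desc` are hypothetical names and granules are tiny lists rather than ~8192-row blocks.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Granule:
    rows: list

    def __post_init__(self):
        # Stand-in for min/max metadata kept by a data-skipping index.
        self.min, self.max = min(self.rows), max(self.rows)


def top_n_desc(granules, n):
    """Return (top n values in descending order, number of granules read)."""
    heap = []          # min-heap holding the current top n; heap[0] is the threshold
    granules_read = 0
    for g in granules:
        if len(heap) == n and g.max <= heap[0]:
            continue   # metadata proves nothing here can enter the top n
        granules_read += 1
        for v in g.rows:
            if len(heap) < n:
                heapq.heappush(heap, v)
            elif v > heap[0]:
                heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True), granules_read
```

The threshold only rises as execution proceeds, so later granules are increasingly likely to be skipped from metadata alone, which is where the I/O savings come from.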
-
Modern OLAP Systems -- Survey of modern analytical database architectures
-
Jack of All Trades: Query Federation in Modern OLAP (FOSDEM 2026) -- StarRocks on query federation
-
Time-series and Analytical Databases (QuestDB P99) -- Time-series database internals and query optimization
-
QuestDB: Parallel ORDER BY with High-Cardinality GROUP BY -- Parallelized Top-N for high-cardinality aggregations
Distributed Databases & Replication
-
ScyllaDB Ring Architecture -- Consistent hashing ring, token ranges, data distribution
-
LeaseGuard: Raft Leases Done Right -- Correctness analysis of Raft lease-based reads
Key insights:
- Core idea: "the log is the lease" — committing a log entry implicitly grants/extends a lease until timeout; no separate lease-management messages
- Lame-duck failure mode of prior schemes: a leader that can't append entries can still send lease-extend pings, deadlocking writes; LeaseGuard fixes this by tying the lease to write progress
- Decouples elections from leases: followers no longer refuse election votes based on stale leader's lease — faster recovery after crash
- Leverages Raft's Leader Completeness property: a newly elected leader's own log tells it when the predecessor's lease expired; minimal clock-sync requirement
- Deferred-commit optimization: new leader accepts and replicates writes immediately, but defers committing until prior lease expires — eliminates write-queueing pause during transition
- Inherited lease reads: both old and new leaders can serve consistent reads during transition by checking whether query results depend on "limbo" entries
- Local timer with bounded drift suffices for most ops; only inherited-lease reads require synchronized clocks with known error bound
- TLA+ specification verified Read-Your-Writes; the inherited-lease optimization itself emerged from the formal model
- Pattern: making the safety invariant (write progress) drive the liveness mechanism (lease) eliminates an entire class of split-brain bugs
-
pg_crdt: CRDTs in PostgreSQL (Supabase) -- Automerge-based CRDT extension for PostgreSQL
-
Gossip, Paxos, Microservices in Go, and CRDTs at SoundCloud -- Distributed systems primitives in production
-
Why Isn't "majority" the Default Read Concern in MongoDB? -- MongoDB read concern tradeoffs and consistency
Messaging & Streaming
- Kafka Can Be So Much More -- Kafka beyond messaging: event store, streaming platform
- RabbitMQ vs Kafka vs Pulsar -- Architecture comparison of message brokers
- Tansu: Kafka-compatible Broker with S3/PostgreSQL/Iceberg Backends -- Kafka-protocol broker backed by S3, PostgreSQL, SQLite, Iceberg
Patterns & Architecture
-
Revisiting the Outbox Pattern (Gunnar Morling) -- Transactional outbox for reliable event publishing
Key insights:
- Core purpose: atomically update local DB and notify downstream services via Kafka without distributed transactions
- Polling-based approach: simple but problematic — DB load spikes, poor ordering when concurrent transactions involved
- Log-based CDC (superior): tail DB transaction log for outbox events in commit order; propagation within "two-digit milliseconds"
- PostgreSQL shortcut: pg_logical_emit_message() writes events directly to WAL without materializing an outbox table
- Log-based CDC preserves transactional ordering that polling cannot guarantee
- Idempotency: track monotonically increasing sequence values (DB LSNs) rather than UUIDs to detect/discard duplicates
- Backfill via watermark-based snapshotting (DBLog paper): chunked processing with deduplication for existing data
- Debezium: open-source CDC tool for outbox implementation; Quarkus provides CDI event abstractions
- Outbox > 2PC: service only needs its DB online, not also the message broker; better availability
- Pattern "deserves a very central spot in the toolbox"; DB overhead typically insignificant with log-based implementations
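The sequence-based idempotency point can be sketched as a consumer that keeps one watermark instead of a set of seen UUIDs. `IdempotentConsumer` is a hypothetical class, and the sketch assumes events reach a given consumer in commit order (for example, via a single partition), so anything at or below the watermark must be a redelivery.

```python
class IdempotentConsumer:
    """Discard redelivered outbox events by comparing monotonically
    increasing sequence values (such as commit LSNs) to a watermark."""

    def __init__(self):
        self.last_lsn = -1   # highest sequence value processed so far
        self.handled = []

    def handle(self, lsn, payload):
        if lsn <= self.last_lsn:
            return False     # duplicate: already processed
        self.handled.append(payload)
        self.last_lsn = lsn  # in a real service, persist this with the side effect
        return True
```

Compared with remembering every UUID, the watermark needs constant space, but it only works because log-based CDC delivers events in commit order.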
-
Building a Durable Execution Engine With SQLite -- SQLite as durable execution foundation
-
Database-Backed Workflow Orchestration (QCon SF) -- Databases as workflow orchestration layer
-
How Is Data Stored? (Making Software) -- Visual explainer of on-disk storage fundamentals
-
Why JSON Isn't a Problem for Databases Anymore -- Columnar approaches to semi-structured JSON data
Surveys & References
-
Readings in Database Systems, 5th Edition (Red Book) -- Bailis, Hellerstein, Stonebraker's curated database readings
-
Databases in 2025: A Year in Review (Andy Pavlo) -- Annual database industry trends
Key insights:
- PostgreSQL is now infrastructure, not differentiator: Databricks bought Neon ($1B), Snowflake bought CrunchyData ($250M), Microsoft launched HorizonDB — every cloud vendor sells managed PG
- Three serious distributed-PG efforts launched in 2025: Multigres (Vitess co-creator Sugu), Neki (PlanetScale), PgDog — first credible attack on PG horizontal-scaling gap since Citus/PG-XL
- Model Context Protocol became universal DB feature: every major DBMS shipped MCP support so LLMs can query without custom glue; security model still immature
- Vector DB hype cycle peaked and declined: VC dollars rotated to LLM companies; vector search reverted to "feature inside Postgres/Mongo" rather than standalone product category
- Five new columnar formats launched (Vortex, F3, FastLanes, Amudai, AnyBlox) but interop is broken: 94% of existing Parquet files use only 2013-era v1 features — legacy compat dominates innovation
- MongoDB sued FerretDB over patent + trademark infringement of "drop-in replacement" claim — first major DB API litigation since Oracle/Java
- Massive M&A: DataStax → IBM ($3B), Confluent → IBM (~$11B), Informatica → Salesforce ($8B), Fivetran + dbt merger
- Notable deaths: Fauna, PostgresML, Hydra, Voltron Data ($110M funded) — GPU-accelerated DBs keep failing commercially despite repeated attempts
- Pattern: commodity CPU + great optimizer beats specialized hardware; market consolidates around PG as the lingua franca
-
Are Database Researchers Making Correct Assumptions? (Murat Demirbas) -- Questioning OLTP benchmarking assumptions
Key insights:
- Interactive transactions are rarer than literature assumes: 39% of apps have none; in apps that do, only 9.6% of workload involves interactivity — validates deterministic-DB assumption
- Strictly interactive (require mid-flight external input/side effect) is 0.5% — deterministic systems' expressivity loss touches almost nothing real
- Read/write-set inferability holds for 90% of apps: ≥58% of transactions have statically determinable sets — supports deterministic locking premise
- The 27% of transactions querying by secondary attribute (not PK) blocks static lock prediction; mostly simple single-statement cases though
- Corpus bias: study covers Django + TypeORM ORMs only — heavily skewed toward web apps, excludes most enterprise systems (SAP, Oracle EBS, etc.)
- DBA/analyst terminal transactions ignored: ad-hoc human-initiated queries are operationally critical but absent from any ORM corpus
- "Convertible to one-shot with minimal code change" claim lacks empirical engineering-cost validation
- Title overpromises: paper is really about deterministic DB research's narrow niche; classic MVCC/2PL systems never depended on these assumptions
- Pattern for the reader: benchmark realism matters more than benchmark count — every workload study inherits the bias of its corpus
-
Cloudspecs: Cloud Hardware Evolution -- How cloud hardware evolution impacts database design
-
The Fastest Database You've Never Heard Of -- High-performance database architecture profile
-
SIGMOD 2026 Accepted Papers -- Full SIGMOD 2026 paper list
-
FOSDEM 2026 Databases Track -- FOSDEM 2026 database talks
-
TigerBeetle Intro (presentation) -- Deterministic high-throughput financial transaction database
-
Log-Structured File Systems (Rosenblum & Ousterhout) -- Seminal LFS paper from UC Berkeley
-
Databricks Lakebase: A New Era of Databases -- Merging data lake and database workloads
-
SQL Server 2025 General Availability -- SQL Server 2025 new features
Programming Languages
Rust, C/C++, Go, Zig, language internals, embedded, systems programming.
Rust
-
Rust Language Cheat Sheet -- Comprehensive syntax and concept reference
-
The Algebra of Loans in Rust -- Formal algebraic analysis of the borrow checker
Key insights:
- A "loan" = borrow event tied to a memory place; restrictions persist both during and after the loan's lifetime
- Three-phase analysis: (1) ops on the reference itself, (2) on the borrowed place while loan active, (3) after loan expires
- Reference types form a partial order: &T allows reborrowing to shared; &own T permits moving out; pinning restricts both
- Most loan types (mut, own, pinned) prevent all concurrent access; only &T and &pin T permit parallel shared borrows
- Uninitialization as explicit state: &own T and &uninit T treat places as uninitialized after expiry
- Pinning creates persistent constraints beyond lifetime: prevents moves/deallocation without running Drop
- &uninit T and &own T enable bidirectional conversion (initialization promotes, moving out demotes)
- Three composable tables predict allowed operations based on reference type + loan state — a decision procedure for borrow-checker extensions
- Explores speculative extensions: async pinning, non-forgettable types, in-place initialization guarantees
-
Borrow Checking, Escape Analysis, and the Generational Hypothesis -- Borrow checker and GC theory connections
-
How Rust Does Async Differently (and Why It Matters) -- Zero-cost async model vs goroutines/green threads
-
Rust Experimental Coroutines RFC -- Stackless coroutines/generators, foundation for async/await
-
Rust impl vs dyn -- When to use static vs dynamic dispatch
-
Don't Unwrap Options: Better Ways in Rust -- Idiomatic Option/Result handling patterns
Key insights:
- Avoid unwrap() in production: defers error handling, causes runtime panics, "one unwrap attracts another" making codebase fragile
- Top recommendation: let-else syntax (Rust 1.65+), e.g. let Some(v) = f() else { return Err(...); }; it clearly highlights the happy path
- ok_or/ok_or_else: convert Option to Result with descriptive error messages; use ok_or_else with closures to avoid expensive operations
- Match expressions: explicit pattern matching on Some(value)/None works reliably for all cases
- Consider changing return types: if absence = error condition, return Result instead of Option to enable natural ? operator
- Anti-pattern: using ? on an Option inside a Result-returning function does not compile; convert explicitly with ok_or()
- anyhow crate: provides .context() method for applications, but unsuitable for libraries (error type matching limitations)
- Distinguish semantically: Option for expected value absence, Result for error conditions
-
Effectively Using Iterators In Rust -- Practical Rust iterator patterns
-
Writing Rust the Elixir Way -- Lunatic runtime: Erlang-style actors in Rust with WASM isolation
-
Emitting Safer Rust with C2Rust -- Automated C-to-Rust translation lifting passes
-
From Rust to Beyond: The C Galaxy -- FFI between Rust and C
-
Rust bindgen: Bindings for Non-System Libraries -- Generating Rust FFI bindings for C/C++ libraries
-
qstr: Cache-Efficient Stack-Allocated String Types -- Small-string optimization with stack allocation
-
compio: Thread-per-Core Runtime with io_uring/IOCP -- Cross-platform async runtime using io_uring on Linux
-
Warper: Rust-Powered React Virtualisation -- Rust/WASM for high-performance list virtualization
Rust Embedded & Kernel
- Coding Guidelines for Rust in the Linux Kernel -- Official kernel Rust coding style and safety abstractions
- Rust Embedded: The Smallest no_std Program -- Minimal bare-metal Rust binary
- Embedded Rust: Singletons Pattern -- Rust ownership for safe peripheral access
- RTIC: Real-Time Interrupt-driven Concurrency -- Zero-cost concurrent embedded Rust
- Tock OS Design -- Rust-based embedded OS with capability-based security
- FreeRTOS-rust Crate -- Rust bindings for FreeRTOS
- Microsoft LiteBox: Rust-Based Sandboxing Library OS -- Microsoft's Rust library OS for lightweight sandboxing
C & C++
- C++ Core Guidelines -- Stroustrup and Sutter's C++ best practices
- Modern C++ Firmware: Proven Strategies for Tiny, Critical Systems -- Modern C++ in resource-constrained embedded contexts
- 11 C Language Features I Ignored at First -- Designated initializers, compound literals, _Generic
- C++ DataFrame -- Pandas-like DataFrame in C++ with continuous memory
- The Case for Writing Network Drivers in High-Level Languages -- Writing Linux network drivers in Rust/Go
Go
- Go by Example -- Hands-on Go through annotated examples
- Go Maps in Action -- Official Go blog on map usage, zero values, and concurrency caveats
- Understanding Escape Analysis in Go -- Stack vs heap allocation decisions
Zig
- Introduction to Zig (Book) -- Comprehensive free online Zig book
- Error Payloads in Zig -- Zig's error handling model
- Zig Can Come for Rust's Performance Crown -- Performance comparison between Zig and Rust
Language Internals & Runtimes
- Internals of CPython -- CPython interpreter deep dive
- Exploring CPython's Internals -- Official Python developer guide to CPython source
- V8 TurboFan JIT -- V8 JavaScript engine's optimizing JIT compiler
- The Path to Mojo 1.0 -- Mojo ownership model, lifetime semantics, systems-level features
- GPU Puzzles in Mojo -- Interactive GPU programming exercises
Systems Programming References
- matklad's Links Collection -- Curated by the rust-analyzer author: compilers, editors, Rust internals
- mcyoung Posts -- Compilers, linkers, systems programming
- Linux Kernel Development, 3rd Edition (Robert Love) -- Essential Linux kernel programming reference
- Advanced Programming in the UNIX Environment, 3rd Edition -- Stevens & Rago's classic UNIX systems programming
- System Calls (Beej's Guide) -- Network programming system call reference
- TUM Systems Programming Course (io_uring, eBPF, networking) -- Linux systems programming materials
- TUM Advanced Systems Programming Course -- Kernel modules, device drivers, DPDK, RDMA
- How to Create Jump Tables via Function Pointer Arrays -- Function pointer dispatch for embedded systems
See Also
- Database Systems Survey — In-depth coverage of many systems referenced in the bookmarks (Neon, DuckDB, ClickHouse, TigerBeetle)
- Kafka Internals — Detailed treatment of Kafka architecture bookmarked in the Case Studies section
- io_uring Internals — Deep dive into io_uring referenced across multiple bookmarked articles
- Rust Low-Level Programming — Unsafe Rust patterns related to the Rust bookmarks in the Programming Languages section