Home X Github About

Cockroachdb Optimizer Rules

CockroachDB Optimizer Rules: Deep Research Note

Context

CockroachDB's SQL optimizer (pkg/sql/opt) uses a declarative DSL called Optgen to define transformation rules on query expression trees. These rules are written in .opt files and compiled into Go code by the optgen tool. The system separates logical simplification (normalization) from physical plan enumeration (exploration), enabling a cost-based optimizer that is both maintainable and extensible.

This document catalogs every .opt file, lists every rule, and explains why each rule exists. It also covers the memo data structure, physical properties system, cost model, statistics pipeline, join ordering algorithm, plan caching, and DistSQL execution integration.


Architecture Overview

SQL AST
  │
  ▼
optbuilder  →  builds initial expression tree
  │
  ▼
norm.Factory  →  applies NORM rules during construction (single canonical form)
  │
  ▼
MEMO  →  stores equivalence classes of expressions
  │
  ▼
xform.Explorer  →  applies XFORM rules (generate alternative plans)
  │
  ▼
xform.Coster  →  assigns cost to each alternative
  │
  ▼
Best plan selected

Two Rule Categories

Normalization (norm)Exploration (xform)
WhenDuring expression constructionDuring optimization pass
GoalCanonical, simplified formEnumerate cost-relevant alternatives
Cost-awareNoYes (generates candidates for costing)
OutputOne result per expressionMany alternatives per group
TagNormalizeExplore

Rule Syntax

# Comment explaining WHY
[RuleName, Tag, OptionalPriorityTag]
(MatchPattern $var:TypeOrWildcard ...)
=>
(ReplacementExpression ...)
  • $x:* — bind anything to $x
  • $x:(Type) — bind only if node is Type
  • & (Func $x) — additional boolean guard
  • ^ (Func $x) — negation guard
  • HighPriority / LowPriority — control rule ordering within a pass

Operator Definitions (ops/)

These files define the expression tree node types — not rules per se, but the vocabulary that rules operate on.

ops/relational.opt — ~71 operators

Defines relational (table-returning) operators: Scan, Select, Project, InnerJoin, LeftJoin, RightJoin, FullJoin, SemiJoin, AntiJoin, IndexJoin, LookupJoin, MergeJoin, ZigzagJoin, GroupBy, ScalarGroupBy, DistinctOn, UpsertDistinctOn, EnsureUpsertDistinctOn, EnsureDistinctOn, Union, Intersect, Except, UnionAll, IntersectAll, ExceptAll, Limit, Offset, Max1Row, ProjectSet, WindowExpr, WithScan, Barrier, RecursiveCTE, Values, FakeRel, etc.

Why: Every SQL plan node type must be registered here so optgen can generate typed match/replace code for it.

ops/scalar.opt — ~173 operators

Defines scalar (value-returning) expression operators: all comparison ops (Eq, Ne, Lt, Gt, Le, Ge), arithmetic (Plus, Minus, Mult, Div, Mod), logical (And, Or, Not), string (Like, ILike, SimilarTo, RegMatch, Concat), casts, functions, aggregates (Sum, Count, Avg, Min, Max, StringAgg, etc.), subquery ops (Exists, Any, All), Case/When/Else, Tuple, Array, Coalesce, IfErr, AggFilter, AggDistinct, etc.

Why: Scalar expressions compose into filters, projections, and join conditions; they each need defined operator types for pattern matching.

ops/mutation.opt — ~16 operators

Insert, Update, Delete, Upsert, CreateTable, CreateIndex, AlterTable, etc.

Why: DML operations have special optimization considerations (FK checks, uniqueness checks, returning clauses).

ops/statement.opt — ~39 operators

DDL and control-flow: Explain, ExplainOpt, ShowTrace, Call, AlterRange, etc.

ops/enforcer.opt — 2 operators

Sort, Distribution — physical enforcers that impose ordering/distribution properties.

Why: When a plan requires sorted output and the child doesn't provide it, the optimizer injects a Sort enforcer. Same for distribution in multi-region plans.

ops/cycle.opt — 2 operators

RecursiveCTE, CycleDetector — operators for WITH RECURSIVE support.


Normalization Rules (norm/rules/)

Applied eagerly during expression construction. Each fires at most once per expression node; together they reduce the expression to a unique canonical form.


norm/rules/bool.opt — 19 rules

Purpose: Simplify boolean logic expressions algebraically.

RuleWhy Needed
NormalizeNestedAndsFlatten A AND (B AND C) to left-deep tree so other rules can traverse conjuncts uniformly
SimplifyTrueAndTRUE AND xx. Eliminates vacuous conjunct
SimplifyAndTruex AND TRUEx. Symmetric
SimplifyFalseAndFALSE AND xFALSE. Short-circuit
SimplifyAndFalsex AND FALSEFALSE. Symmetric
SimplifyTrueOrTRUE OR xTRUE. Short-circuit
SimplifyOrTruex OR TRUETRUE. Symmetric
SimplifyFalseOrFALSE OR xx. Eliminates vacuous disjunct
SimplifyOrFalsex OR FALSEx. Symmetric
SimplifyNotTrueNOT TRUEFALSE
SimplifyNotFalseNOT FALSETRUE
NegateComparisonNOT (a = b)a <> b. Pushes NOT inward to enable further simplification
EliminateNotNOT NOT xx. Double negation elimination
SimplifyAndRemoves duplicate conjuncts in AND (deduplication)
SimplifyOrRemoves duplicate disjuncts in OR
ExtractRedundantConjunct(A AND B) OR (A AND C)A AND (B OR C). Factors out common conjuncts to reduce filter complexity
SimplifyAndsFlattens list-form And with True/False handling
FlattenAndConverts nested And tree to flat conjunction list
FlattenOrConverts nested Or tree to flat disjunction list

norm/rules/scalar.opt — 32 rules

Purpose: Simplify scalar expressions: comparisons, coalesce, subqueries, type casts.

RuleWhy Needed
CommuteVarCanonicalize const op varvar op const so index constraint code only needs to handle one form
CommuteConstSame but for const groups
EliminateCoalesceCOALESCE(x) (single arg) → x
SimplifyCoalesceRemoves NULL first args, folds constant non-NULL first arg
EliminateIfErrIFERROR(non-error-expr)non-error-expr when inner can't error
EliminateCastCAST(x AS T) when x already has type T → x
EliminateExistsProjectEXISTS(Project(x))EXISTS(x). Projections don't affect existence
EliminateExistsGroupByEXISTS(ScalarGroupBy(x)) → scalar true if non-empty (scalar agg always returns a row)
SimplifyEqualityConstant folding for trivially equal/unequal expressions
NormalizeNotAnyToAllNotNOT (a = ANY(x))a <> ALL(x). Enables index usage
NormalizeAnyEqAlla = ALL(x) when x has single value → a = x[0]
EliminateAnyArraya = ANY(ARRAY[c]) with single const → scalar equality
FoldInNullx IN (NULL, ...) handling: IN with all NULLs → NULL
NormalizeInConstSort IN list, remove duplicates; enables better constraint generation
EliminateExistsLimitEXISTS(Limit(x, 0))FALSE
InlineWithInlines CTE when used exactly once and safe to do so
FoldNullCastCAST(NULL AS T) → typed NULL constant
EliminateUnnecessaryCastRemoves no-op casts between compatible types
SimplifyCaseRemoves WHEN branches that are always false, short-circuits to ELSE
SimplifyCaseWhenNullCASE WHEN NULL THEN x ELSE y ENDy
NormalizeInToContainsRewrites IN to JSON/array containment when applicable
(+ more)Various comparison normalization and null propagation rules

norm/rules/comp.opt — 26 rules

Purpose: Normalize comparison operators, especially for index constraint generation.

RuleWhy Needed
NormalizeCmpPlusConsta + 1 = 5a = 4. Moves constants to RHS so index constraints work
NormalizeCmpMinusConsta - 1 < 5a < 6. Same reason
NormalizeCmpConstMinus1 - a < 5a > -4. Flips inequality direction
NormalizeCmpDivConsta / 2 = 4a = 8. Division to RHS
NormalizeTupleEquality(a, b) = (1, 2)a = 1 AND b = 2. Decomposes for per-column constraint extraction
NormalizeTupleInConst(a,b) IN ((1,2),(3,4)) → conjunctive form; enables constraint generation
FoldCmpBetweena BETWEEN 1 AND 5a >= 1 AND a <= 5. Desugar BETWEEN
NormalizeGtToLt5 > aa < 5. Ensures variable always on left
NormalizeGeToLeSame for >=/<=
EliminateNullEqualityx = NULLFALSE (since NULL ≠ NULL in SQL equality)
NormalizeEqToIsa IS DISTINCT FROM b handling
SimplifyComparisonConstant folding for compile-time-evaluable comparisons
FoldIsNullIS NULL on non-nullable expr → FALSE; important for NOT NULL constrained columns
FoldIsNotNullIS NOT NULL on non-nullable → TRUE
NormalizeInequalityVarious inequality normalizations
(+ more)Type coercion normalizations, redundancy elimination

Key insight: Most comp rules exist to transform expressions into forms that the constraint solver (pkg/sql/opt/constraint/) can extract index constraints from. Without these, a filter like a + 1 = 5 would not generate a span [/5 - /5].


norm/rules/fold_constants.opt — 19 rules

Purpose: Evaluate constant expressions at optimization time (compile-time constant folding).

RuleWhy Needed
FoldNullUnaryop(NULL)NULL for ops that propagate nulls
FoldNullBinaryNULL op x or x op NULLNULL
FoldNullCompareNULL = xNULL (not FALSE — SQL three-valued logic)
FoldNullInConditionNULL in filter → treated as FALSE in boolean filter context
FoldBinary1 + 23. Evaluates binary ops on literal constants
FoldUnary-3-3 (literal). Evaluates unary ops on constants
FoldComparison1 < 2TRUE. Evaluates comparisons on literals
FoldCastCAST(1 AS FLOAT)1.0 at opt time
FoldFunctionPure functions on constants: length('abc')3
FoldIndirectionARRAY[1,2,3][2]2. Array indexing on constant arrays
FoldColumnAccess.field access on constant tuple
FoldEqualsAnyNullx = ANY(NULL)NULL
FoldEqualsAnyScalarx = ANY(single value)x = value
FoldAnyNullANY(NULL)NULL
FoldNullTupleInaccessibleTuple field access on NULL tuple → NULL
FoldInx IN (1, 2, 3) when x is constant → evaluate at opt time
(+ more)Various specific constant-folding cases

Why critical: Eliminates branches of the plan that can never execute, reduces predicate complexity, enables downstream rules to fire.


norm/rules/select.opt — 22 rules

Purpose: Push and simplify filters on Select nodes.

RuleWhy Needed
SimplifySelectFilters (HighPriority)Remove TRUE conjuncts, fold FALSE, flatten ANDs. Must run first
ConsolidateSelectFilters (LowPriority)Merge x >= 5 AND x <= 10 into Range op for better selectivity estimation
DeduplicateSelectFilters (LowPriority)Remove exact duplicate filter predicates
EliminateSelectSelect(input, [])input. Empty filter = no filtering
MergeSelectsSelect(Select(x, f1), f2)Select(x, f1 AND f2). Merge nested selects
PushSelectIntoProjectSelect(Project(x, cols), filter)Project(Select(x, filter), cols) when filter only refs input cols. Enables pushing filter below projection
RemoveNotNullConditioncol IS NOT NULL on NOT NULL column → TRUE. Eliminates redundant null check
InlineSelectProjectInlines Project columns into Select filter when safe
PushFilterIntoJoinLeftMove filter referencing only left side of join to left input
PushFilterIntoJoinRightMove filter referencing only right side of join to right input
PushFilterIntoJoinLeftAndRightFor filters that apply to both sides (e.g., joining col = col), push copies to both
MapFilterIntoJoinLeftUse functional dependencies to map filter to left side
MapFilterIntoJoinRightSame for right side
PushSelectCondLeftIntoJoinLeftFilterSpecialized push for complex join conditions
PushSelectIntoGroupByPush HAVING-equivalent filters below GroupBy when they only reference grouping cols
PushSelectIntoValuesFilter on constant VALUES rows — evaluate at opt time
PushSelectIntoInlinableProjectPush through project that can be inlined
EliminateJoinUnderSelectEliminate join made redundant by filter
SimplifySelectFiltersWithComputedColDerive filter implications from computed column definitions
InlineConstVarx = 5, x > 3 — substitute constant into other filters
PushSelectIntoWindowPush partition-column filters into window operators

Why this file matters: Filter pushdown is the #1 optimization for reducing rows processed. The further down the tree a filter is pushed, the fewer rows join and project operations see.


norm/rules/project.opt — 12 rules

Purpose: Simplify and eliminate unnecessary projections.

RuleWhy Needed
EliminateProjectProject(x, cols) when cols == output(x) → x. No-op projection removal
MergeProjectsProject(Project(x, c1), c2)Project(x, c2 computed from c1). Merge nested projections
EliminateProjectNoopOrderingRemove Project that only exists to enforce ordering already satisfied
SimplifyProjectionsRemove unused computed columns from projection list
EliminateComputedColRemove computed column if result is never referenced
InlineProjectInProjectInline inner projection computations into outer when cols used exactly once
NormalizeProjectExprsNormalize expressions in projections (reuse norm scalar rules)
EliminateProjectCyclesHandle projection cycles in recursive CTEs
PushProjectIntoJoinLeftMove projection below join to reduce row width during join
PushProjectIntoJoinRightSymmetric
EliminateProjectUnderScanRemove project on scan when scan already projects exact cols
SimplifyProjectSetSimplify SRF (set-returning function) projections

norm/rules/join.opt — 31 rules

Purpose: Normalize joins — eliminate, simplify, push filters, handle null-rejection.

RuleWhy Needed
EliminateJoinNoRowsJoin(x, Values[0 rows]) → empty. Join with empty input = empty
EliminateLeftJoinNoRowsLeftJoin(x, empty)Project(x, ...) with NULLs for right cols
EliminateFullJoinNoRowsBoth sides empty → empty
SimplifyJoinFiltersRemove TRUE from join ON clause
SimplifyJoinNotNullEqualitySimplify null-safety predicates on NOT NULL columns
EliminateSelfJoinJoin(x, x, x.pk = x.pk)x when provably the same rows
SimplifyLeftJoinFiltersMove LeftJoin ON condition to WHERE when it rejects NULLs (converts to InnerJoin)
SimplifyFullJoinFiltersSame for FullJoin
ConvertAntiToLeftJoinAntiJoin → LeftJoin + filter, enables more join ordering options
ConvertSemiToInnerJoinSemiJoin to InnerJoin + DistinctOn when safe
DecorrelateJoinRemove correlated subquery by converting to join (normalization side)
PushFilterIntoJoinPush outer Select filter into join ON clause
SimplifyJoinFiltersOnKeyWhen join is on full key, ON simplifications possible
MapEqualityFilterUse equalities to substitute equivalent expressions in filters
EliminateJoinWithKeyJoin on unique key where result provably not more rows → simplify
PushJoinIntoIndexJoinRewrite to IndexJoin form when beneficial
EliminateCrossJoinCrossJoin(x, Values[()])x
NormalizeJoinAnyFilterANY subqueries in join conditions → semi/anti joins
NormalizeJoinExistsFilterEXISTS in join condition → SemiJoin
NormalizeJoinNotExistsFilterNOT EXISTS → AntiJoin
NormalizeJoinInFiltercol IN (subquery) → SemiJoin
NormalizeJoinNotInFiltercol NOT IN (subquery) → AntiJoin
(+ more)Null-rejection rewrites, filter mapping via functional deps

norm/rules/decorrelate.opt — 33 rules

Purpose: Remove correlated subqueries by converting them to joins.

Correlated subqueries (where inner query references outer query cols) are expensive — they must re-execute for every outer row. Decorrelation converts them to equivalent joins that execute once.

RuleWhy Needed
DecorrelateSelectSelect(x, correlated-filter)x JOIN subquery
DecorrelateProjectCorrelatedScanPull correlated scan out of subquery into join
DecorrelateGroupByCorrelatedScanSame for aggregation subqueries
UnnestSelectInExistsEXISTS(Select(x, f)) → SemiJoin form
UnnestProjectInExistsEXISTS(Project(x, cols)) → SemiJoin
UnnestGroupByInExistsExists with aggregation
UnnestSelectInAnyANY(Select(x, f)) → SemiJoin
UnnestScalarInExistsScalar subquery = EXISTS(...) handling
InlineWithInline a CTE used in correlated position
HoistSelectExistsHoist correlated EXISTS out of Select filters
HoistSelectAnyHoist correlated ANY
HoistSelectSubqueryHoist scalar correlated subquery to outer join
HoistProjectExistsHoist correlated EXISTS from Project
HoistProjectAnyHoist correlated ANY from Project
HoistProjectSubqueryHoist scalar subquery from Project
HoistJoinSubqueryHoist from Join ON clause
HoistValuesSubqueryHoist from VALUES expressions
HoistGroupBySubqueryHoist from aggregation expressions
HoistWhereOrHandle OR of correlated subqueries
(+ 14 more)Additional hoisting and decorrelation patterns

Why critical: Without decorrelation, queries like SELECT * FROM t WHERE id IN (SELECT id FROM s WHERE s.col = t.col) run in O(n) subquery executions. With decorrelation → SemiJoin, it runs in O(1) join.


norm/rules/groupby.opt — 20 rules

Purpose: Simplify and eliminate aggregations.

RuleWhy Needed
EliminateDistinctDistinctOn(x) on a key column → x (already unique)
EliminateGroupByProjectGroupBy with empty agg list on key → identity
EliminateGroupBySingleRowGroupBy on input with at most 1 row → ScalarGroupBy or eliminate
EliminateGroupByConstColsRemove constant cols from GROUP BY (don't affect grouping)
RemoveGroupByColumnColumn in GROUP BY that's functionally determined by key → remove
EliminateAggregationCOUNT(*) on 1 row → 1; MIN/MAX(const) → const
SimplifyCountRowsCOUNT(*)COUNT_ROWS() which has an optimized path
ConvertCountToCountRowsCOUNT(not-null-col)COUNT_ROWS() since nulls can't occur
EliminateDistinctAggregationSUM(DISTINCT col) where col is key → SUM(col)
ReplaceMinWithLimitMIN(col ORDER BY col)Limit(Sort(x), 1) — often faster via index
ReplaceMaxWithLimitSame for MAX
EliminateMaxMinGroupByKeyMAX(col) GROUP BY colcol (trivially equal)
FoldGroupByHavingPush simple HAVING filters into WHERE (before aggregation)
SimplifyGroupByKeyRemove redundant grouping columns using functional dependencies
(+ more)Null handling in aggregates, functional dep simplification

norm/rules/prune_cols.opt — 23 rules

Purpose: Remove unused columns from every operator in the tree.

Every rule here follows the pattern: if the parent only uses a subset of child's output columns, pass that requirement down so the child generates fewer columns. This reduces tuple width throughout the plan.

RuleWhy Needed
PruneScanColsScan only those table columns actually needed
PruneSelectColsPropagate column pruning through Select
PruneProjectColsRemove projection exprs whose output cols are unused
PruneProjectSetColsSRF column pruning
PruneInnerJoinLeftColsInnerJoin: only output cols actually needed from left
PruneInnerJoinRightColsSame for right
PruneLeftJoinRightColsLeftJoin: right side cols that aren't output → prune (keeping NULL-fill cols)
PruneSemiJoinRightColsSemiJoin right side: only cols needed for join condition
PruneAntiJoinRightColsAntiJoin right side
PruneGroupByColsGroupBy: only keep needed grouping + aggregate result cols
PruneDistinctOnColsDistinctOn column pruning
PruneWindowColsWindow function partition/order col pruning
PruneUnionColsSet operation: only output needed cols from each side
PruneLimitColsLimit operator column pruning
PruneOffsetColsOffset operator column pruning
PruneOrdinalityRemove ordinality col if not needed
PruneMutationColsINSERT/UPDATE/DELETE: only fetch cols actually used
PruneWithScanWith (CTE) scan column pruning
(+ more)Various operator-specific prune rules

Why critical: Reducing column count reduces tuple memory, network bandwidth (in distributed plans), and enables further optimizations (narrower indexes can be used).


norm/rules/limit.opt — 13 rules

RuleWhy Needed
EliminateLimitZeroLIMIT 0 → empty result set immediately
EliminateLimitOneGroupByLIMIT 1 on input with key → may eliminate GroupBy
FoldLimitsLIMIT(LIMIT(x, 5), 3)LIMIT(x, 3). Inner limit always ≥ outer
EliminateLimitWithMaxRowsRemove LIMIT when input provably has fewer rows than limit
PushLimitIntoProjectPush LIMIT below Project to stop projecting unnecessary rows
PushLimitIntoValuesEvaluate LIMIT on constant VALUES at opt time
PushLimitIntoScanConvert LIMIT to Scan's internal row limit (avoids reading extra rows from storage)
PushLimitIntoOffsetLIMIT(OFFSET(x, o), l) → add offset to scan limit
EliminateOffsetZeroOFFSET 0 → identity
EliminateLimitUnderMax1RowMax1Row already guarantees ≤ 1 row; LIMIT n ≥ 1 is redundant
PushLimitIntoUnionAllPush LIMIT into each side of UNION ALL (can short-circuit)
(+ more)Ordering preservation under limit

norm/rules/ordering.opt — 6 rules

RuleWhy Needed
SimplifyLimitOrderingRemove ORDER BY cols not needed given the required output ordering
SimplifyGroupByOrderingRemove unnecessary ordering requirements from GroupBy
SimplifyWindowOrderingSame for window functions
SimplifyDistinctOnOrderingDistinctOn ordering simplification
SimplifyExplainOrderingRemove redundant ordering from Explain
NormalizeOrderingCanonicalize ordering expressions (consistent column references)

norm/rules/inline.opt — 9 rules

RuleWhy Needed
InlineProjectConstantsReplace references to projected constant cols with the constant directly
InlineSelectProjectInline projection into select filter (avoids extra Project node)
InlineJoinConstantsLeftConstants from left input inlined into join ON condition
InlineJoinConstantsRightSame for right
InlineWithInline CTE body when CTE referenced exactly once
InlineWithScanWhen WithScan is provably single-use, inline
InlineConstantEqualitiescol = 5 AND col > 35 > 3 → TRUE — substitute constant into other exprs
DetectValuesContradictionDetect contradiction in VALUES rows, eliminate those rows
(+ more)Expression inlining for constant propagation

norm/rules/reject_nulls.opt — 6 rules

RuleWhy Needed
RejectNullsLeftJoinLeft join ON condition rejects NULLs from right → effectively InnerJoin
RejectNullsRightJoinSymmetric
RejectNullsFullJoinFullJoin → LeftJoin or InnerJoin based on null-rejecting WHERE predicates
RejectNullsGroupByGroup by key where NULLs are filtered → simplify null handling
RejectNullsProjectPropagate null-rejection through projection
MarkUnionDistinctNullsRejectedUNION with NOT NULL context — mark appropriately

Why needed: LEFT/FULL JOIN semantics require NULL-filling when there's no match, but if the query then filters those NULLs away (e.g., WHERE right.col IS NOT NULL), the join is effectively an INNER JOIN. Converting to INNER JOIN enables more join orderings.


norm/rules/set.opt — 8 rules

RuleWhy Needed
EliminateUnionAllLeftUNION ALL(empty, x)x
EliminateUnionAllRightUNION ALL(x, empty)x
EliminateExceptAllEXCEPT ALL(x, empty)x
EliminateExceptAllRightEXCEPT ALL(empty, x) → empty
EliminateIntersectAllINTERSECT ALL(x, empty) → empty
EliminateUnionUNION(x, x) where x same expression → x
EliminateDistinctDISTINCT on key col → not needed
SimplifyUnionAllDistinctUNION into UNION ALL + DistinctOn when provably equivalent

norm/rules/numeric.opt — 9 rules

RuleWhy Needed
FoldPlusZerox + 0x
FoldZeroPlus0 + xx
FoldMinusZerox - 0x
FoldMultOnex * 1x
FoldOneMult1 * xx
FoldDivOnex / 1x
FoldNegNeg--xx
NormalizeUnaryMinus-(-x)x
FoldModOnex % 10

norm/rules/mutation.opt — 3 rules

RuleWhy Needed
SimplifyInsertOrUpdateFiltersSimplify filters in upsert operations
EliminateMutationFKChecksRemove FK checks for cols that didn't change in UPDATE
EliminateMutationUniqueChecksRemove uniqueness checks for non-updated unique columns

norm/rules/with.opt — 3 rules

RuleWhy Needed
InlineWithInline CTE when referenced once and safe (no side effects, not recursive)
EliminateWithRemove unused CTE entirely
HoistWithExpressionHoist With above operators to maximize sharing

norm/rules/window.opt — 5 rules

RuleWhy Needed
SimplifyWindowOrderingRemove redundant ordering cols from window function OVER clause
EliminateWindowNoopProjectWindow function that produces same col as input → identity
PushSelectIntoWindowPush partition-col filter below window computation
ReduceWindowPartitionColsRemove redundant partition cols using functional deps
ReduceWindowOrderingColsSame for ordering cols

norm/rules/agg.opt — 1 rule

RuleWhy Needed
ReplaceAggsWithConstantWhen grouping is on a key and aggregate input is constant, fold to constant

norm/rules/barrier.opt — 1 rule

RuleWhy Needed
EliminateBarrierRemove optimization barrier when no longer needed (barriers block certain rewrites)

norm/rules/max1row.opt — 1 rule

RuleWhy Needed
EliminateMax1RowWhen input provably has ≤ 1 row (from key/cardinality analysis), remove Max1Row wrapper

norm/rules/project_set.opt — 1 rule

RuleWhy Needed
EliminateProjectSetProjectSet (SRF cross join) with no actual SRFs → plain projection

norm/rules/cycle.opt — 2 rules

RuleWhy Needed
SimplifyCycleDetectorRemove cycle detector wrapper when it's provably not in a recursive CTE context
EliminateCycleDetectorRemove entirely when no recursive reference is possible

Exploration Rules (xform/rules/)

Applied by xform.Explorer during the optimization pass. Unlike norm rules, exploration rules generate additional alternatives in the memo without replacing the original. The cost model then selects the best one.


xform/rules/scan.opt — 2 rules

RuleWhy Needed
GenerateIndexScansFor each secondary index on a table, generate an alternative Scan using that index. Without this, only the primary index would be considered
GeneratePartialIndexScansFor each partial index whose predicate is implied by the query filter, generate a Scan. Partial indexes are smaller → faster scans

xform/rules/select.opt — 9 rules

RuleWhy Needed
GenerateConstrainedScansFor each index, attempt to extract index constraints from the SELECT filter and generate constrained Scan variants. Core of index-accelerated queries
GenerateInvertedIndexScansFor each inverted index (JSON, array, FTS), generate inverted Scan. Needed for containment/overlap queries
GeneratePartialIndexScansIndex-specific partial scan generation for each eligible partial index
GenerateZigzagJoinsWhen multiple indexes each satisfy part of the filter, generate ZigzagJoin (intersect via index merge). Often faster than single-index scan + filter
GenerateInvertedIndexZigzagJoinsZigzagJoin variant for inverted indexes
GenerateSkipScanIndexGenerate skip-scan (jumping over index groups) for multi-column indexes where leading col is unconstrained
SplitScanIntoUnionScansSplit single constrained scan into UNION ALL of multiple scans (for better parallelism on disjoint ranges)
SplitDisjunctionIntoUnionScansOR predicates → UNION ALL of scans, each handling one disjunct
GenerateLocalityOptimizedScanMulti-region: generate a scan that prefers local region replicas first

xform/rules/join.opt — 20 rules

RuleWhy Needed
ReorderJoinsGenerate all valid join orderings (using join order builder). The optimizer then picks the cheapest order based on cardinality estimates
CommuteLeftJoinLeftJoin(A, B)RightJoin(B, A). Symmetric form opens more ordering options
CommuteSemiJoinSemiJoin(A, B)InnerJoin(A, DistinctOn(B)) when safe. Opens reordering
ConvertSemiToInnerJoinAlternative semi→inner conversion without distinct
GenerateLookupJoinsFor each index on the right side of a join, generate LookupJoin variant. LookupJoin is often faster when right side is small/indexed
GenerateLookupJoinsWithVirtualColsSame but involving virtual (computed) columns in the index
GenerateInvertedJoinsGenerate join that uses inverted index for containment/proximity predicates
GenerateMergeJoinsWhen both sides can be sorted on join key, generate MergeJoin (avoids hash table, good for large sorted inputs)
GenerateHashJoinExplicitly generate HashJoin alternative (usually the default, but explicit for memo)
GenerateInvertedIndexJoinsInverted index join variant
GenerateLocalityOptimizedAntiJoinAnti-join preferring local replicas in multi-region
GenerateLocalityOptimizedLookupJoinLookup join preferring local replicas
GenerateLocalityOptimizedSemiJoinSemi-join with locality preference
GenerateGeoInvertedJoinsGeo-spatial inverted index joins
SplitDisjunctionOfJoinTermsOR in join ON clause → UNION ALL of two joins, each handling one disjunct
PairGeoInvertedJoinsPair geo inverted joins for compound geo predicates
(+ more)Parametrized join variants, cross-join reductions

xform/rules/groupby.opt — 10 rules

RuleWhy Needed
GenerateStreamingGroupByIf input is ordered on grouping cols, generate streaming (sort-based) GroupBy instead of hash-based. Often faster as avoids hash table
GenerateIndexScanGroupByIf an index provides the needed order for grouping, generate Scan + StreamingGroupBy
GenerateDistinctOnStreamingGroupByStreaming DistinctOn using sorted input
GenerateStreamingDistinctStreaming DISTINCT on sorted input
GenerateStreamingDistinctOnStreamingGroupByCombined streaming distinct + groupby
SplitGroupByOrdinalitySplit groupby by ordinality for ordered output
GenerateEfficientMinMaxMIN(col) / MAX(col) where col has an index → Limit(IndexScan, 1). Avoids full scan
GenerateEfficientGroupByMinMaxGROUP BY key, MIN(val) → IndexScan with limit per group
GenerateLocalityOptimizedGroupByMulti-region: prefer local nodes for grouping
GenerateLimitedGroupByWhen GROUP BY followed by LIMIT, propagate limit into groupby

xform/rules/limit.opt — 12 rules

RuleWhy Needed
GenerateLimitedScansLIMIT n on a Scan → generate Scan with internal row limit (stops reading early)
GenerateLimitedIndexScansSame for index scans
PushLimitIntoLookupJoinPropagate limit into lookup join (stop after n results)
PushLimitIntoUnionAllPush limit into each side of UNION ALL
PushLimitIntoDistinctOnLimit + DistinctOn → limit within distinct computation
GenerateOrderedTopKORDER BY + LIMIT → TopK operator (heap-based, avoids full sort)
GenerateStreamingTopKTopK when input partially ordered
GenerateIndexScanTopKORDER BY col LIMIT n → forward/reverse IndexScan with limit
GenerateLimitedHashGroupByLIMIT after GROUP BY → limit hash groupby output early
GenerateLimitedStreamingGroupBySame for streaming groupby
GenerateLimitedDistinctOnLIMIT + DistinctOn combination
GenerateLocalityOptimizedLimitMulti-region limit with local preference

xform/rules/project.opt — 1 rule

RuleWhy Needed
GenerateProjectedInvertedIndexScansWhen projection output requires an inverted index, generate the scan variant

xform/rules/set.opt — 1 rule

RuleWhy Needed
GenerateUnionAllExchangeUNION ALL in distributed setting → add Exchange operators to distribute work, potentially parallelizing across nodes

xform/rules/insert.opt — 1 rule

RuleWhy Needed
GenerateFastPathInsertWhen INSERT satisfies fast-path conditions (no FK checks needed, simple values), generate optimized FastPathInsert that skips normal write pipeline overhead

xform/rules/generic.opt — 2 rules

RuleWhy Needed
GenerateParameterizedJoinFor generic plans (placeholder values unknown at opt time), convert Select(Scan, placeholder-filter) to InnerJoin(Values[placeholders], Scan, filter). Allows lookup join to be planned without knowing actual values
ConvertParameterizedLookupJoinToPlaceholderScanRefine parameterized join into a placeholder-based scan for specific execution

xform/rules/cycle.opt — 2 rules

RuleWhy Needed
GenerateCycleDetectedPlanGenerate a recursive CTE execution plan with cycle detection enabled
GenerateCycleDetectedScanVariant that uses index scan for cycle detection in recursive CTEs

Summary Statistics

CategoryFilesRules/Operators
Operator definitions (ops/)6~303 operators
Normalization rules (norm/rules/)25~371 rules
Exploration rules (xform/rules/)10~60 rules
Total41~734

Key Design Principles

  1. Normalization reduces search space. By canonicalizing expressions first, exploration rules see fewer distinct forms to enumerate. E.g., NormalizeNestedAnds ensures And trees are always left-deep — exploration rules only need to match one shape.

  2. Exploration generates alternatives, not replacements. Exploration rules add new expressions to a memo group (equivalence class). The original expression stays. Cost model picks winner.

  3. Priority tags control rule ordering. HighPriority rules (like SimplifySelectFilters) run before others in the same file. LowPriority rules run after, allowing higher-priority rules to create opportunities they can then exploit.

  4. Custom functions bridge DSL and Go. Complex matching/construction that can't be expressed in Optgen syntax is implemented in *_funcs.go files and called from rules with & (FuncName ...).

  5. Functional dependencies drive many rules. The FD (functional dependency) system tracks which columns are determined by others. Many rules use FDs to prove that grouping columns are redundant, that joins can be simplified, etc.

  6. Index constraint extraction is the payoff of comp/select normalization. Rules in comp.opt that rewrite a + 1 = 5a = 4 exist specifically so the constraint solver can extract the index span [/4 - /4] and turn a full scan into a point lookup.

  7. Multi-region rules are additive. Rules like GenerateLocalityOptimizedScan add region-aware alternatives; if the query is not multi-region, these alternatives simply cost more and are never selected.


Memo Data Structure

The memo (pkg/sql/opt/memo/memo.go) is the core data structure of the optimizer. It efficiently stores a forest of logically equivalent query plans by exploiting the fact that many sub-plans are shared across alternatives.

Structure

Memo
  └── groups: []memoGroup
        └── memoGroup
              ├── relProps: *relationalProps   (shared by all exprs in group)
              ├── scalarProps: *scalarProps
              └── exprs: []memoExpr
                    └── memoExpr
                          ├── op: operator
                          ├── children: []groupID   (references to other groups)
                          ├── physProps: *physicalProps
                          └── private: interface{}  (operator-specific data)

Group identity: Each group is identified by a fingerprint = (operator, child group IDs). Before inserting a new expression, the factory computes its fingerprint. If a group with that fingerprint already exists, the expression joins that group. If not, a new group is created.

Equivalence invariant: All memo-expressions in a group are logically equivalent — they produce the same rows (modulo ordering). Normalization rules fire at construction time (via norm.Factory) so non-normalized forms never enter the memo.

Expression extraction (binding): Transformation rules don't traverse raw Go data structures. Instead, they use a binding step: patterns are matched against the memo, extracting only the parts the rule needs. Pattern leaves match opaquely (entire subtree as a group reference); pattern trees recursively extract for manipulation.

Exploration vs Normalization in the Memo

PhaseWhenAdds to memo?Replaces?
NormalizationDuring constructionYes (new group)Replaces root (single canonical form)
ExplorationDuring optimizationYes (new expr in existing group)Never — original stays

norm.Factory.ConstructXxx() builds normalized expressions and inserts them, firing norm rules eagerly. xform.Explorer.exploreGroup() iterates over groups, applies exploration rules, adds new alternatives to existing groups.

CockroachDB uses Cascades (Graefe 1995) style optimization — interleaved exploration and costing — rather than Volcano's separate expansion and costing phases.

OptimizeGroup(group, requiredProps):
  exploreGroup(group)           // fire all applicable xform rules, add alternatives
  for each expr in group.exprs:
    cost = CostExpr(expr, requiredProps)
    track minimum cost expr
  return bestExpr

CostExpr(expr, requiredProps):
  for each child group:
    OptimizeGroup(child, derivedRequiredProps)  // recursive
  cost += operatorCost(expr, childCardinalities)
  if !childSatisfiesRequiredProps:
    inject enforcer, add enforcer cost

Branch-and-bound pruning: If an expression's lower-bound cost already exceeds the best known cost for the group, its subtree is pruned. This is critical for large join search spaces.


Logical Properties

Logical properties are attached to memo groups (shared by all expressions in the group). Computed once per group when first constructed.

Relational Logical Properties

Defined in pkg/sql/opt/props/logical.go:

PropertyTypeMeaning
OutputColsColSet (bitmap)Columns produced by expression
OuterColsColSetColumns referenced but not produced (free variables — indicates correlated subquery)
NotNullColsColSetColumns provably never NULL
CardinalityCardinalitySetMin/max row estimate (exact where possible)
FuncDepsFuncDepSetFunctional dependencies (see below)
StatsStatisticsRow count, column stats, histograms

Scalar Logical Properties

PropertyTypeMeaning
InputColsColSetColumns consumed
OuterColsColSetFree variables (correlated)
HasVolatileFuncboolContains NOW(), random() etc — blocks caching
HasSubqueryboolContains subquery — triggers decorrelation

Functional Dependencies (FD Set)

pkg/sql/opt/props/func_dep.go. The FD set tracks:

  1. Strict keys — column set S is a key if no two rows have identical values on S. All cols must be NOT NULL. Derived from PRIMARY KEY, UNIQUE NOT NULL indexes.
  2. Lax keys (weak keys) — uniqueness on non-NULL values only. From UNIQUE indexes with nullable columns.
  3. Determiners{A} → {B}: column A functionally determines B. From computed column definitions.
  4. Equivalency groups{A = B = C}: columns proven equal via join or filter. Enables predicate inference across joins.
  5. Constant columns{A = const}: column has been equated to a constant. From col = 5 filters.

How FDs drive rules:

  • EliminateGroupBySingleRow: if grouping columns contain a key → at most one group → GroupBy unnecessary.
  • SimplifyGroupByKey: if a non-key column is functionally determined by other grouping cols → remove it.
  • MapFilterIntoJoinLeft/Right: if a.x = b.x via join condition and a.x has FD to another col → predicate can be mapped.
  • Decorrelation: outer column FDs help prove that hoisted subquery join is a key-join → distinct is not needed.

Physical Properties System

Physical properties describe how data is presented, not what data — they live outside relational algebra.

Tracked Physical Properties

PropertyDescriptionEnforcer
PresentationColumn ordering and naming in outputNo-op reproject
OrderingRow sort order (ORDER BY)Sort operator
DistributionWhich KV ranges / regions own rowsDistribution enforcer

Required vs Provided

Required properties flow top-down: parent specifies what it needs from child. Originate from:

  • ORDER BY → ordering requirement
  • DISTINCT → key requirement (handled via norm rules, not physical props)
  • Merge join → requires both inputs sorted on join keys
  • Lookup join → requires left input sorted on prefix matching index

Provided properties flow bottom-up: each operator specifies what physical properties its output satisfies given its child's properties. E.g.:

  • IndexScan(idx) provides ordering = idx.columns (ascending or descending).
  • HashJoin provides no ordering (output is unordered).
  • MergeJoin provides merged ordering on join key.
  • Sort(input, ordering) provides the specified ordering.

Mismatch → Enforcer injection: When a required property is not provided by any alternative in a group, the optimizer introduces an enforcer:

Sort(child, ordering) -- satisfies ordering requirement
Distribution(child)   -- routes data to required gateway node

Enforcers have costs (Sort = O(n log n) I/O + CPU; Distribution = network transfer).

Multi-Region Distribution Properties

For REGIONAL BY ROW tables, each row has a crdb_region hidden column. The optimizer:

  1. Adds Distribution required property at query root (gateway region).
  2. Generates GenerateLocalityOptimizedScan — scans local region first with LIMIT.
  3. If local scan produces insufficient rows, a follow-up scan covers other regions.
  4. GenerateLocalityOptimizedAntiJoin, GenerateLocalityOptimizedLookupJoin extend this to join operators.

This turns cross-region queries into cheap local-first queries for most workloads.


Cost Model

pkg/sql/opt/xform/coster.go. Costs are dimensionless "time units" — relative, not absolute.

Cost Components

Cost(expr) = SeqIOCost + RandIOCost + CPUCost + NetworkCost

SeqIOCost  = rows × scanFactor × seqIOCostFactor
RandIOCost = rows × randomIOCostFactor  (for random-access index lookups)
CPUCost    = rows × cpuCostFactor × perRowWork
NetworkCost = rows × rowWidth × networkCostFactor  (for distributed plans, 0 for local)

Default cost factors (normalized, not real seconds):

  • seqIOCostFactor = 1.0
  • randomIOCostFactor = 4.0 (random I/O ≈ 4× sequential)
  • cpuCostFactor = 0.01 (CPU cheap vs I/O)
  • networkCostFactor = 1.0 (same as seq I/O by default)

Per-Operator Cost Logic

OperatorDominant Cost Factor
Scan (full)SeqIO × estimated row count
IndexScan (constrained)RandIO × selectivity × rows
LookupJoinRandIO × left_rows × matches_per_row
HashJoinCPU × (left_rows + right_rows) + possible spill IO
MergeJoinCPU × rows (requires pre-sorted inputs; amortizes Sort cost)
SortCPU × n log n (dominated by comparison cost)
IndexJoinRandIO × rows (one random KV read per row)
ZigzagJoinRandIO × (rows_in_index1 + rows_in_index2) × selectivity

Spill Cost

When hash join or sort input exceeds work_mem equivalent, the optimizer adds disk spill cost:

spillCost = (rows × rowWidth / diskBlockSize) × seqIOCostFactor × spillFactor

Network Cost in Distributed Plans

For queries spanning multiple nodes, each data shuffle (hash join redistribution, aggregation scatter-gather) incurs:

networkCost = rowBytes × networkCostFactor × hops

Gateway node is cheapest (0 network); remote node data costs proportional to transfer size.

Optimal Substructure

Cost model satisfies optimal substructure: Cost(tree) = Cost(root) + Σ Cost(subtrees). This enables dynamic programming — optimize each group independently, then compose. Critical invariant for Cascades to work correctly.


Statistics & Cardinality Estimation

pkg/sql/opt/memo/statistics_builder.go + pkg/sql/stats/.

What Gets Collected

CREATE STATISTICS / auto-stats collects per table:

  • Row count: exact count at scan time
  • Null count: per column
  • Distinct count: HyperLogLog (HLL-TailCut+ variant). Parallel-friendly: HLL sketches merge across nodes.
  • Histogram: equi-depth histogram on first column of each index. Also collected for any explicitly requested column.

Histogram format: up to 200 buckets, each bucket stores (upper_bound, row_count, distinct_count, null_count). Equi-depth = each bucket covers approximately equal number of rows.

Multi-Column Statistics

By default collects multi-column stats on the prefix columns of each index (columns that appear together in WHERE clauses for that index). Purpose: estimate correlation between columns — e.g., (country, city) tuple cardinality is much lower than distinct(country) × distinct(city) if city is nested within country.

Automatic Stats Refresh

pkg/sql/stats/automatic_stats.go:

  • Trigger: probabilistic after each mutation. After a batch of mutations on table T: P(refresh) = (rows_mutated / total_rows) / threshold where threshold ≈ 0.20 (20% changed).
  • Background job: global serialization, one stats collection at a time. If node fails, another node adopts.
  • Throttling: adaptive sleep inserted every 10K rows scanned. Sleep duration scales with CPU utilization.
  • Sampling: reservoir sampling for histograms — sample S rows from table scan, compute histogram from sample.

Statistics Forecasting

When ≥3 historical stats collections exist and row counts fit a linear trend, CockroachDB generates forecasted statistics extrapolating to the current time. Used between refreshes when table is changing rapidly.

forecasted_row_count = last_row_count + slope × (now - last_collection_time)
forecasted_distinct_count[col] = scale proportionally

Forecasting prevents stale stats from causing catastrophic plan regressions during bulk loads.

Selectivity Estimation

statistics_builder.go propagates histograms up the operator tree:

Select(input, filter):
  sel = selectivity(filter, input.stats)
  output.rowCount = input.rowCount × sel
  output.histogram[col] = input.histogram[col].ApplySelectivity(sel, filter.bounds)

Join(left, right, condition):
  if equijoin on col:
    sel = 1 / max(distinct(left.col), distinct(right.col))
  output.rowCount = left.rowCount × right.rowCount × sel
  // histogram intersection for equijoin columns

Selectivity for range predicates: computed by scanning histogram buckets that overlap the predicate range, summing their row counts.

Null handling: null count tracked separately. IS NULL selectivity = null_count / row_count. IS NOT NULL = 1 - (null_count / row_count).

Distinct count propagation through joins: for join output column from left, distinct(col) = min(distinct(left.col), output.rowCount).


Join Ordering

pkg/sql/opt/xform/join_order_builder.go. The ReorderJoins exploration rule delegates to this.

Algorithm

CockroachDB uses a DPhyp (Dynamic Programming on Hypergraphs) variant — the same algorithm used by HyPer, Umbra, DuckDB. Enumerate all subsets of joined relations, find minimum-cost join tree for each subset.

DPhyp:
  for each subset S of relations (bottom-up, size 1 → n):
    for each valid partition (S1, S2) of S where S1 ∪ S2 = S:
      if there exists a join predicate between S1 and S2:
        cost = Cost(BestPlan(S1)) + Cost(BestPlan(S2)) + joinCost(S1, S2)
        BestPlan(S) = min(BestPlan(S), cost)

Hyperedges: join predicates that reference more than 2 tables (e.g., from OR conditions or transitive predicates) are represented as hyperedges. DPhyp handles these without enumerating invalid cross products.

Cardinality Limit & Fallback

reorderJoinsLimit session variable (default 8). When join count > limit:

  • Greedy fallback: iteratively pick the pair (left, right) with cheapest immediate join cost. O(n²) instead of O(2ⁿ).
  • Greedy avoids cross products but may miss globally optimal orderings.

For n ≤ 8: full DPhyp, optimal for tree-structured queries and near-optimal for cyclic queries. DP table entries = O(2ⁿ), manageable for n ≤ 8.

What DPhyp Considers

  • All join types: inner, left, full, semi, anti (with correctness constraints on associativity/commutativity)
  • All plan shapes: left-deep, right-deep, bushy
  • Join implementations: hash join, merge join, lookup join (index), zigzag join
  • Cross joins excluded unless no predicate exists (forced by schema)

Cardinality Estimation for Join Ordering

Row count for join output = left.rows × right.rows × joinSelectivity. Join selectivity:

  • Equijoin on indexed column: 1 / max(distinct(left.col), distinct(right.col))
  • Non-equijoin: default selectivity 0.333 (heuristic)
  • FK join (foreign key relationship detected): selectivity = 1 / distinct(pk_side) — usually very precise

Plan Caching & Prepared Statements

pkg/sql/plan_opt.go, pkg/sql/querycache/.

Five Planning Stages

1. Parse      → SQL string → AST
2. OptBuild   → AST → relational expression tree (semantic analysis, name resolution)
3. Normalize  → norm.Factory applies all norm rules → canonical memo
4. Explore    → xform.Optimizer applies xform rules + costs → best plan
5. ExecBuild  → optimized plan → execinfrapb.ProcessorSpec tree (DistSQL specs)

What Gets Cached

ProtocolCached StateReuse Condition
Simple query (same SQL string)Final explored plan (stage 4 output)Identical SQL string, same schema version
PREPARE once, execute N timesNormalized memo through stage 3Same schema; re-runs stages 4+5 per execution
Generic plans (v23.1+)Full plan including exploration (stage 4)Parameter-independent plan; checked after 5 custom executions

AssignPlaceholders

Between stages 3 and 4 for prepared statements: substitute $1, $2, ... with actual values. This triggers additional normalization (e.g., FoldBinary on col + $1 with $1=5 → col + 5 → a = 4 after comp rules). The explore stage then sees concrete values — better selectivity estimates, better plans.

Generic plans skip AssignPlaceholders — the plan is optimized with placeholder types but no values. The optimizer generates parameter-independent execution. Chosen when: plan shape doesn't change across values (e.g., simple point lookup) OR custom plan cost is consistently ≥ generic plan cost.

Replanning Triggers

Cache invalidation occurs when:

  • Schema change: column added/removed, index created/dropped, table renamed/dropped.
  • Database context change (SET database).
  • Session setting affecting planning (SET optimizer_*, SET reorder_joins_limit).
  • Statistics refresh on a table used by the query (triggers replan in background via sql.planner.replanQuery()).

Query Fingerprinting

Queries are fingerprinted by normalizing the SQL string: strip literal values, normalize whitespace, replace $N with ?. Fingerprint → statement bundle in system.statement_statistics. Used for plan regression detection and EXPLAIN BUNDLE.


DistSQL Execution Integration

pkg/sql/distsql_physical_planner.go, pkg/sql/colflow/.

Pipeline: Optimizer → Execution

Optimizer best plan (memo tree)
  ↓ execbuilder.Builder.Build()
planNode tree (scanNode, filterNode, joinNode, ...)
  ↓ DistSQLPlanner.PlanAndRun()
PhysicalPlan (ProcessorSpec DAG)
  ↓ FlowSpec per node
  ↓ SetupFlowRequest RPC to each node
colflow.VectorizedFlow (operator DAG)
  ↓ BatchFlowCoordinator.Run()
results → client

ProcessorSpec Types

pkg/sql/execinfrapb/processors.proto:

SpecCorresponds ToNotes
TableReaderSpecScan / IndexScanSpecifies key spans, index, column IDs
FiltererSpecSelectPost-processor on TableReader
HashJoinerSpecHashJoinLeft/right types, equality cols, join type
MergeJoinerSpecMergeJoinOrdering spec for both inputs
SorterSpecSortOutput ordering, mem limit
AggregatorSpecGroupBy / AggAgg functions, grouping cols, partial vs final
ProjectSetSpecSRF projectionSet-returning functions
NoopCoreSpecFinal merge at gatewayUsed to collect distributed results

Physical Planning Decisions

DistSQLPlanner.checkSupportForPlanNode() decides: local execution vs distributed.

  • Local: simple queries on single node, admin statements, DDL.
  • Distributed: queries where table ranges span multiple nodes. Scan processors placed on nodes owning the data ranges (locality-aware placement).

For joins: if one side fits in memory and the other is distributed → broadcast join (copy small side to all nodes). Otherwise → hash-shuffle join (redistribute both sides by join key hash).

Vectorized Execution Engine

pkg/sql/colflow/, pkg/sql/colexec/:

  • All operators implement colexecop.Operator with Next() coldata.Batch returning columnar batches of 1024 rows.
  • Columnar format: each column is a typed Go slice + validity bitmap (for NULLs). No row-oriented iteration.
  • Cross-node data transport: colrpc.Outbox serializes batches as Apache Arrow IPC format, sends via gRPC stream. colrpc.Inbox deserializes.
  • execgen tool generates type-specialized operator implementations for all type combinations at compile time — no runtime reflection.

Explain Output

EXPLAIN (DISTSQL, VERBOSE) SELECT ...

Returns per-node processor graph. EXPLAIN ANALYZE (DISTSQL) adds runtime row counts, memory usage, and time per processor — essential for diagnosing plan quality issues.


Apply Operator & Decorrelation Internals

The decorrelate.opt rules translate correlated subqueries into joins. The mechanism:

Apply Operator

A correlated subquery is initially represented as an Apply operator:

Apply(left: R, right: S(outer_cols), join_type)
  -- right side references columns from left; must re-execute per left row
  -- O(|R|) subquery executions

Decorrelation Algorithm (Unnesting)

Rules in decorrelate.opt transform Apply into regular Join by hoisting the correlated expression:

Step 1: Identify outer column references in right subtree
Step 2: For each correlated operator in right (Select, Project, GroupBy, etc.):
  - "Hoist" the correlated computation above the Apply
  - Replace inner correlation reference with a new column
Step 3: Apply becomes InnerJoin / SemiJoin / AntiJoin
  - Join condition = original correlation predicate
  - Right side is now a standalone expression (no outer column refs)

Example:

SELECT * FROM t WHERE id IN (SELECT id FROM s WHERE s.col = t.col)
-- Initial: Apply(t, Select(Scan(s), s.col = t.col), SemiJoin)
-- After DecorrelateSelect:
--   SemiJoin(t, Scan(s), t.col = s.col)
-- t.col = s.col is now a regular join predicate; executes once

When decorrelation fails: subquery references multiple levels of outer scope, or contains volatile functions, or uses complex aggregation patterns that can't be unnested. Fallback: apply operator with subquery cache (memo per distinct outer-col value).