planner: enable cd-c algorithm by default for join reorder | tidb-test=pr/2676#66349

Open
guo-shaoge wants to merge 39 commits into pingcap:master from guo-shaoge:post_cdc_impl

Conversation

@guo-shaoge
Collaborator

@guo-shaoge guo-shaoge commented Feb 24, 2026

What problem does this PR solve?

Issue Number: close #66088

Problem Summary:

  1. Enable the new join reorder implementation (CD-C) by default.
  2. Clean up all the affected test cases.

What changed and how does it work?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Summary by CodeRabbit

  • Bug Fixes

    • Improved query execution plan optimization for complex joins, particularly with outer join reordering and index range predicate handling.
    • Fixed join predicate ordering in query plans to ensure correct optimization and execution.
  • Tests

    • Enhanced test coverage for query optimization scenarios and join reordering strategies.
    • Updated test assertions for improved reliability of plan validation tests.

Signed-off-by: guo-shaoge <shaoge1994@163.com>
@pantheon-ai

pantheon-ai bot commented Feb 24, 2026

I've accepted your request and will start reviewing the pull request. I'll conduct a thorough review covering code quality, potential issues, and implementation details.

⏳ This process typically takes 10-30 minutes depending on the complexity of the changes.


@ti-chi-bot ti-chi-bot bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. sig/planner SIG: Planner labels Feb 24, 2026
@ti-chi-bot

ti-chi-bot bot commented Feb 24, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign 0xpoe for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@guo-shaoge guo-shaoge changed the title planner: enable cd-c algorithm by default for join reorder [WIP] planner: enable cd-c algorithm by default for join reorder Feb 24, 2026
@ti-chi-bot ti-chi-bot bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 24, 2026
@guo-shaoge
Collaborator Author

reminder: add inTest check if there is a remaining edge


@pantheon-ai pantheon-ai bot left a comment


P2: join reorder fallback behavior

}
})
if p.SCtx().GetSessionVars().EnableOuterJoinReorder {
p, err := joinorder.Optimize(p)

P2: joinorder.Optimize can hit a fallback where it logs WARN and returns the original join tree when it can’t consume all predicate-carrying edges (pkg/planner/core/joinorder/join_order.go:458). With CD-C becoming default-on here, this could be triggered by complex join graphs and silently skip reordering (correctness preserved, but potential perf regressions + noisy WARNs). Consider downgrading the log / adding a metric, or providing a more deterministic fallback path.
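The "add a metric" alternative to per-query WARN logging suggested above can be sketched in isolation. The counter and helper names below are hypothetical, not TiDB's actual metrics code:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// joinReorderFallbacks is a hypothetical counter standing in for a real
// TiDB metric; it lets operators watch how often CD-C falls back without
// emitting a WARN for every affected query.
var joinReorderFallbacks atomic.Int64

// optimizeWithFallback sketches the fallback shape described in the comment:
// when not all predicate-carrying edges can be consumed, reordering is
// skipped (correctness preserved) and the fallback is counted.
func optimizeWithFallback(canConsumeAllEdges bool) string {
	if !canConsumeAllEdges {
		joinReorderFallbacks.Add(1)
		return "original join tree"
	}
	return "reordered join tree"
}

func main() {
	optimizeWithFallback(false)
	optimizeWithFallback(true)
	fmt.Println(joinReorderFallbacks.Load()) // 1
}
```

A single gauge or counter like this surfaces a fallback rate on dashboards without the log noise the comment warns about.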


@pantheon-ai pantheon-ai bot left a comment


Requesting changes due to a P1 cost bug and two operability/behavior changes surfaced by making CD-C join reorder the default. See inline comments for details.

}
})
if p.SCtx().GetSessionVars().EnableOuterJoinReorder {
p, err := joinorder.Optimize(p)

P1 (see pkg/planner/core/joinorder/join_order.go:507): With default tidb_opt_cartesian_join_order_threshold=0 (CartesianJoinOrderThreshold), CD-C’s 2nd pass forces allowNoEQ=true and then multiplies cumCost by cartesianFactor (0), making cartesian/no-EQ joins cost 0. This can drive severe plan regressions/OOM once this path is default; please ensure the penalty never reduces cost (e.g., clamp factor >=1 / additive penalty) and/or don’t run the 2nd pass when cartesian is disabled.
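The clamp suggested here can be sketched in a self-contained form; the function and parameter names are illustrative, not TiDB's actual cost code:

```go
package main

import "fmt"

// costWithCartesianPenalty applies a cartesian penalty factor to a cumulative
// cost while guaranteeing the penalty never *reduces* the cost: a configured
// factor below 1 (such as the default threshold of 0) is clamped to 1.
func costWithCartesianPenalty(cumCost, cartesianFactor float64) float64 {
	if cartesianFactor < 1 {
		cartesianFactor = 1 // clamp: a zero factor must not zero out the cost
	}
	return cumCost * cartesianFactor
}

func main() {
	fmt.Println(costWithCartesianPenalty(100, 0)) // 100, not 0
	fmt.Println(costWithCartesianPenalty(100, 4)) // 400
}
```

With the clamp in place, a disabled cartesian threshold leaves no-EQ join costs unchanged instead of making them free.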

})
if p.SCtx().GetSessionVars().EnableOuterJoinReorder {
p, err := joinorder.Optimize(p)
return p, false, err

P2 (see pkg/planner/core/joinorder/join_order.go:273): tidb_opt_join_reorder_threshold is effectively ignored (threshold check commented out; useGreedy := true), making DP unreachable. This is a compatibility/operability regression for users relying on DP join reorder with threshold > 0.
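For reference, the conventional threshold semantics this comment wants preserved can be sketched as follows (names are illustrative, not the actual planner code):

```go
package main

import "fmt"

// chooseReorderAlgorithm sketches how tidb_opt_join_reorder_threshold is
// conventionally honored: join groups no larger than the threshold get
// exhaustive DP, larger groups fall back to the greedy heuristic.
func chooseReorderAlgorithm(joinGroupSize, threshold int) string {
	if joinGroupSize <= threshold {
		return "dp" // optimal ordering, but cost grows exponentially with group size
	}
	return "greedy" // heuristic ordering that scales to large join groups
}

func main() {
	fmt.Println(chooseReorderAlgorithm(5, 10))  // dp
	fmt.Println(chooseReorderAlgorithm(20, 10)) // greedy
}
```

Hard-coding the greedy branch, as the comment notes, makes the DP case above unreachable regardless of the user's threshold setting.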

failpoint.Return(p2, false, err)
}
})
if p.SCtx().GetSessionVars().EnableOuterJoinReorder {

P2: EnableOuterJoinReorder now selects which join reorder engine runs (joinorder.Optimize vs optimizeRecursive), impacting pure-inner-join queries too. That changes the var’s operational meaning beyond “whether outer joins participate”; consider a separate session var/flag for engine selection and keep this var scoped to outer-join behavior.

@ti-chi-bot

ti-chi-bot bot commented Feb 24, 2026

@pantheon-ai[bot]: adding LGTM is restricted to approvers and reviewers in OWNERS files.

Details

In response to this:

Requesting changes due to a P1 cost bug and two operability/behavior changes surfaced by making CD-C join reorder the default. See inline comments for details.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

failpoint.Return(p2, false, err)
}
})
if p.SCtx().GetSessionVars().EnableOuterJoinReorder {

P2: Routing the default join-reorder through joinorder.Optimize (via EnableOuterJoinReorder) bypasses the legacy extractJoinGroup path that checks tidb_opt_join_reorder_through_sel (TiDBOptJoinReorderThroughSel, around L53). This makes that knob ineffective in default configs. Should the new joinorder path honor it (or document/rename it as legacy-only)?

@guo-shaoge guo-shaoge changed the title [WIP] planner: enable cd-c algorithm by default for join reorder [WIP] planner: enable cd-c algorithm by default for join reorder | pr=tidb-test/2676 Mar 2, 2026
@guo-shaoge guo-shaoge changed the title [WIP] planner: enable cd-c algorithm by default for join reorder | pr=tidb-test/2676 planner: enable cd-c algorithm by default for join reorder | pr=tidb-test/2676 Mar 2, 2026
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 2, 2026
@guo-shaoge
Collaborator Author

/retest

Signed-off-by: guo-shaoge <shaoge1994@163.com>
@coderabbitai

coderabbitai bot commented Mar 2, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


This PR introduces schema-aligned join condition handling and replaces failpoint-based control flow with session variable gating for outer join reordering. Core planner logic is enhanced with condition alignment functions, and test expectations across multiple suites are updated to reflect the resulting query plan changes.

Changes

Cohort / File(s) Summary
Join Reorder Core Logic
pkg/planner/core/rule_join_reorder.go, pkg/planner/core/joinorder/conflict_detector.go
Replaced failpoint-based conditional with session variable gating for outer join reordering; introduced alignment functions (alignScalarFuncsNotNullWithSchema, alignExprsNotNullWithSchema, alignExprNotNullWithSchema) to align EQ and non-EQ conditions with joined schema; modified makeNonInnerJoin to use aligned conditions for predicate distribution.
Build Configuration
pkg/planner/core/joinorder/BUILD.bazel, pkg/planner/core/casetest/enforcempp/BUILD.bazel
Added dependencies: mysql parser package to joinorder library; testify/assert to enforcempp test suite.
Test Infrastructure
pkg/planner/core/casetest/enforcempp/enforce_mpp_test.go
Replaced require.Eventually with assert.Eventually capturing actual plan for each iteration; added conditional failure with diagnostic output including cascades value, SQL statement, expected plan, and actual plan.
Planner Testdata Updates
pkg/planner/core/casetest/binaryplan/testdata/*, pkg/planner/core/casetest/cbotest/testdata/*, pkg/planner/core/casetest/hint/testdata/*, pkg/planner/core/casetest/mpp/testdata/*, pkg/planner/core/casetest/rule/testdata/*, pkg/planner/core/casetest/tpcds/testdata/*, pkg/planner/core/casetest/testdata/*
Updated JSON plan expectations across multiple test suites; changes include node identifier shifts, predicate reordering (e.g., gt/lt precedence changes), removal of redundant warning messages, and structural plan reorganization (projection placement, exchange node reordering, join tree reshaping).
Integration Test Results
tests/integrationtest/r/planner/core/*, tests/integrationtest/r/select.result, tests/integrationtest/r/tpch.result, tests/integrationtest/r/join_reorder2.result, tests/integrationtest/r/plan_cache.result
Updated explain plan outputs reflecting new join ordering strategy: HashJoin topology changes, table scan reordering, join predicate restructuring, predicate pushdown adjustments, and Build/Probe side reassignments across complex multi-join scenarios.
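The require.Eventually → assert.Eventually change summarized above follows a capture-the-last-observation pattern, sketched here without testify (all names are illustrative, not the test suite's actual code):

```go
package main

import "fmt"

// eventuallyCapture retries a condition like assert.Eventually, but also
// captures the last observed value so a failure can print a diagnostic
// (actual vs expected plan) instead of a bare timeout.
func eventuallyCapture(poll func() (actual string, ok bool), attempts int) (string, bool) {
	var last string
	for i := 0; i < attempts; i++ {
		actual, ok := poll()
		if ok {
			return actual, true
		}
		last = actual // keep the latest observation for the failure message
	}
	return last, false
}

func main() {
	calls := 0
	plan, ok := eventuallyCapture(func() (string, bool) {
		calls++
		if calls >= 3 {
			return "stable-plan", true // plan stabilizes on the third poll
		}
		return "flaky-plan", false
	}, 5)
	fmt.Println(plan, ok) // stable-plan true
}
```

On failure, the captured `last` value is what the updated test prints alongside the cascades setting, SQL, and expected plan.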

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested labels

ok-to-test

Suggested reviewers

  • qw4990
  • terry1purcell

Poem

🐰 The planner hops with joy today,
Conditions align along the way,
Join orders bloom in schema's light,
Outer joins reorder left and right!
With testify's assert, we diagnose,
No more failpoints—the session knows! ✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 0.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions that are missing them.
  • Description check — ⚠️ Warning. The PR description lacks critical required information, including a proper issue link, a detailed problem statement, and an explanation of the changes. Resolution: add a proper issue link (close #66088), provide a detailed problem summary explaining why the cd-c algorithm is being enabled by default, and explain what changed and how it works.
✅ Passed checks (1 passed)
  • Title check — ✅ Passed. The PR title clearly identifies the main change: enabling the cd-c algorithm by default for join reorder in the planner. This is the primary objective of the changeset.





@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
pkg/planner/core/rule_join_reorder.go (2)

295-300: Remove stale failpoint placeholder block.

The commented failpoint block is dead code and now just adds confusion around the active control path.

🧹 Proposed cleanup
-	// failpoint.Inject("enableCDCJoinReorder", func(val failpoint.Value) {
-	// 	if val.(bool) {
-	// 		p2, err := joinorder.Optimize(p)
-	// 		failpoint.Return(p2, false, err)
-	// 	}
-	// })
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/planner/core/rule_join_reorder.go` around lines 295 - 300, Remove the
stale commented failpoint block that references
failpoint.Inject("enableCDCJoinReorder") and the inner call to
joinorder.Optimize(p) (variables p and p2); simply delete the entire
commented-out block so the active control flow is not obscured by dead code, and
ensure no other references to enableCDCJoinReorder remain in this scope.

291-303: Please attach targeted planner test evidence for this optimizer-path switch.

Given this changes join reorder dispatch logic, include the exact planner and integration test commands/results in the PR thread before merge.

As per coding guidelines pkg/planner/**: Run targeted planner unit tests (go test -run <TestName> -tags=intest,deadlock) and update rule testdata when needed for changes to planner rules or logical/physical plans.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/planner/core/rule_join_reorder.go` around lines 291 - 303, Add targeted
planner unit and integration tests that exercise both branches of the join
reorder dispatch (when p.SCtx().GetSessionVars().EnableOuterJoinReorder is true
and false) so CI and reviewers can see exact planner outputs; create/modify
tests (e.g., in a new or existing pkg/planner/*_test.go such as
rule_join_reorder_test.go) that invoke the planner on representative queries and
assert or capture the produced logical/physical plans for the Optimize and
optimizeRecursive paths, update any rule testdata expected outputs under
pkg/planner/testdata as needed, and run the commands to produce evidence: go
test -run <TestName> -tags=intest,deadlock (and the relevant unit test runs)
then paste the exact command lines and their outputs (planner and integration
results) into the PR thread for review.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 255b3fb and 631165e.

📒 Files selected for processing (26)
  • pkg/planner/core/casetest/binaryplan/testdata/binary_plan_suite_out.json
  • pkg/planner/core/casetest/binaryplan/testdata/binary_plan_suite_xut.json
  • pkg/planner/core/casetest/cbotest/testdata/analyze_suite_out.json
  • pkg/planner/core/casetest/cbotest/testdata/analyze_suite_xut.json
  • pkg/planner/core/casetest/enforcempp/BUILD.bazel
  • pkg/planner/core/casetest/enforcempp/enforce_mpp_test.go
  • pkg/planner/core/casetest/hint/testdata/integration_suite_out.json
  • pkg/planner/core/casetest/hint/testdata/integration_suite_xut.json
  • pkg/planner/core/casetest/mpp/testdata/integration_suite_out.json
  • pkg/planner/core/casetest/mpp/testdata/integration_suite_xut.json
  • pkg/planner/core/casetest/rule/rule_outer2inner_test.go
  • pkg/planner/core/casetest/rule/testdata/join_reorder_suite_out.json
  • pkg/planner/core/casetest/rule/testdata/outer2inner_out.json
  • pkg/planner/core/casetest/rule/testdata/outer2inner_xut.json
  • pkg/planner/core/casetest/testdata/json_plan_suite_out.json
  • pkg/planner/core/casetest/testdata/json_plan_suite_xut.json
  • pkg/planner/core/casetest/tpcds/testdata/tpcds_suite_out.json
  • pkg/planner/core/casetest/tpcds/testdata/tpcds_suite_xut.json
  • pkg/planner/core/rule_join_reorder.go
  • tests/integrationtest/r/planner/core/casetest/index/index.result
  • tests/integrationtest/r/planner/core/casetest/rule/rule_join_reorder.result
  • tests/integrationtest/r/planner/core/issuetest/planner_issue.result
  • tests/integrationtest/r/planner/core/join_reorder2.result
  • tests/integrationtest/r/planner/core/plan_cache.result
  • tests/integrationtest/r/planner/core/rule_join_reorder.result
  • tests/integrationtest/r/select.result

Signed-off-by: guo-shaoge <shaoge1994@163.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1



📥 Commits

Reviewing files that changed from the base of the PR and between 631165e and 0364655.

📒 Files selected for processing (1)
  • tests/integrationtest/r/tpch.result

├─TableReader(Build) root data:TableFullScan
│ └─TableFullScan cop[tikv] table:customer keep order:false
└─HashJoin(Probe) root inner join, equal:[eq(tpch50.lineitem.l_orderkey, tpch50.orders.o_orderkey)]
└─HashJoin root inner join, equal:[eq(tpch50.customer.c_nationkey, tpch50.nation.n_nationkey)], other cond:or(and(eq(tpch50.nation.n_name, "JAPAN"), eq(tpch50.nation.n_name, "INDIA")), and(eq(tpch50.nation.n_name, "INDIA"), eq(tpch50.nation.n_name, "JAPAN")))

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's find and examine the tpch.result file
cd /repo 2>/dev/null || cd . 
fd -t f "tpch.result" | head -5

Repository: pingcap/tidb

Length of output: 91


🏁 Script executed:

# Get context around line 447 - show lines 430-470 to see full segment
if [ -f "tests/integrationtest/r/tpch.result" ]; then
  sed -n '430,470p' tests/integrationtest/r/tpch.result | cat -n
fi

Repository: pingcap/tidb

Length of output: 3462


🏁 Script executed:

# Find Q7 SQL query - typically before the EXPLAIN output
# Look backwards from line 447 for the SQL query
if [ -f "tests/integrationtest/r/tpch.result" ]; then
  sed -n '1,450p' tests/integrationtest/r/tpch.result | grep -n "Q7" -A 30 | head -50
fi

Repository: pingcap/tidb

Length of output: 1152


🏁 Script executed:

# Search for the Q7 SQL definition in the codebase
rg "SELECT.*nation.*n1.*nation.*n2" -A 5 -B 5 2>/dev/null | head -100

Repository: pingcap/tidb

Length of output: 38


🏁 Script executed:

# Look for TPCHqueries or test data defining Q7
rg "from.*nation.*n1.*from.*nation.*n2\|from.*nation.*n1.*,.*nation.*n2" 2>/dev/null | head -20

Repository: pingcap/tidb

Length of output: 38


EXPLAIN output loses alias distinction in predicate rendering.

Line 447 displays other cond:or(and(eq(tpch50.nation.n_name, "JAPAN"), eq(tpch50.nation.n_name, "INDIA")), ...), which appears textually unsatisfiable. While the planner correctly executes the self-join via distinct table:n1 and table:n2 accesses, the predicate rendering should preserve alias information to clarify that the comparisons apply to different nation instances (e.g., eq(n1.n_name, "JAPAN"), eq(n2.n_name, "INDIA")). The current formatting masks the actual join semantics.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integrationtest/r/tpch.result` at line 447, The EXPLAIN predicate
printer is dropping relation/alias qualifiers, producing unsatisfiable text like
eq(tpch50.nation.n_name, "JAPAN") for joined instances; update the
predicate/expr formatter (e.g., the function that formats scalar expressions /
column references used by the HashJoin explain path) to include the per-source
alias or qualifier (the join input alias such as n1/n2) when rendering ColumnRef
nodes so the predicate shows eq(n1.n_name, "JAPAN") and eq(n2.n_name, "INDIA");
locate the formatter used by HashJoin explain output and change it to prefer the
relation alias (or join alias) over the fully qualified catalog.table name when
an alias is present.

Signed-off-by: guo-shaoge <shaoge1994@163.com>
@codecov

codecov bot commented Mar 2, 2026

Codecov Report

❌ Patch coverage is 80.85106% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.0531%. Comparing base (999e8f4) to head (9805e41).
⚠️ Report is 17 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #66349        +/-   ##
================================================
- Coverage   77.6833%   77.0531%   -0.6303%     
================================================
  Files          2008       1930        -78     
  Lines        549801     539861      -9940     
================================================
- Hits         427104     415980     -11124     
- Misses       120989     123863      +2874     
+ Partials       1708         18      -1690     
Flag Coverage Δ
integration 41.1245% <73.4042%> (-7.0632%) ⬇️
unit 75.9567% <80.8510%> (-0.2855%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 56.7974% <ø> (+0.0929%) ⬆️
parser ∅ <ø> (∅)
br 48.6028% <ø> (-12.2972%) ⬇️

Signed-off-by: guo-shaoge <shaoge1994@163.com>
@pantheon-ai

pantheon-ai bot commented Mar 3, 2026

Review Complete

Findings: 0 issues
Posted: 0
Duplicates/Skipped: 0



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/planner/core/joinorder/conflict_detector.go`:
- Around line 785-802: The cloned ScalarFunction's cached hash becomes stale
after you mutate its arguments in the block handling *expression.ScalarFunction
(where you call e.Clone(), assign newFunc.GetArgs()[i] = newArgs[i] after
alignExprNotNullWithSchema), so update the code to invalidate/recompute the hash
cache: after the mutation loop call expression.ReHashCode(newFunc) (or rebuild
the function via expression.NewFunctionInternal(...) instead of mutating the
clone) so the canonicalhashcode reflects the new arguments.


📥 Commits

Reviewing files that changed from the base of the PR and between 0364655 and 0b357bc.

📒 Files selected for processing (5)
  • pkg/planner/core/casetest/rule/testdata/outer_to_semi_join_suite_out.json
  • pkg/planner/core/casetest/rule/testdata/outer_to_semi_join_suite_xut.json
  • pkg/planner/core/casetest/tpch/testdata/tpch_suite_out.json
  • pkg/planner/core/casetest/tpch/testdata/tpch_suite_xut.json
  • pkg/planner/core/joinorder/conflict_detector.go

Comment on lines +785 to +802
case *expression.ScalarFunction:
	args := e.GetArgs()
	changed := false
	newArgs := make([]expression.Expression, len(args))
	for i := range args {
		newArgs[i] = alignExprNotNullWithSchema(args[i], schema)
		if newArgs[i] != args[i] {
			changed = true
		}
	}
	if !changed {
		return e
	}
	newFunc := e.Clone().(*expression.ScalarFunction)
	for i := range newArgs {
		newFunc.GetArgs()[i] = newArgs[i]
	}
	return newFunc
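The hazard flagged for this snippet can be reproduced with a self-contained toy model of a cached hash code; the types below mimic the clone-then-mutate pattern, not TiDB's actual expression structs:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"strings"
)

// expr is a toy stand-in for a scalar function with a cached hash code:
// Clone() copies the cache, so mutating the clone's args leaves the cached
// hash describing the old arguments unless it is explicitly invalidated
// (the role ReHashCode plays in the review suggestion).
type expr struct {
	args []string
	hash []byte // cached; nil means "recompute on next HashCode()"
}

func (e *expr) HashCode() []byte {
	if e.hash == nil {
		sum := sha256.Sum256([]byte(strings.Join(e.args, ",")))
		e.hash = sum[:]
	}
	return e.hash
}

func (e *expr) Clone() *expr {
	return &expr{args: append([]string(nil), e.args...), hash: e.hash}
}

func main() {
	orig := &expr{args: []string{"a", "b"}}
	orig.HashCode() // populate the cache

	clone := orig.Clone()
	clone.args[0] = "a_not_null" // in-place arg mutation, as in the snippet
	fmt.Println("stale without invalidation:",
		string(clone.HashCode()) == string(orig.HashCode())) // true

	clone.hash = nil // the fix: invalidate, forcing a recompute
	fmt.Println("stale after invalidation:",
		string(clone.HashCode()) == string(orig.HashCode())) // false
}
```

The same reasoning applies whether the cache is invalidated directly or the function is rebuilt instead of mutated.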

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect ScalarFunction struct definition and Clone() implementation
echo "=== ScalarFunction struct and Clone() implementation ==="
rg -n -A20 'type ScalarFunction struct' pkg/expression/scalar_function.go

echo -e "\n=== ScalarFunction Clone() method ==="
rg -n -A30 'func \(sf \*ScalarFunction\) Clone\(' pkg/expression/scalar_function.go

echo -e "\n=== ScalarFunction GetArgs() method ==="
rg -n -A15 'func \(sf \*ScalarFunction\) GetArgs\(' pkg/expression/scalar_function.go

echo -e "\n=== Hash-related fields/methods in ScalarFunction ==="
rg -n '(hashcode|canonicalhashcode)' pkg/expression/scalar_function.go

echo -e "\n=== Expression interface hashcode/canonicalhashcode ==="
rg -n -A5 'func \(sf \*ScalarFunction\) (HashCode|CanonicalHashCode)\(' pkg/expression/

Repository: pingcap/tidb

Length of output: 1723


🌐 Web query:

TiDB ScalarFunction Clone() implementation arg mutation safety

💡 Result:

TiDB’s (*ScalarFunction).Clone() is arg-mutation safe (with one important caveat about RetType).

Why args are safe to mutate after Clone()

ScalarFunction.Clone() builds a new ScalarFunction and clones the underlying builtinFunc via sf.Function.Clone(). (raw.githubusercontent.com)

All builtin signatures embed baseBuiltinFunc, whose cloneFrom() deep-clones the argument expressions.

So after cloning:

  • the clone’s GetArgs() slice is independent
  • each argument expression node is also cloned
  • replacing/rewriting args in the clone (e.g. clone.GetArgs()[i] = newExpr) will not affect the original

Caveat: RetType is shallow-copied

Both ScalarFunction.RetType and baseBuiltinFunc.tp are copied by pointer (explicitly marked shallow in code via struct tags / assignments). (raw.githubusercontent.com)

So do not mutate clone.RetType (or the builtin ret type) in-place unless you first clone the FieldType; otherwise you risk affecting other references to the same *types.FieldType.



🌐 Web query:

TiDB expression.ScalarFunction cached hash Clone behavior

💡 Result:

In TiDB (pkg/expression), ScalarFunction caches its hash in two fields: hashcode (used by HashCode()) and canonicalhashcode (used by CanonicalHashCode()). (raw.githubusercontent.com)

Clone behavior (cached hash):

  • (*ScalarFunction).Clone() does not copy the cached hashcode into the clone; it only clones the underlying builtin Function and copies metadata. So a cloned ScalarFunction will typically have an empty hashcode cache and will recompute it on first HashCode() call. (raw.githubusercontent.com)
  • Clone() does copy canonicalhashcode if it is already present (slices.Clone(sf.canonicalhashcode)). (raw.githubusercontent.com)

How the cache is (re)computed:

  • HashCode() returns the cached sf.hashcode if it’s already set; otherwise it calls ReHashCode(sf) to rebuild it. (raw.githubusercontent.com)
  • If you mutate a scalar function’s arguments “in place”, TiDB expects you to call ReHashCode(sf) afterward (“used after we change the argument in place”). (raw.githubusercontent.com)

Related gotcha:

  • RemapColumn() explicitly clears the cache (newSf.hashcode = nil) after rewriting args, ensuring the hash will be recomputed. (raw.githubusercontent.com)



🌐 Web query:

TiDB pkg/expression ScalarFunction architecture design

💡 Result:

TiDB pkg/expression ScalarFunction: architecture/design (high level)

1) Where ScalarFunction sits in the expression system

  • TiDB builds an expression tree (columns/constants/scalar-functions, etc.) and evaluates it during execution. The “refactor builtin” design emphasizes compile-time selection of typed function signatures so execution avoids repeated type-branching and Datum conversions. [5]
  • ScalarFunction is the expression node that represents “a function returning a value”, and it mainly wraps:
    • FuncName (case-insensitive string)
    • RetType (static return FieldType)
    • Function (a builtinFunc implementation that does the real work) [1]

2) Creation path: name → function class → concrete signature (builtinFunc)

  • The central factory is newFunctionImpl(...), used by NewFunction, NewFunctionBase, NewFunctionTryFold, etc. [1]
  • Flow (simplified):
    1. Handle a few special forms (e.g., CAST, GET_VAR, sysdate-now rewrite). [1]
    2. Lookup funcName in the global funcs registry (and extension function registry). If missing, return “function not exists”. [1]
    3. Do some argument/type preprocessing (e.g., infer NULL constant types for most functions). [1]
    4. Call functionClass.getFunction(ctx, args) to select/build the best-matching builtin signature (the actual builtinFunc). [1][4][5]
    5. Construct ScalarFunction{ FuncName, RetType, Function }. Optionally run an init/check callback. [1]
    6. Optionally run constant folding (default NewFunction folds). [1]

3) builtinFunc and “signature structs” (how a function is implemented)

  • TiDB models a SQL function as:
    • a function class (does arg-count checks, return-type inference, overload resolution), and
    • one or more signatures (concrete implementations), each implementing typed evalXX methods. This pattern is explicitly described in the refactor blog and dev guide. [4][5]
  • Best-practice implementation steps (from the dev guide):
    • Register the function in funcs map (name → functionClass)
    • Implement getFunction(...) to build a baseBuiltinFunc (via helpers like newBaseBuiltinFuncWithTp) and return a signature struct
    • Implement evalString/evalInt/evalReal/… depending on return type. [4][5]
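
The class/signature split above can be sketched in a few lines of standalone Go; plusClass, plusIntSig, and the funcs map are toy stand-ins for the real registry and are not TiDB's actual types:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// builtinFunc is the concrete, typed implementation (the "signature").
type builtinFunc interface {
	evalInt(args []int64) int64
}

// functionClass does arg checking and picks a concrete signature.
type functionClass interface {
	getFunction(args []int64) (builtinFunc, error)
}

// plusClass resolves to a single signature after checking arity.
type plusClass struct{}

func (plusClass) getFunction(args []int64) (builtinFunc, error) {
	if len(args) != 2 {
		return nil, errors.New("PLUS expects exactly 2 arguments")
	}
	return plusIntSig{}, nil
}

type plusIntSig struct{}

func (plusIntSig) evalInt(args []int64) int64 { return args[0] + args[1] }

// funcs plays the role of the global name -> functionClass registry.
var funcs = map[string]functionClass{
	"plus": plusClass{},
}

// newFunction mimics the factory: look the (case-insensitive) name up
// in the registry, then delegate to the class to build the signature.
func newFunction(name string, args []int64) (builtinFunc, error) {
	fc, ok := funcs[strings.ToLower(name)]
	if !ok {
		return nil, fmt.Errorf("function %s not exists", name)
	}
	return fc.getFunction(args)
}

func main() {
	f, err := newFunction("PLUS", []int64{2, 3})
	if err != nil {
		panic(err)
	}
	fmt.Println(f.evalInt([]int64{2, 3})) // 5
}
```

The real factory also handles special forms, NULL-type inference, and constant folding, but the name-to-class-to-signature shape is the same.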

4) Common embedding: baseBuiltinFunc carries shared metadata and behavior

  • Most builtin signature structs embed baseBuiltinFunc, which stores:
    • args []Expression, return type tp *types.FieldType
    • pushdown signature id pbCode tipb.ScalarFuncSig
    • collation/coercibility metadata
    • vectorization and thread-safety/shareability flags, plus notes that new fields must be thread-safe/immutable if the expression can be shared across sessions. [2]

5) Execution: row-based and vectorized evaluation

  • ScalarFunction exposes vectorized evaluation entrypoints like VecEvalInt/Real/String/Decimal/Time/Duration/JSON/..., which delegate to the underlying builtinFunc’s vecEvalXX methods. [1]
  • A ScalarFunction reports itself vectorizable only if both the function and its children are vectorizable (Vectorized() checks function + children). [1]
  • Builtin vectorization code is partly generator-driven (//go:generate ... *_vec.go). [2]
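
The "function + children" rule is just a tree recursion; a minimal standalone sketch (exprNode is an illustrative type, not TiDB's):

```go
package main

import "fmt"

// exprNode is a toy expression-tree node: vectorizable says whether
// this node's own implementation supports vectorized evaluation.
type exprNode struct {
	vectorizable bool
	children     []*exprNode
}

// Vectorized mirrors the check above: any non-vectorizable node
// anywhere in the subtree disables vectorization for the whole tree.
func (e *exprNode) Vectorized() bool {
	if !e.vectorizable {
		return false
	}
	for _, c := range e.children {
		if !c.Vectorized() {
			return false
		}
	}
	return true
}

func main() {
	leafOK := &exprNode{vectorizable: true}
	leafNo := &exprNode{vectorizable: false}
	fmt.Println((&exprNode{vectorizable: true, children: []*exprNode{leafOK}}).Vectorized()) // true
	fmt.Println((&exprNode{vectorizable: true, children: []*exprNode{leafNo}}).Vectorized()) // false
}
```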

6) Pushdown architecture hook: protobuf signature + PB conversion

  • Builtin functions carry a tipb.ScalarFuncSig (pbCode) used for coprocessor/TiKV/TiFlash pushdown encoding. [2]
  • PbConverter converts an expression tree into tipb.Expr protobuf; if conversion fails, TiDB treats the expression as not pushdown-able (internal error “cannot be pushed down”). [3]

Key takeaway: ScalarFunction is a thin expression wrapper around a selected, typed builtinFunc signature. Most “architecture” lives in (a) the factory (newFunctionImpl) that selects a signature from the funcs registry via functionClass.getFunction, (b) the shared baseBuiltinFunc metadata (types/collation/pushdown id/vectorization), and (c) the execution/pushdown adapters (VecEval* and PB conversion). [1][2][3][4][5]


Clear hash caches after mutating cloned ScalarFunction arguments.

On lines 798–801, the code mutates newFunc.GetArgs()[i] after Clone(). While the argument slice itself is safely deep-cloned, the ScalarFunction carries a cached canonical hash code (canonicalhashcode) that is copied into the clone. Mutating the args without invalidating this cache leaves it stale and inconsistent with the new argument values.

Add expression.ReHashCode(newFunc) after the mutation loop to clear and recompute the hash caches, or alternatively use expression.NewFunctionInternal(...) to reconstruct the function (as suggested in the original review).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/planner/core/joinorder/conflict_detector.go` around lines 785 - 802, the
cloned ScalarFunction's cached hash becomes stale after you mutate its arguments
in the block handling *expression.ScalarFunction (where you call e.Clone(), then
assign newFunc.GetArgs()[i] = newArgs[i] after alignExprNotNullWithSchema).
Update the code to invalidate/recompute the hash cache: after the mutation loop,
call expression.ReHashCode(newFunc) (or rebuild the function via
expression.NewFunctionInternal(...) instead of mutating the clone) so the
canonicalhashcode reflects the new arguments.

Signed-off-by: guo-shaoge <shaoge1994@163.com>
…/tidb into post_cdc_impl

@guo-shaoge
Collaborator Author

/retest

Signed-off-by: guo-shaoge <shaoge1994@163.com>
"Warn": [
"Warning 1815 leading hint is inapplicable, check if the leading hint table has join conditions with other tables"
]
"Warn": null
Contributor

Great, also an improvement for leading hint.

Signed-off-by: guo-shaoge <shaoge1994@163.com>
"Selection root isnull(test.t3.i)",
"└─HashJoin root left outer join, left side:TableReader, equal:[eq(test.t1.i, test.t2.i)]",
"Projection root test.t1.i, test.t2.i, <nil>->test.t3.i",
"└─HashJoin root anti semi join, left side:HashJoin, equal:[eq(test.t2.i, test.t3.i)]",
Contributor

Is this expected? outer join --> anti semi

Collaborator Author

@guo-shaoge guo-shaoge Mar 10, 2026

Because we have the rule that converts an outer join to an anti semi join (#64959), and for this case the first scenario matches (IS NULL on the join condition column), the outer join is converted to an anti semi join.

Comment on lines +750 to +751
alignedEQConds = alignScalarFuncsNotNullWithSchema(alignedEQConds, join.Schema())
alignedNonEQConds := alignExprsNotNullWithSchema(e.nonEQConds, join.Schema())
Contributor

Seems tricky, could you add some comments for these 2 lines

Signed-off-by: guo-shaoge <shaoge1994@163.com>
@tiprow

tiprow bot commented Mar 10, 2026

@guo-shaoge: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
fast_test_tiprow 9805e41 link true /test fast_test_tiprow

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@ti-chi-bot

ti-chi-bot bot commented Mar 10, 2026

@guo-shaoge: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
idc-jenkins-ci-tidb/check_dev_2 42b1dae link true /test check-dev2
pull-build-next-gen 9805e41 link true /test pull-build-next-gen
idc-jenkins-ci-tidb/build 9805e41 link true /test build
idc-jenkins-ci-tidb/unit-test 9805e41 link true /test unit-test



Labels

release-note-none Denotes a PR that doesn't merit a release note. sig/planner SIG: Planner size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

pre-refactor for join reorder conflict detection algorithm

2 participants