agents: add pd-hotspot-troubleshooting skill #10278

Open
lhy1024 wants to merge 2 commits into tikv:master from lhy1024:agents-add-pd-hotspot-skill-v2
@lhy1024 lhy1024 commented Mar 3, 2026

What problem does this PR solve?

Issue Number: ref #10206, ref #10159

PD currently lacks a dedicated reusable agent skill for hotspot troubleshooting.

What is changed and how does it work?

This PR adds a new agent skill:

  • Add .agents/skills/pd-hotspot-troubleshooting/SKILL.md.
  • Provide a standardized troubleshooting order:
    1. PD/TiKV monitoring
    2. Key Visualizer
    3. Top SQL / Slow SQL
    4. pd-ctl hotspot/history as fallback evidence
  • Include structured decision trees for common hotspot scenarios and rollback-aware
    tuning guidance.
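
As a rough illustration of the last step in that order (pd-ctl as fallback evidence), the sketch below collects the `hot read`, `hot write`, and `hot store` views into one stream. The `PD_ADDR` and `PD_CTL` variables and the `collect_hotspot_evidence` function name are assumptions made for this example and are not taken from the skill file; the history subcommand is left out because its arguments are not shown in this PR.

```bash
#!/usr/bin/env bash
# Sketch only: gather pd-ctl hotspot views as fallback evidence.
# PD_ADDR and PD_CTL are hypothetical knobs for this example;
# override them to match your environment.
PD_ADDR="${PD_ADDR:-http://127.0.0.1:2379}"
PD_CTL="${PD_CTL:-pd-ctl}"

collect_hotspot_evidence() {
  local view
  for view in read write store; do
    # A header line makes each section easy to find in the saved output.
    echo "## hot ${view}"
    "$PD_CTL" -u "$PD_ADDR" hot "$view"
  done
}
```

A typical invocation would be `collect_hotspot_evidence > hotspot-evidence.txt`, run only after the monitoring and SQL evidence has been reviewed.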
agents: add pd-hotspot-troubleshooting skill

Check List

Tests

  • No code

Release note

None.

Summary by CodeRabbit

  • Documentation
    • Added a comprehensive PD hotspot troubleshooting guide including step-by-step investigation procedures, monitoring strategies, decision trees for hotspot scenarios, and tuning recommendations with safety rollback procedures for TiDB/TiKV cluster management.

Add a reusable English skill for PD hotspot troubleshooting with a
monitor-first workflow.

The runbook prioritizes PD/TiKV monitoring, then Key Visualizer and
SQL evidence, and uses pd-ctl as fallback evidence collection.

Signed-off-by: lhy1024 <admin@liudos.us>
@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/needs-triage-completed do-not-merge/needs-linked-issue dco-signoff: yes Indicates the PR's author has signed the dco. labels Mar 3, 2026

ti-chi-bot bot commented Mar 3, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign bufferflies for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 3, 2026
coderabbitai bot commented Mar 3, 2026

Walkthrough

Adds a new PD hotspot troubleshooting guide documenting inputs, stepwise investigation (PD/TiKV monitoring, Top/Slow SQL, logs, SQL hotspot views, pd-ctl), a decision tree for hotspot scenarios (A–D), tuning recommendations with rollback guidance, and an evidence/output template.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| PD Hotspot Troubleshooting Guide: `.agents/skills/pd-hotspot-troubleshooting/SKILL.md` | Adds a new, comprehensive documentation file describing PD/TiKV hotspot investigation workflow, monitoring paths, SQL and hotspot views, pd-ctl fallback commands, a decision tree (A–D), tuning recommendations with rollback steps, and an output/evidence template. (≈+258 lines) |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

lgtm, approved

Suggested reviewers

  • okJiang
  • rleungx

Poem

🐰 I hopped through logs and scheduler tunes,

Traced hot spots under silver moons.
With steps and checks and rollback art,
I'll nudge the balance, calm each part.
Hooray—smooth clusters beat my heart!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: adding a new PD hotspot troubleshooting skill for the agents system. |
| Description check | ✅ Passed | The description covers the problem being solved, explains what is changed and how it works, includes a proper commit message, and completes the checklist. All required sections are addressed. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
.agents/skills/pd-hotspot-troubleshooting/SKILL.md (1)

40-40: Prefer plain conjunctions over slash-separated phrasing for readability.

On Line 40, replacing slash separators with commas plus “and” improves scanability in a high-pressure runbook context.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/skills/pd-hotspot-troubleshooting/SKILL.md at line 40, Replace the
slash-separated phrase "Check pending peers / down peers / offline peers / extra
peers / missing peers." with a plain, comma-and conjunction list for readability
— e.g., "Check pending peers, down peers, offline peers, extra peers, and
missing peers." Update the SKILL.md line containing that exact sentence
accordingly.
ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 27ee39b and 4e564f2.

📒 Files selected for processing (1)
  • .agents/skills/pd-hotspot-troubleshooting/SKILL.md

- Check whether MBps trends align with PD hotspot conclusions.
- If PD shows scheduling activity but TiKV shows no improvement, prioritize snapshot/disk/network execution bottlenecks.

## 4) Step 2: Key Visualizer Heatmap
We may not want to refer to this; it will not be available in the cloud in the future.

Okay, I will remove the Key Visualizer part.


```bash
# Hotspot views
pd-ctl -u http://<pd>:2379 hot read
```

Can we reduce reliance on in-cluster tools for diagnostics? For example, on the cloud, metrics, logs, and Top SQL are accessible from outside the cluster, whereas pd-ctl requires connecting to the cluster. Alternatively, prioritize the first three tools for troubleshooting.

In fact, pd-ctl is already the last step in hotspot diagnosis. I will try to replace the config checks with logs, and replace historical hotspot retrieval with SQL.
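
One possible shape for the SQL-based retrieval, as a minimal sketch: the helper below only prints a query against TiDB's `INFORMATION_SCHEMA.TIDB_HOT_REGIONS_HISTORY` table for a time window, so it can be reviewed before being piped to a client. The function name `build_hot_history_sql`, the selected columns, the `type` values, and the row limit are assumptions for illustration and should be checked against the TiDB version in use.

```bash
#!/usr/bin/env bash
# Sketch only: print a query for historical hot Regions in a time window.
# Arguments: start time, end time, hotspot type (assumed 'read' or 'write').
# Values are interpolated directly, so pass trusted input only.
build_hot_history_sql() {
  local start="$1" end="$2" kind="${3:-read}"
  cat <<SQL
SELECT update_time, db_name, table_name, index_name,
       region_id, store_id, flow_bytes, key_rate, query_rate
FROM information_schema.tidb_hot_regions_history
WHERE update_time >= '${start}'
  AND update_time < '${end}'
  AND type = '${kind}'
ORDER BY flow_bytes DESC
LIMIT 20;
SQL
}
```

For example, `build_hot_history_sql '2026-03-03 10:00:00' '2026-03-03 11:00:00' write` prints the statement, which can then be run through any MySQL-compatible client connected to TiDB.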

Signed-off-by: lhy1024 <admin@liudos.us>

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
.agents/skills/pd-hotspot-troubleshooting/SKILL.md (1)

45-45: Minor wording polish for consistency (/ and compound modifiers).

These lines read a bit cleaner with “and” plus consistent compound hyphenation.

✏️ Proposed doc polish
-- Check pending peers / down peers / offline peers / extra peers / missing peers.
+- Check pending peers, down peers, offline peers, extra peers, and missing peers.

-- scheduling-limit related signals.
+- scheduling-limit-related signals.

-- timeout/backoff related signals around hotspot periods;
+- timeout and backoff-related signals around hotspot periods;

Also applies to: 100-100, 103-103

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/skills/pd-hotspot-troubleshooting/SKILL.md at line 45, Replace the
phrase "Check pending peers / down peers / offline peers / extra peers / missing
peers." with a grammatically consistent version using commas and a final "and",
e.g. "Check pending, down, offline, extra, and missing peers.", and make the
same change for the identical phrases elsewhere in the document (the occurrences
of "Check pending peers / down peers / offline peers / extra peers / missing
peers.") to ensure consistent wording and hyphenation/compound formatting across
the doc.
ℹ️ Review info
Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e002b9eb-d62f-4f24-b0b4-18653476972a

📥 Commits

Reviewing files that changed from the base of the PR and between 4e564f2 and 8ec6076.

📒 Files selected for processing (1)
  • .agents/skills/pd-hotspot-troubleshooting/SKILL.md

codecov bot commented Mar 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.79%. Comparing base (b273ae0) to head (8ec6076).
⚠️ Report is 42 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10278      +/-   ##
==========================================
+ Coverage   78.54%   78.79%   +0.24%     
==========================================
  Files         520      527       +7     
  Lines       69720    70916    +1196     
==========================================
+ Hits        54763    55878    +1115     
- Misses      10994    11023      +29     
- Partials     3963     4015      +52     
Flag Coverage Δ
unittests 78.79% <ø> (+0.24%) ⬆️

Member

@okJiang okJiang left a comment

Driving hotspot troubleshooting this way seems to require a huge amount of context. Beyond the skill itself, what seems more needed is a way to obtain those inputs and send them to the AI, so that we can leverage its capabilities to troubleshoot better.
