agents: add pd-hotspot-troubleshooting skill #10278
lhy1024 wants to merge 2 commits into tikv:master
Add a reusable English skill for PD hotspot troubleshooting with a monitor-first workflow. The runbook prioritizes PD/TiKV monitoring, then Key Visualizer and SQL evidence, and uses pd-ctl as fallback evidence collection. Signed-off-by: lhy1024 <admin@liudos.us>
📝 Walkthrough
Adds a new PD hotspot troubleshooting guide documenting inputs, stepwise investigation (PD/TiKV monitoring, Top/Slow SQL, logs, SQL hotspot views, pd-ctl), a decision tree for hotspot scenarios (A–D), tuning recommendations with rollback guidance, and an evidence/output template.
🧹 Nitpick comments (1)
.agents/skills/pd-hotspot-troubleshooting/SKILL.md (1)
40-40: Prefer plain conjunctions over slash-separated phrasing for readability. On Line 40, replacing slash separators with commas plus “and” improves scanability in a high-pressure runbook context.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/skills/pd-hotspot-troubleshooting/SKILL.md at line 40, Replace the slash-separated phrase "Check pending peers / down peers / offline peers / extra peers / missing peers." with a plain, comma-and conjunction list for readability — e.g., "Check pending peers, down peers, offline peers, extra peers, and missing peers." Update the SKILL.md line containing that exact sentence accordingly.
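The suggested rewording can be applied mechanically. The following is a minimal sketch, assuming the sentence appears verbatim in the file; `fix_peer_list` is a hypothetical helper name, not something from this PR.

```bash
#!/usr/bin/env bash
# fix_peer_list (hypothetical helper) reads text on stdin and rewrites the
# slash-separated peer list as a comma list with a final "and".
fix_peer_list() {
  sed 's|Check pending peers / down peers / offline peers / extra peers / missing peers\.|Check pending peers, down peers, offline peers, extra peers, and missing peers.|'
}

# Demonstrate on the sentence quoted in the review comment.
printf '%s\n' '- Check pending peers / down peers / offline peers / extra peers / missing peers.' | fix_peer_list
```

To apply it to the file, one option is to redirect the output to a temporary file and move it over `SKILL.md`, which avoids relying on `sed -i` behaving the same way across platforms.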
- Check whether MBps trends align with PD hotspot conclusions.
- If PD shows scheduling activity but TiKV shows no improvement, prioritize snapshot/disk/network execution bottlenecks.

## 4) Step 2: Key Visualizer Heatmap
We may not want to refer to this; it will not be available in the cloud in the future.
Okay, I will remove the Key Visualizer part.
```bash
# Hotspot views
pd-ctl -u http://<pd>:2379 hot read
```
Can we reduce reliance on internal cluster tools for diagnostics? For example, metrics, logs, and TopSQL on the cloud are accessible from outside the cluster, whereas pd-ctl requires connecting to the cluster. Alternatively, prioritize the first three tools for troubleshooting.
In fact, pd-ctl is also used last for hotspot diagnosis. I will try to replace the config with logs, and replace historical hotspot retrieval with SQL.
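As an illustration of retrieving historical hotspots through SQL instead of pd-ctl, here is a minimal sketch that builds a query against TiDB's `INFORMATION_SCHEMA.TIDB_HOT_REGIONS_HISTORY` table. The helper name `build_hot_history_sql`, the time window, and the `LIMIT` are assumptions for illustration; none of them come from this PR.

```bash
#!/usr/bin/env bash
# build_hot_history_sql (hypothetical helper) prints a query for hot Regions
# recorded between two timestamps, ordered by traffic volume.
# Arguments are interpolated as-is, so pass trusted values only.
build_hot_history_sql() {
  local start="$1" end="$2"
  cat <<SQL
SELECT UPDATE_TIME, DB_NAME, TABLE_NAME, REGION_ID, STORE_ID, TYPE, HOT_DEGREE, FLOW_BYTES
FROM INFORMATION_SCHEMA.TIDB_HOT_REGIONS_HISTORY
WHERE UPDATE_TIME >= '${start}' AND UPDATE_TIME < '${end}'
ORDER BY FLOW_BYTES DESC
LIMIT 20;
SQL
}

# Print the query for an illustrative one-hour window.
build_hot_history_sql '2024-01-01 00:00:00' '2024-01-01 01:00:00'
```

The printed query can be piped to any MySQL-compatible client connected to TiDB, so this path needs only SQL access and no direct connection to PD.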
Signed-off-by: lhy1024 <admin@liudos.us>
🧹 Nitpick comments (1)
.agents/skills/pd-hotspot-troubleshooting/SKILL.md (1)
45-45: Minor wording polish for consistency ("/" and compound modifiers). These lines read a bit cleaner with “and” plus consistent compound hyphenation.
✏️ Proposed doc polish
-- Check pending peers / down peers / offline peers / extra peers / missing peers.
+- Check pending peers, down peers, offline peers, extra peers, and missing peers.
-- scheduling-limit related signals.
+- scheduling-limit-related signals.
-- timeout/backoff related signals around hotspot periods;
+- timeout and backoff-related signals around hotspot periods;
Also applies to: 100-100, 103-103
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/skills/pd-hotspot-troubleshooting/SKILL.md at line 45, Replace the phrase "Check pending peers / down peers / offline peers / extra peers / missing peers." with a grammatically consistent version using commas and a final "and", e.g. "Check pending, down, offline, extra, and missing peers.", and make the same change for the identical phrases elsewhere in the document (the occurrences of "Check pending peers / down peers / offline peers / extra peers / missing peers.") to ensure consistent wording and hyphenation/compound formatting across the doc.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## master #10278 +/- ##
==========================================
+ Coverage 78.54% 78.79% +0.24%
==========================================
Files 520 527 +7
Lines 69720 70916 +1196
==========================================
+ Hits 54763 55878 +1115
- Misses 10994 11023 +29
- Partials 3963 4015 +52
okJiang left a comment
This seems to require a huge amount of context to drive hotspot troubleshooting. Beyond the skill itself, what seems more necessary is a way to obtain those inputs and send them to the AI, so that we can leverage its capabilities to troubleshoot better.
What problem does this PR solve?
Issue Number: ref #10206, ref #10159
PD currently lacks a dedicated reusable agent skill for hotspot troubleshooting.
What is changed and how does it work?
This PR adds a new agent skill:
.agents/skills/pd-hotspot-troubleshooting/SKILL.md
- pd-ctl hotspot/history as fallback evidence
- tuning guidance
Check List
Tests
Release note
Summary by CodeRabbit