Replaced cheerio with htmlparser2+domutils in url-utils#821
kevinansfield wants to merge 1 commit into main
The cheerio dependency causes yarn hoisting issues for downstream consumers that also use `juice` for email compatibility. Using htmlparser2, domutils, and domhandler directly is much lighter and avoids the version constraints that cause these conflicts.
Walkthrough
The changes replace the cheerio dependency with htmlparser2, domhandler, and domutils in the url-utils package. The implementation in html-transform.ts is rewritten to use htmlparser2's parseDocument for DOM parsing and domutils utilities for element traversal and attribute access, replacing cheerio's jQuery-like API. All four corresponding test files are updated to mock htmlparser2 instead of cheerio, preserving equivalent test coverage and assertions.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/url-utils/src/utils/html-transform.ts`:
- Around line 88-95: The isInsideCode function incorrectly starts traversal at
el.parent so a <code> element itself isn't treated as protected; change the
traversal to begin at el (not el.parent) in isInsideCode so it checks the
element itself and then walks up parents, keeping the existing 'name' in node &&
node.name === 'code' check and returning true if found and false otherwise.
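
A minimal sketch of the change this comment asks for, not the PR's actual code. It assumes a structural `NodeLike` type (hypothetical, standing in for domhandler's node types) so the example runs without dependencies. Starting at `el` also matches the cheerio behavior being replaced, since `.closest()` tests the element itself before walking up.

```typescript
// Hypothetical stand-in for a parsed DOM node. Assumption: in the real code
// these are domhandler nodes, where only element nodes carry a `name`.
interface NodeLike {
    name?: string;
    parent?: NodeLike | null;
}

// Walk from the element itself up through its ancestors, so a <code> element
// counts as protected along with anything nested inside one.
function isInsideCode(el: NodeLike): boolean {
    let node: NodeLike | null | undefined = el; // start at el, not el.parent
    while (node) {
        if ('name' in node && node.name === 'code') {
            return true;
        }
        node = node.parent;
    }
    return false;
}
```

With this shape, a `<code>` element, a link nested in `<code>`, and a text node below that link all report `true`, while an unrelated `<p>` reports `false`.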
⛔ Files ignored due to path filters (1)
- yarn.lock is excluded by !**/yarn.lock, !**/*.lock
📒 Files selected for processing (6)
- packages/url-utils/package.json
- packages/url-utils/src/utils/html-transform.ts
- packages/url-utils/test/unit/utils/html-absolute-to-relative.test.js
- packages/url-utils/test/unit/utils/html-absolute-to-transform-ready.test.js
- packages/url-utils/test/unit/utils/html-relative-to-absolute.test.js
- packages/url-utils/test/unit/utils/html-relative-to-transform-ready.test.js
@vershwal I ran into issues with Dropping the direct |
ref TryGhost/SDK#821

The direct cheerio 0.22 dependency causes yarn hoisting conflicts with the cheerio ^1.0 that juice requires, leading to version resolution issues across the monorepo. This replaces all cheerio usage with a thin utility built on htmlparser2, css-select, domutils, and dom-serializer (the same packages cheerio wraps), avoiding the conflict entirely.

Benchmark results (cheerio 0.22 → html-utils):

| Scenario | Speedup |
| --- | --- |
| OEmbed link discovery | 2.3x |
| Extract links (mentions) | 5.8x |
| Mention verification | 6.8x |
| Segment extraction | 5.9x |
| Full email post-processing | 2.1x |
| Segment filtering + serialize | 2.4x |
| RSS feed generation | 18.8x |
vershwal left a comment
Nice one, thanks for this! Cheerio did feel like more than we needed here, so reducing the dependency weight and simplifying this path is spot on.
Summary
- Replaced the `cheerio` dependency in `@tryghost/url-utils` with the lighter `htmlparser2`, `domutils`, and `domhandler` packages
- The `cheerio` dependency causes yarn hoisting issues for downstream consumers that also use `juice` for email compatibility; this removes that conflict

Changes
- `html-transform.ts`: `cheerio.load()` → `parseDocument()`, jQuery selectors → `findAll()`/`hasAttrib()`/`getAttributeValue()`, `.closest('code')` → simple parent-walk helper
- `package.json`: Removed `cheerio`, added `htmlparser2` (10.1.0), `domutils` (3.2.2), `domhandler` (5.0.3)
- Tests: changed `rewire` spy targets from `cheerio` to `htmlparser2_1` to match compiled output

Benchmark results
Tested cheerio vs htmlparser2 vs `@thednp/domparser` (another zero-dep alternative) across small, medium, and large HTML inputs:
htmlparser2 is 10-32x faster than cheerio and 1.3-3.5x faster than `@thednp/domparser` across all input sizes.
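
To make the cheerio-to-htmlparser2 mapping listed under Changes concrete, here is a rough sketch of the pattern, assuming `htmlparser2` and `domutils` are installed. The helper name `collectAttributeValues` is hypothetical and not taken from `html-transform.ts`; only `parseDocument`, `findAll`, `hasAttrib`, and `getAttributeValue` are real library calls.

```typescript
import {parseDocument} from 'htmlparser2';
import {findAll, hasAttrib, getAttributeValue} from 'domutils';

// Hypothetical helper (not from the PR): collect the values of one attribute
// from every element that carries it, roughly what a cheerio selector such as
// $('[href]') followed by .attr('href') would have produced.
export function collectAttributeValues(html: string, attributeName: string): string[] {
    // Replaces cheerio.load(html): parse straight into a domhandler Document.
    const document = parseDocument(html);

    // Replaces the jQuery-style selector: findAll walks the tree recursively
    // and returns every element for which the predicate holds.
    const matches = findAll(el => hasAttrib(el, attributeName), document.children);

    return matches
        .map(el => getAttributeValue(el, attributeName))
        .filter((value): value is string => value !== undefined);
}
```

For example, `collectAttributeValues('<p><a href="/a">A</a><img src="/i.png"><a href="/b">B</a></p>', 'href')` should yield the two `href` values in document order.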