Replaced cheerio with htmlparser2+domutils in url-utils #821

Open
kevinansfield wants to merge 1 commit into main from no-cheerio

@kevinansfield kevinansfield commented Mar 14, 2026

Summary

  • Replaced the cheerio dependency in @tryghost/url-utils with the lighter htmlparser2, domutils, and domhandler packages
  • The cheerio dependency causes yarn hoisting issues for downstream consumers that also use juice for email compatibility — this removes that conflict
  • Cheerio's jQuery-like API and CSS selector engine were overkill here; only DOM traversal and attribute reading are needed

Changes

  • html-transform.ts: cheerio.load() → parseDocument(), jQuery selectors → findAll()/hasAttrib()/getAttributeValue(), .closest('code') → simple parent-walk helper
  • package.json: Removed cheerio, added htmlparser2 (10.1.0), domutils (3.2.2), domhandler (5.0.3)
  • 4 test files: Updated rewire spy targets from cheerio to htmlparser2_1 to match compiled output

Benchmark results

Tested cheerio vs htmlparser2 vs @thednp/domparser (another zero-dep alternative) across small, medium, and large HTML inputs:

Small HTML (46 chars, 10,000 iterations):
  Library               Total(ms)  Per-op(μs)  Relative
  cheerio                   543.2        54.3    31.66x
  htmlparser2                17.2         1.7     1.00x
  @thednp/domparser          22.0         2.2     1.28x

Medium HTML (796 chars, 5,000 iterations):
  Library               Total(ms)  Per-op(μs)  Relative
  cheerio                   522.9       104.6    15.45x
  htmlparser2                33.8         6.8     1.00x
  @thednp/domparser         115.3        23.1     3.41x

Large HTML (50 blocks, 43,209 chars, 500 iterations):
  Library               Total(ms)  Per-op(μs)  Relative
  cheerio                  1701.3      3402.6    10.45x
  htmlparser2               162.9       325.7     1.00x
  @thednp/domparser         565.8      1131.5     3.47x

htmlparser2 is 10-32x faster than cheerio and 1.3-3.5x faster than @thednp/domparser across all input sizes.
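
For readers who want to check the table arithmetic, here is a minimal sketch of how the per-op and relative columns follow from the totals. The `summarize` helper and its types are hypothetical names, not part of the PR or of any benchmark script shown here. Because the totals in the tables are rounded, ratios recomputed this way can differ from the reported column in the last digit (for example about 10.44x rather than 10.45x for cheerio on the large input).

```typescript
// Hypothetical helper types: one row per library, total wall time in ms.
interface BenchRow {
    library: string;
    totalMs: number;
}

interface BenchSummary extends BenchRow {
    perOpUs: number;  // microseconds per operation
    relative: number; // slowdown factor versus the fastest library
}

// Derives the "Per-op(μs)" and "Relative" columns from the totals.
function summarize(rows: BenchRow[], iterations: number): BenchSummary[] {
    const fastest = rows.reduce((min, r) => Math.min(min, r.totalMs), Infinity);
    return rows.map(r => ({
        library: r.library,
        totalMs: r.totalMs,
        perOpUs: (r.totalMs * 1000) / iterations,
        relative: r.totalMs / fastest
    }));
}

// Totals copied from the "Large HTML" table above (500 iterations).
const large = summarize([
    {library: 'cheerio', totalMs: 1701.3},
    {library: 'htmlparser2', totalMs: 162.9},
    {library: '@thednp/domparser', totalMs: 565.8}
], 500);

for (const row of large) {
    console.log(row.library, row.perOpUs.toFixed(1), row.relative.toFixed(2) + 'x');
}
```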

The cheerio dependency causes yarn hoisting issues for downstream consumers
that also use `juice` for email compatibility. Using htmlparser2, domutils,
and domhandler directly is much lighter and avoids the version constraints
that cause these conflicts.

coderabbitai bot commented Mar 14, 2026

Walkthrough

The changes replace the cheerio dependency with htmlparser2, domhandler, and domutils in the url-utils package. The implementation in html-transform.ts is rewritten to use htmlparser2's parseDocument for DOM parsing and domutils utilities for element traversal and attribute access, replacing cheerio's jQuery-like API. All four corresponding test files are updated to mock htmlparser2 instead of cheerio, preserving equivalent test coverage and assertions.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name | Status | Explanation
Title check | ✅ Passed | The title directly and accurately describes the main change: replacing cheerio with htmlparser2+domutils across the url-utils package.
Description check | ✅ Passed | The PR description clearly relates to the changeset, detailing the replacement of cheerio with htmlparser2+domutils, the motivation (yarn hoisting issues), and specific file changes.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/url-utils/src/utils/html-transform.ts`:
- Around line 88-95: The isInsideCode function incorrectly starts traversal at
el.parent so a <code> element itself isn't treated as protected; change the
traversal to begin at el (not el.parent) in isInsideCode so it checks the
element itself and then walks up parents, keeping the existing 'name' in node &&
node.name === 'code' check and returning true if found and false otherwise.
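
A minimal sketch of the parent walk this comment describes, assuming a simplified node shape. The `NodeLike` interface is a hypothetical stand-in that models only the fields the walk touches; the real helper in html-transform.ts operates on domhandler nodes and may differ. Starting at the element itself matches the behaviour of cheerio's `.closest('code')`, which also tests the element before its ancestors.

```typescript
// Hypothetical structural stand-in for a DOM node; only the fields used
// by the walk are modelled. Text nodes would simply omit `name`.
interface NodeLike {
    type: string;
    name?: string;
    parent: NodeLike | null;
}

// Returns true when `el` is a <code> element or has a <code> ancestor.
function isInsideCode(el: NodeLike): boolean {
    let node: NodeLike | null = el; // begin at el itself, not el.parent
    while (node) {
        if ('name' in node && node.name === 'code') {
            return true;
        }
        node = node.parent;
    }
    return false;
}

// Hand-built trees for illustration:
//   <pre><code><a></a></code></pre>   and   <p><a></a></p>
const pre: NodeLike = {type: 'tag', name: 'pre', parent: null};
const code: NodeLike = {type: 'tag', name: 'code', parent: pre};
const linkInCode: NodeLike = {type: 'tag', name: 'a', parent: code};
const para: NodeLike = {type: 'tag', name: 'p', parent: null};
const linkInPara: NodeLike = {type: 'tag', name: 'a', parent: para};

console.log(isInsideCode(code), isInsideCode(linkInCode), isInsideCode(linkInPara));
```

With the walk starting at `el.parent` instead, the `<code>` element in this tree would not be treated as protected, which is the case the comment flags.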

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fcf65f77-ef45-4005-89c5-d3ce9296f86a

📥 Commits

Reviewing files that changed from the base of the PR and between fb3ad19 and ab92f81.

⛔ Files ignored due to path filters (1)
  • yarn.lock is excluded by !**/yarn.lock, !**/*.lock
📒 Files selected for processing (6)
  • packages/url-utils/package.json
  • packages/url-utils/src/utils/html-transform.ts
  • packages/url-utils/test/unit/utils/html-absolute-to-relative.test.js
  • packages/url-utils/test/unit/utils/html-absolute-to-transform-ready.test.js
  • packages/url-utils/test/unit/utils/html-relative-to-absolute.test.js
  • packages/url-utils/test/unit/utils/html-relative-to-transform-ready.test.js

@kevinansfield kevinansfield requested a review from vershwal March 14, 2026 16:10
@kevinansfield (Member, Author)

@vershwal I ran into issues with the url-utils bump in Koenig after the direct cheerio dependency was bumped here. We'll have similar issues in Ghost too because it also uses juice. The problem is that juice depends on a very specific RC version of cheerio to avoid double-encoding issues. Because of the way yarn resolution and hoisting work, when another dependency specifies a later version of cheerio that satisfies the dependency range, yarn hoists it and juice starts using the incompatible version.

Dropping the direct cheerio usage here both solves that issue and offers a significant speedup 🚀

kevinansfield added a commit to kevinansfield/Ghost that referenced this pull request Mar 14, 2026
ref TryGhost/SDK#821
The direct cheerio 0.22 dependency causes yarn hoisting conflicts with
the cheerio ^1.0 that juice requires, leading to version resolution
issues across the monorepo. This replaces all cheerio usage with a thin
utility built on htmlparser2, css-select, domutils, and dom-serializer
— the same packages cheerio wraps — avoiding the conflict entirely.

Benchmark results (cheerio 0.22 → html-utils):

  Scenario                          Speedup
  OEmbed link discovery              2.3x
  Extract links (mentions)           5.8x
  Mention verification               6.8x
  Segment extraction                 5.9x
  Full email post-processing         2.1x
  Segment filtering + serialize      2.4x
  RSS feed generation               18.8x
kevinansfield added a commit to TryGhost/Ghost that referenced this pull request Mar 14, 2026
kevinansfield added a commit to TryGhost/Ghost that referenced this pull request Mar 14, 2026
kevinansfield added a commit to TryGhost/Ghost that referenced this pull request Mar 15, 2026
kevinansfield added a commit to TryGhost/Ghost that referenced this pull request Mar 15, 2026
kevinansfield added a commit to TryGhost/Ghost that referenced this pull request Mar 15, 2026

@vershwal vershwal left a comment

Nice one, thanks for this! Cheerio did feel like more than we needed here, so reducing the dependency weight and simplifying this path is spot on.
