High-performance HTML-to-plain-text conversion for .NET. Optimised for speed, low allocations, and predictable output.

pavlosmcg/Html2Text.Net

Html2Text.Net

Just fast HTML -> plain text.

Lightweight, hand-rolled, high-performance HTML to plain text conversion for .NET.

Usage

As simple as possible:

using Html2Text;

string html = "<h1>Hello</h1><p>World</p>";

string text = Html2Text.Convert(html);

// Hello
//
// World

How it works

Pipeline

HTML document -> Lexer (tokens) -> Parser (AST nodes) -> Renderer (string text)
  • Text nodes are emitted in document order.
  • Basic block separation is preserved (e.g., paragraphs/headings insert newlines).
  • Whitespace is normalized to produce readable plain text.
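
To make the stages above more concrete, here is a deliberately tiny sketch of the same idea. It is not the library's implementation: ToyPipeline, Token, TokenKind, and the BlockTags set are hypothetical names invented for this illustration, and the sketch skips the AST step entirely by rendering straight from tokens. It only shows text being emitted in document order, block tags introducing blank lines, and whitespace being collapsed.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical toy pipeline for illustration only; none of these
// types are part of Html2Text.Net.
public static class ToyPipeline
{
    public enum TokenKind { OpenTag, CloseTag, Text }

    public readonly record struct Token(TokenKind Kind, string Value);

    // Assumed set of block-level tags that separate paragraphs.
    private static readonly HashSet<string> BlockTags =
        new(StringComparer.OrdinalIgnoreCase) { "p", "div", "h1", "h2", "h3" };

    private static readonly char[] Whitespace = { ' ', '\t', '\r', '\n' };

    // Lexer: split the input into tag tokens and text tokens.
    public static List<Token> Lex(string html)
    {
        var tokens = new List<Token>();
        int i = 0;
        while (i < html.Length)
        {
            if (html[i] == '<')
            {
                int end = html.IndexOf('>', i);
                if (end < 0) break; // unterminated tag: stop lexing
                string inner = html.Substring(i + 1, end - i - 1).Trim();
                var kind = inner.StartsWith("/") ? TokenKind.CloseTag : TokenKind.OpenTag;
                tokens.Add(new Token(kind, TagName(inner)));
                i = end + 1;
            }
            else
            {
                int next = html.IndexOf('<', i);
                if (next < 0) next = html.Length;
                tokens.Add(new Token(TokenKind.Text, html.Substring(i, next - i)));
                i = next;
            }
        }
        return tokens;
    }

    // Keep only the tag name, dropping slashes and attributes.
    private static string TagName(string inner)
    {
        string[] parts = inner.Trim('/', ' ').Split(
            new[] { ' ', '\t', '\r', '\n', '/' }, StringSplitOptions.RemoveEmptyEntries);
        return parts.Length > 0 ? parts[0] : string.Empty;
    }

    // Renderer: emit text in document order, start a new block at every
    // block-level tag, and collapse runs of whitespace inside each block.
    public static string Render(IEnumerable<Token> tokens)
    {
        var blocks = new List<string>();
        var current = new StringBuilder();

        void Flush()
        {
            string[] words = current.ToString()
                .Split(Whitespace, StringSplitOptions.RemoveEmptyEntries);
            if (words.Length > 0) blocks.Add(string.Join(" ", words));
            current.Clear();
        }

        foreach (Token token in tokens)
        {
            if (token.Kind == TokenKind.Text) current.Append(token.Value);
            else if (BlockTags.Contains(token.Value)) Flush();
        }
        Flush();
        return string.Join("\n\n", blocks);
    }
}
```

Calling ToyPipeline.Render(ToyPipeline.Lex(html)) on the input from the Usage section yields the two words separated by a blank line. A real parser stage would build nodes between these two steps, which is what makes table and heading formatting practical.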

Minimal formatting is added to make the plain text output readable:

  • HTML tables are given cell separators (|) and horizontal lines (---) under column headers.
  • The <hr/> element adds a horizontal line of dashes (---).
  • The <title> element also gets a horizontal underline.

Formatting logic can be found in Html2Text/Rendering.

Goals

This project is focused on:

  • High performance: designed for low allocations and fast throughput.
  • Text extraction only: get the words from the page/document.
  • No dependencies: lightweight, not an embedded browser engine, and reliant on nothing other than .NET itself.

Non-goals (by design)

The following are intentionally out of scope so the library can excel at the goals above:

  • Respecting CSS, computed styles, display:none, or visibility.
  • Pixel-accurate layout, whitespace mirroring, or browser-equivalent rendering.
  • Executing JavaScript or loading remote resources.

Performance notes

High performance is a goal of this project. This library:

  • is designed for converting many documents quickly (batch processing, indexing, search pipelines).
  • avoids DOM dependencies.
  • uses a lightweight, hand-rolled lexer/parser/renderer pipeline.
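
As a rough illustration of the batch-processing use case, the sketch below converts every .html file in a directory to a .txt file. BatchConverter and ConvertDirectory are hypothetical names, not part of the library. The converter is passed in as a delegate so the sketch stays self-contained; in practice you would pass the Convert method shown in the Usage section.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; not part of Html2Text.Net.
public static class BatchConverter
{
    // Converts each *.html file in inputDir and writes the result to
    // outputDir under the same base name with a .txt extension.
    // 'convert' stands in for the library call from the Usage section.
    // Returns the number of files converted.
    public static int ConvertDirectory(
        string inputDir, string outputDir, Func<string, string> convert)
    {
        Directory.CreateDirectory(outputDir);
        int count = 0;

        foreach (string path in Directory.EnumerateFiles(inputDir, "*.html"))
        {
            string html = File.ReadAllText(path);
            string text = convert(html);

            string outPath = Path.Combine(
                outputDir, Path.GetFileNameWithoutExtension(path) + ".txt");
            File.WriteAllText(outPath, text);
            count++;
        }

        return count;
    }
}
```

Files are processed one at a time here to keep the example short. Whether the converter can safely be called from multiple threads is not stated in this README, so check before parallelising the loop.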

Benchmarks are in Html2Text.PerfTests and can be run locally with:

dotnet run -c Release --project Html2Text.PerfTests

Or check out the latest automated perf test results here: https://pavlosmcg.github.io/Html2Text.Net/dev/bench/

Install, build, test

Once the package is published to NuGet (coming soon!), you will be able to install it with:

dotnet add package Html2Text

Or, for now, download or submodule the repo and reference the project directly.

Build with:

dotnet build

Run unit tests and regression tests:

dotnet test

Regression tests

Each file in the Samples/ directory acts as an acceptance/regression test. The results of converting these HTML files to plain text are saved in Html2Text.RegressionTests/*.verified.txt:

Samples/<file-name>.html -> Html2Text.Convert(<file-contents>) -> <file-name>.verified.txt

For example: scottallen.html -> scottallen.verified.txt

Html2Text.RegressionTests uses Verify to make test assertions against verified output snapshots. If you need to update the outputs, please see the Verify docs for snapshot management.

Projects in this repository

  • Html2Text/: core library
  • Html2Text.Example/: small example app
  • Html2Text.Tests/: unit tests
  • Html2Text.RegressionTests/: regression/acceptance tests
  • Html2Text.PerfTests/: performance benchmarking console app
  • Samples/: sample HTML files used during development and automated regression testing

Target frameworks

  • .NET 8+

License

MPL-2.0; see LICENSE.txt.
