Just fast HTML -> plain text.
Lightweight, hand-rolled, high-performance HTML to plain text conversion for .NET.
Usage is as simple as possible:
using Html2Text;
string html = "<h1>Hello</h1><p>World</p>";
string text = Html2Text.Convert(html);
// Hello
//
// World
HTML document -> Lexer (tokens) -> Parser (AST nodes) -> Renderer (string text)
- Text nodes are emitted in document order.
- Basic block separation is preserved (e.g., paragraphs/headings insert newlines).
- Whitespace is normalized to produce readable plain text.
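To make the pipeline and the three behaviours above concrete, here is a toy sketch of a lexer feeding a renderer. It is not the library's implementation: `Token`, `TokenKind`, `ToyPipeline`, and the `BlockTags` set are all hypothetical names invented for this illustration, the parser/AST stage is skipped for brevity, and entities, comments, and scripts are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Usage: prints "Hello", a blank line, then "World".
Console.WriteLine(ToyPipeline.Render(ToyPipeline.Lex("<h1>Hello</h1><p>World</p>")));

// Hypothetical types for illustration only; not part of Html2Text.
enum TokenKind { Text, OpenTag, CloseTag }

record Token(TokenKind Kind, string Value);

static class ToyPipeline
{
    // Assumed set of block-level elements; a real converter would know many more.
    static readonly HashSet<string> BlockTags =
        new(StringComparer.OrdinalIgnoreCase) { "p", "div", "h1", "h2", "h3" };

    static readonly char[] Whitespace = { ' ', '\t', '\r', '\n' };

    // Lexer: split the input into tag tokens and text tokens, in document order.
    public static IEnumerable<Token> Lex(string html)
    {
        int i = 0;
        while (i < html.Length)
        {
            if (html[i] == '<')
            {
                int end = html.IndexOf('>', i);
                if (end < 0)
                {
                    // Unterminated tag: treat the remainder as text.
                    yield return new Token(TokenKind.Text, html[i..]);
                    yield break;
                }
                string inner = html[(i + 1)..end].Trim();
                bool closing = inner.StartsWith('/');
                string name = inner.TrimStart('/').Split(' ', 2)[0].TrimEnd('/');
                yield return new Token(closing ? TokenKind.CloseTag : TokenKind.OpenTag, name);
                i = end + 1;
            }
            else
            {
                int next = html.IndexOf('<', i);
                if (next < 0) next = html.Length;
                yield return new Token(TokenKind.Text, html[i..next]);
                i = next;
            }
        }
    }

    // Renderer: gather text per block, collapse whitespace runs,
    // and separate blocks with a blank line.
    public static string Render(IEnumerable<Token> tokens)
    {
        var blocks = new List<string>();
        var current = new StringBuilder();

        void Flush()
        {
            string[] words = current.ToString()
                .Split(Whitespace, StringSplitOptions.RemoveEmptyEntries);
            if (words.Length > 0) blocks.Add(string.Join(' ', words));
            current.Clear();
        }

        foreach (Token t in tokens)
        {
            if (t.Kind == TokenKind.Text) current.Append(t.Value);
            else if (BlockTags.Contains(t.Value)) Flush(); // block boundary on open or close
        }
        Flush();
        return string.Join("\n\n", blocks);
    }
}
```

Text is buffered per block before whitespace is collapsed so that spaces around inline tags (for example `a <b>b</b>`) survive as a single space.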
Minimal formatting is added to make the plain text output readable:
- HTML tables are given cell separators (|) and horizontal lines (---) under column headers.
- The <hr/> element adds a horizontal line of dashes (---).
- The <title> element also gets a horizontal underline.
Formatting logic can be found in Html2Text/Rendering.
This project is focused on:
- High performance: designed for low allocations and fast throughput.
- Text extraction only: get the words from the page/document.
- No dependencies: lightweight, not an embedded browser engine; nothing is required beyond .NET itself.
The following are intentionally out of scope so the library can excel at the goals above:
- Respecting CSS, computed styles, display:none, or visibility.
- Pixel-accurate layout, whitespace mirroring, or browser-equivalent rendering.
- Executing JavaScript or loading remote resources.
High performance is a goal of this project. This library:
- is designed for converting many documents quickly (batch processing, indexing, search pipelines).
- avoids DOM dependencies.
- uses a lightweight, hand-rolled lexer/parser/renderer pipeline.
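As a rough illustration of the batch-processing use case, the sketch below converts every .html file in a directory. `BatchConverter` and `ConvertAll` are hypothetical names, and the converter is passed in as a delegate so the snippet compiles without the library; with the library referenced, the delegate would presumably be `Html2Text.Convert` from the usage example.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; not part of Html2Text.
static class BatchConverter
{
    // Converts each *.html file in inputDir and writes <name>.txt into outputDir.
    // Returns the number of files converted.
    public static int ConvertAll(string inputDir, string outputDir, Func<string, string> convert)
    {
        Directory.CreateDirectory(outputDir);
        int count = 0;
        foreach (string path in Directory.EnumerateFiles(inputDir, "*.html"))
        {
            string text = convert(File.ReadAllText(path));
            string target = Path.Combine(
                outputDir, Path.GetFileNameWithoutExtension(path) + ".txt");
            File.WriteAllText(target, text);
            count++;
        }
        return count;
    }
}
```

`Directory.EnumerateFiles` streams file names lazily, so a large directory does not need to be listed in full before the first file is converted.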
Benchmarks are in Html2Text.PerfTests and can be run locally with:
dotnet run -c Release --project Html2Text.PerfTests
Or check out the latest automated perf test results here: https://pavlosmcg.github.io/Html2Text.Net/dev/bench/
Once I've published the package to NuGet (coming soon!), you will be able to install it with:
dotnet add package Html2Text
Until then, download the repo or add it as a git submodule, and reference the project directly.
Build with:
dotnet build
Run unit tests and regression tests:
dotnet test
Each file in the Samples/ directory acts as an acceptance/regression test. The results of converting these HTML files to plain text are saved in Html2Text.RegressionTests/*.verified.txt:
Samples/<file-name>.html -> Html2Text.Convert(<file-contents>) -> <file-name>.verified.txt
For example, scottallen.html -> scottallen.verified.txt
Html2Text.RegressionTests uses Verify to make test assertions against verified output snapshots. If you need to update the outputs, see the Verify docs for snapshot management.
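The idea behind the snapshot flow can be sketched by hand as follows. This is not how Verify works internally and not the project's test code: `SnapshotCheck` is a hypothetical helper, the converter is passed in as a delegate to keep the snippet self-contained, and the line-ending normalization is an assumption made for the illustration.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; the real tests use Verify instead.
static class SnapshotCheck
{
    // Returns true when converting the sample reproduces its stored snapshot,
    // i.e. <verifiedDir>/<file-name>.verified.txt.
    public static bool Matches(string samplePath, string verifiedDir, Func<string, string> convert)
    {
        string name = Path.GetFileNameWithoutExtension(samplePath);
        string verifiedPath = Path.Combine(verifiedDir, name + ".verified.txt");
        string actual = Normalize(convert(File.ReadAllText(samplePath)));
        string expected = Normalize(File.ReadAllText(verifiedPath));
        return string.Equals(actual, expected, StringComparison.Ordinal);
    }

    // Normalize line endings so a checkout on another OS does not cause false failures.
    static string Normalize(string s) => s.Replace("\r\n", "\n");
}
```

A mismatch means either a regression or an intended change in output; in the second case the stored snapshot is what gets updated.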
- Html2Text/: core library
- Html2Text.Example/: small example app
- Html2Text.Tests/: unit tests
- Html2Text.RegressionTests/: regression/acceptance tests
- Html2Text.PerfTests/: performance benchmarking console app
- Samples/: sample HTML files used during development and automated regression testing
- .NET 8+
MPL-2.0; see LICENSE.txt