# marklift

URL → Clean Markdown: fetch a webpage, extract the main content, and convert it to LLM-friendly Markdown. Built for agents and pipelines.
- Fetches HTTP(S) URLs with configurable timeout and headers
- Source types: website, twitter (Nitter), reddit — inferred from URL when not specified. Medium adapter is removed for now.
- Extracts article content with Mozilla Readability (or raw body)
- Converts to Markdown with Turndown and custom rules
- Optimizes for agents: normalizes spacing, dedupes links, strips tracking params, optional chunking
- Typed API and CLI
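For a rough picture of the link-optimization step above, here is a simplified sketch of tracking-param stripping, deduplication, and sorting using the WHATWG `URL` API. `cleanLinks` and the exact parameter list are illustrative assumptions, not part of marklift's API:

```typescript
// Simplified sketch of link optimization: strip common tracking
// parameters, deduplicate, and sort. Illustrative only; marklift's
// actual rules may differ.
const TRACKING = new Set([
  "utm_source", "utm_medium", "utm_campaign",
  "utm_term", "utm_content", "fbclid", "gclid",
]);

function cleanLinks(links: string[]): string[] {
  const seen = new Set<string>();
  for (const raw of links) {
    try {
      const url = new URL(raw);
      for (const key of [...url.searchParams.keys()]) {
        if (TRACKING.has(key)) url.searchParams.delete(key);
      }
      seen.add(url.toString());
    } catch {
      // skip strings that are not valid absolute URLs
    }
  }
  return [...seen].sort();
}
```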
Requirements: Node.js 18+

Install:

```sh
npm install marklift
```

Usage:

```ts
import { urlToMarkdown } from "marklift";

// source is inferred from URL when omitted (twitter/x.com → twitter, reddit → reddit, else website)
const result = await urlToMarkdown("https://example.com/article", {
  timeout: 10_000,
});
const tweet = await urlToMarkdown("https://x.com/user/status/123"); // uses twitter adapter

console.log(result.title);
console.log(result.markdown);
console.log(result.wordCount, result.sections.length, result.links.length);
```

Install globally to get the `marklift` command:
```sh
npm install -g marklift

# Convert a URL to Markdown (prints to stdout). Source is inferred from URL.
marklift https://example.com
marklift https://x.com/user/status/123   # uses twitter adapter
marklift https://reddit.com/r/...        # uses reddit adapter

# Output full result as JSON
marklift https://example.com --json

# Options
marklift https://example.com --timeout 15000
marklift https://example.com --chunk-size 2000
marklift https://example.com --source website   # override inferred source
```

CLI options:
| Option | Description |
|---|---|
| `--source <website\|twitter\|reddit>` | Source adapter (default: inferred from URL). Override when needed. |
| `--timeout <ms>` | Request timeout in milliseconds (default: 15000) |
| `--chunk-size <n>` | Split markdown into chunks of ~n characters |
| `--json` | Output full result as JSON instead of markdown |
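To make the `--chunk-size` behavior concrete, here is a naive character-budget chunker. This is an illustration only: marklift's real chunker additionally avoids splitting inside code blocks and tables, which this sketch does not do.

```typescript
interface Chunk { content: string; index: number; total: number }

// Naive chunker: greedily packs paragraphs into pieces of roughly
// `size` characters. Illustrative; not marklift's implementation.
function chunkMarkdown(markdown: string, size: number): Chunk[] {
  const paras = markdown.split(/\n{2,}/);
  const pieces: string[] = [];
  let current = "";
  for (const p of paras) {
    if (current && current.length + p.length + 2 > size) {
      pieces.push(current);
      current = p;
    } else {
      current = current ? `${current}\n\n${p}` : p;
    }
  }
  if (current) pieces.push(current);
  return pieces.map((content, index) => ({ content, index, total: pieces.length }));
}
```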
`urlToMarkdown(url, options?)` converts a URL to clean Markdown and returns a `Promise<MarkdownResult>`.
Options:
| Option | Type | Description |
|---|---|---|
| `source` | `"website" \| "twitter" \| "reddit"` | Source adapter. Default: inferred from URL (twitter.com/x.com/nitter → twitter, reddit.com → reddit, else website). Override to force a specific adapter. |
| `timeout` | `number` | Request timeout in ms (default: 15000) |
| `headers` | `Record<string, string>` | Custom HTTP headers (e.g. `User-Agent`) |
| `chunkSize` | `number` | If set, `result.chunks` will contain token-safe chunks |
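The default inference described in the `source` row can be pictured roughly like this. `inferSource` is a hypothetical helper for illustration, not an exported function, and marklift's actual host-matching rules may differ:

```typescript
type Source = "website" | "twitter" | "reddit";

// Rough sketch of source inference: twitter.com / x.com / nitter
// hosts map to "twitter", reddit.com to "reddit", everything else
// to "website".
function inferSource(url: string): Source {
  const host = new URL(url).hostname.replace(/^www\./, "");
  if (/(^|\.)(twitter\.com|x\.com)$/.test(host) || host.includes("nitter")) return "twitter";
  if (/(^|\.)reddit\.com$/.test(host)) return "reddit";
  return "website";
}
```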
Result (`MarkdownResult`):

- `url` — Original URL
- `title` — Page title
- `description` — Meta description (if present)
- `markdown` — Full markdown with source-specific frontmatter (see below) + body
- `sections` — `{ heading, content }[]` by heading (stable order)
- `links` — Deduplicated links, sorted (tracking params stripped)
- `wordCount` — Approximate word count
- `contentHash` — SHA-256 of optimized markdown (stability checks)
- `metadata?` — Structured metadata (OG, canonical, author, publishedAt, image, language)
- `chunks?` — When `chunkSize` is set: `{ content, index, total }[]` (no split inside code blocks or tables)
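Based on the fields above, the result shape can be sketched as the following TypeScript interface. This is an inferred illustration, not the package's exact type declarations:

```typescript
interface Section { heading: string; content: string }
interface Chunk { content: string; index: number; total: number }

// Illustrative shape of MarkdownResult, inferred from the field
// list above; optional fields mirror the "?" markers.
interface MarkdownResult {
  url: string;
  title: string;
  description?: string;
  markdown: string;
  sections: Section[];
  links: string[];
  wordCount: number;
  contentHash: string;
  metadata?: {
    canonical?: string;
    author?: string;
    publishedAt?: string;
    image?: string;
    language?: string;
  };
  chunks?: Chunk[];
}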
`urlToMarkdownStream(url, options?)` is an async generator that yields `MarkdownChunk` objects (meta, sections, links) as they are produced. Useful for streaming into an LLM or pipeline.
Each adapter outputs markdown with a frontmatter block (`--- … ---`) followed by the body.

Website (and reddit), format type `website` (Medium is not currently supported):

```md
---
source: https://example.com/article
canonical: https://example.com/article
title: Example Article Title
description: Short meta description
author: John Doe
published_at: 2025-01-12
language: en
content_hash: <sha256>
word_count: 1243
---

# Title

Body content…
```

Twitter:
```md
---
platform: twitter
source: https://twitter.com/username/status/1234567890
tweet_id: 1234567890
author:
  name: Author Name
published_at: 2025-01-10T18:22:00Z
language: en
content_hash: <sha256>
---

Body content…
```

Errors:

- `InvalidUrlError` — Invalid or non-HTTP(S) URL
- `FetchError` — Network error, timeout, or non-2xx response
- `ParseError` — Readability or parsing failure
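The `content_hash` field in the frontmatter examples above (and `contentHash` on the result) is described as a SHA-256 of the optimized markdown. Computing such a digest is straightforward with Node's `crypto` module; note that marklift's exact normalization before hashing may differ, so this only shows the hashing step:

```typescript
import { createHash } from "node:crypto";

// SHA-256 hex digest of a markdown string, in the spirit of
// content_hash / contentHash. Input normalization is marklift's
// concern; this just shows the hashing step.
function contentHash(markdown: string): string {
  return createHash("sha256").update(markdown, "utf8").digest("hex");
}
```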
Production: Website and reddit adapters use a browser-like `User-Agent` by default so requests from servers/datacenters get full HTML. The Twitter adapter keeps the Marklift `User-Agent` so Nitter works. Override via the `headers` option if needed.
```ts
import { urlToMarkdown, urlToMarkdownStream } from "marklift";

// One-shot (source inferred from URL)
const result = await urlToMarkdown("https://blog.example.com/post", {
  timeout: 10_000,
  chunkSize: 2000,
});
console.log(result.title, result.wordCount);
if (result.chunks) {
  for (const chunk of result.chunks) {
    // Send chunk to LLM, etc.
  }
}

// Streaming
for await (const chunk of urlToMarkdownStream(
  "https://blog.example.com/post"
)) {
  process.stdout.write(chunk.content);
}
```

Tests:

```sh
npm test          # unit + E2E (E2E needs network)
npm run test:unit # unit only (no network)
npm run test:e2e  # E2E with real URLs only
```

Set `SKIP_E2E=1` to skip E2E tests (e.g. in CI without network).
Contributions are welcome. See CONTRIBUTING.md for setup, code style, and how to submit changes.
