
@d0paminedriven/pdfdown


Rust-powered PDF extraction for Node.js via napi-rs. Extracts per-page text, images (as PNG), annotations (links, destinations), structured text (with header/footer detection), and metadata from PDF buffers. All extraction is parallelized with rayon for multi-core performance.

An OCR variant is available as @d0paminedriven/pdfdown-ocr for image-only PDF pages (scanned documents, embedded screenshots). It includes all of the APIs below plus a Tesseract-based OCR fallback.

Install

npm install @d0paminedriven/pdfdown
# or
yarn add @d0paminedriven/pdfdown
# or
pnpm add @d0paminedriven/pdfdown

API

Standalone functions

Synchronous

export declare function extractTextPerPage(buffer: Buffer): Array<PageText>
export declare function extractImagesPerPage(buffer: Buffer): Array<PageImage>
export declare function extractAnnotationsPerPage(buffer: Buffer): Array<PageAnnotation>
export declare function extractStructuredTextPerPage(buffer: Buffer): Array<StructuredPageText>
export declare function pdfMetadata(buffer: Buffer): PdfMeta
export declare function pdfDocument(buffer: Buffer): PdfDocument

Async (libuv thread pool)

Each sync function has an async counterpart that runs on the libuv thread pool, keeping the main thread free.

export declare function extractTextPerPageAsync(buffer: Buffer): Promise<Array<PageText>>
export declare function extractImagesPerPageAsync(buffer: Buffer): Promise<Array<PageImage>>
export declare function extractAnnotationsPerPageAsync(buffer: Buffer): Promise<Array<PageAnnotation>>
export declare function extractStructuredTextPerPageAsync(buffer: Buffer): Promise<Array<StructuredPageText>>
export declare function pdfMetadataAsync(buffer: Buffer): Promise<PdfMeta>
export declare function pdfDocumentAsync(buffer: Buffer): Promise<PdfDocument>

PdfDown class

Parse once, call many. The constructor parses the PDF document once and reuses it across all sync method calls. Async methods share the parsed document across worker threads via Arc (atomic reference counting) with zero re-parsing overhead.

export declare class PdfDown {
  constructor(buffer: Buffer)
  textPerPage(): Array<PageText>
  imagesPerPage(): Array<PageImage>
  annotationsPerPage(): Array<PageAnnotation>
  structuredText(): Array<StructuredPageText>
  metadata(): PdfMeta
  document(): PdfDocument
  textPerPageAsync(): Promise<Array<PageText>>
  imagesPerPageAsync(): Promise<Array<PageImage>>
  annotationsPerPageAsync(): Promise<Array<PageAnnotation>>
  structuredTextAsync(): Promise<Array<StructuredPageText>>
  metadataAsync(): Promise<PdfMeta>
  documentAsync(): Promise<PdfDocument>
}

Types

export interface PageText {
  page: number
  text: string
}

export interface StructuredPageText {
  page: number
  header: string
  body: string
  footer: string
}

export interface PageImage {
  page: number
  imageIndex: number
  width: number
  height: number
  data: Buffer // PNG-encoded bytes
  colorSpace: string
  bitsPerComponent: number
  filter: string
  xobjectName: string
  objectId: string
}

export interface PageAnnotation {
  page: number
  subtype: string // "Link", "Text", "Highlight", etc.
  rect: Array<number> // [x1, y1, x2, y2] bounding box
  uri?: string // external link URL
  dest?: string // internal named destination
  content?: string // tooltip / alt text
}

export const enum BoxType {
  CropBox = 'CropBox',
  MediaBox = 'MediaBox',
  Unknown = 'Unknown',
}

export interface PageBox {
  pageCount: number // number of pages with these dimensions
  left: number
  bottom: number
  right: number
  top: number
  width: number
  height: number
  boxType: BoxType
  pages?: Array<number> // only present on non-dominant entries
}

export interface PdfMeta {
  pageCount: number
  version: string
  isLinearized: boolean
  creator?: string
  producer?: string
  creationDate?: string
  modificationDate?: string
  pageBoxes: Array<PageBox>
}

export interface PdfDocument {
  version: string
  isLinearized: boolean
  pageCount: number
  creator?: string
  producer?: string
  creationDate?: string
  modificationDate?: string
  pageBoxes: Array<PageBox>
  totalImages: number
  totalAnnotations: number
  imagePages: Array<number>
  annotationPages: Array<number>
  text: Array<PageText>
  structuredText: Array<StructuredPageText>
  images: Array<PageImage>
  annotations: Array<PageAnnotation>
}

Usage

Standalone functions

Extract text per page

import { readFileSync } from 'fs'
import { extractTextPerPage } from '@d0paminedriven/pdfdown'

const pdf = readFileSync('document.pdf')
const pages = extractTextPerPage(pdf)

for (const { page, text } of pages) {
  console.log(`Page ${page}: ${text.slice(0, 100)}...`)
}

Extract text per page (async)

import { readFile } from 'fs/promises'
import { extractTextPerPageAsync } from '@d0paminedriven/pdfdown'

const pdf = await readFile('document.pdf')
const pages = await extractTextPerPageAsync(pdf)

for (const { page, text } of pages) {
  console.log(`Page ${page}: ${text.slice(0, 100)}...`)
}

Extract structured text with header/footer detection

Splits each page into header, body, and footer sections by detecting repeated lines across pages. Lines that appear at the same position (top or bottom) on >= 60% of pages are classified as headers or footers. Page numbers and other varying digits are normalized during comparison, so "Page 1" and "Page 42" are treated as the same line.

import { readFileSync } from 'fs'
import { extractStructuredTextPerPage } from '@d0paminedriven/pdfdown'

const pdf = readFileSync('document.pdf')
const pages = extractStructuredTextPerPage(pdf)

for (const { page, header, body, footer } of pages) {
  console.log(`Page ${page}:`)
  if (header) console.log(`  Header: ${header}`)
  console.log(`  Body: ${body.slice(0, 100)}...`)
  if (footer) console.log(`  Footer: ${footer}`)
}

Extract structured text (async)

import { readFile } from 'fs/promises'
import { extractStructuredTextPerPageAsync } from '@d0paminedriven/pdfdown'

const pdf = await readFile('document.pdf')
const pages = await extractStructuredTextPerPageAsync(pdf)

// Filter out headers/footers for clean text extraction
const cleanText = pages.map((p) => p.body).join('\n\n')

Extract images as PNG

import { readFileSync } from 'fs'
import { extractImagesPerPage } from '@d0paminedriven/pdfdown'

const pdf = readFileSync('document.pdf')
const images = extractImagesPerPage(pdf)

for (const img of images) {
  // PNG bytes are ready for data URLs, <img> tags, or vision model inputs
  const dataUrl = `data:image/png;base64,${img.data.toString('base64')}`
  console.log(`Page ${img.page} image ${img.imageIndex}: ${img.width}x${img.height} ${img.colorSpace}`)
}

Extract images as PNG (async)

import { readFile } from 'fs/promises'
import { extractImagesPerPageAsync } from '@d0paminedriven/pdfdown'

const pdf = await readFile('document.pdf')
const images = await extractImagesPerPageAsync(pdf)

for (const img of images) {
  // PNG bytes are ready for data URLs, <img> tags, or vision model inputs
  const dataUrl = `data:image/png;base64,${img.data.toString('base64')}`
  console.log(`Page ${img.page} image ${img.imageIndex}: ${img.width}x${img.height} ${img.colorSpace}`)
}

Extract annotations

import { readFileSync } from 'fs'
import { extractAnnotationsPerPage } from '@d0paminedriven/pdfdown'

const pdf = readFileSync('document.pdf')
const annots = extractAnnotationsPerPage(pdf)

for (const annot of annots) {
  if (annot.uri) {
    console.log(`Page ${annot.page}: ${annot.uri}`)
  }
  if (annot.dest) {
    console.log(`Page ${annot.page}: internal link to ${annot.dest}`)
  }
}

Extract annotations (async)

import { readFile } from 'fs/promises'
import { extractAnnotationsPerPageAsync } from '@d0paminedriven/pdfdown'

const pdf = await readFile('document.pdf')
const annots = await extractAnnotationsPerPageAsync(pdf)

const links = annots.filter((a) => a.uri)
console.log(`Found ${links.length} external links across ${new Set(links.map((a) => a.page)).size} pages`)

Get PDF metadata

import { readFileSync } from 'fs'
import { pdfMetadata } from '@d0paminedriven/pdfdown'

const pdf = readFileSync('document.pdf')
const meta = pdfMetadata(pdf)

console.log(`v${meta.version}, ${meta.pageCount} pages, linearized: ${meta.isLinearized}`)

Get PDF metadata (async)

import { readFile } from 'fs/promises'
import { pdfMetadataAsync } from '@d0paminedriven/pdfdown'

const pdf = await readFile('document.pdf')
const meta = await pdfMetadataAsync(pdf)

console.log(`v${meta.version}, ${meta.pageCount} pages, linearized: ${meta.isLinearized}`)

Page bounding boxes

pageBoxes on PdfMeta and PdfDocument returns deduplicated page dimensions. Uniform PDFs (all pages the same size) return a single entry. Mixed-size PDFs return one entry per distinct geometry: the dominant (most frequent) entry omits the pages field, while non-dominant entries list their specific page numbers in pages.

import { readFileSync } from 'fs'
import { pdfMetadata } from '@d0paminedriven/pdfdown'

const meta = pdfMetadata(readFileSync('document.pdf'))

if (meta.pageBoxes.length === 1) {
  const box = meta.pageBoxes[0]
  console.log(`Uniform: ${box.width}x${box.height} ${box.boxType} (${box.pageCount} pages)`)
} else {
  for (const box of meta.pageBoxes) {
    const scope = box.pages ? `pages ${box.pages.join(', ')}` : 'all other pages'
    console.log(`${box.width}x${box.height} ${box.boxType} (${scope})`)
  }
}

PdfDown class

The class-based API parses the PDF once in the constructor. Sync methods reuse the parsed document directly (zero re-parsing). Async methods share the parsed document across libuv worker threads via Arc — no data copying, no re-parsing.

import { readFile } from 'fs/promises'
import { PdfDown } from '@d0paminedriven/pdfdown'

const pdf = new PdfDown(await readFile('document.pdf'))

// Sync — reuses the already-parsed document
const text = pdf.textPerPage()
const images = pdf.imagesPerPage()
const annots = pdf.annotationsPerPage()
const structured = pdf.structuredText()
const meta = pdf.metadata()

// Async — shares parsed document via Arc across worker threads
const [asyncText, asyncImages, asyncAnnots, asyncStructured] = await Promise.all([
  pdf.textPerPageAsync(),
  pdf.imagesPerPageAsync(),
  pdf.annotationsPerPageAsync(),
  pdf.structuredTextAsync(),
])

Combined: text + images + links for multimodal embeddings

import { readFile } from 'fs/promises'
import { PdfDown } from '@d0paminedriven/pdfdown'

const pdf = new PdfDown(await readFile('document.pdf'))

const [text, images, annots] = await Promise.all([
  pdf.textPerPageAsync(),
  pdf.imagesPerPageAsync(),
  pdf.annotationsPerPageAsync(),
])

// Group images and links by page (Map.groupBy requires Node.js 21+)
const imagesByPage = Map.groupBy(images, (img) => img.page)
const linksByPage = Map.groupBy(
  annots.filter((a) => a.uri),
  (a) => a.page,
)

// Build per-page payloads for multimodal embeddings
for (const { page, text: pageText } of text) {
  const payload = {
    page,
    text: pageText,
    images: (imagesByPage.get(page) ?? []).map((img) => ({
      dataUrl: `data:image/png;base64,${img.data.toString('base64')}`,
      width: img.width,
      height: img.height,
    })),
    links: (linksByPage.get(page) ?? []).map((a) => a.uri),
  }
  // Send payload to Voyage AI, pgvector, etc.
}

Combined: structured text for clean content pipelines

import { readFile } from 'fs/promises'
import { PdfDown } from '@d0paminedriven/pdfdown'

const pdf = new PdfDown(await readFile('document.pdf'))
const pages = await pdf.structuredTextAsync()

// Strip headers/footers for clean RAG ingestion
const chunks = pages.map(({ page, body }) => ({
  page,
  content: body.trim(),
})).filter((c) => c.content.length > 0)

Parallelism

All extraction functions use rayon for automatic multi-core parallelism. Per-page text extraction, image decoding, and annotation parsing run concurrently across CPU cores. The pdfDocument / document() call additionally runs text, image, and annotation extraction concurrently via nested rayon::join. This is internal to the native addon — the Node.js API surface is unchanged.

In the OCR variant (@d0paminedriven/pdfdown-ocr), OCR runs on a dedicated capped thread pool (default 4 threads, configurable via maxThreads) to prevent oversubscription when running inside a libuv worker thread. Non-OCR extraction uses the global rayon pool.

Tuning Node.js worker threads

Node's libuv thread pool defaults to 4 threads. If you're calling multiple async extraction functions concurrently, you may want to increase it:

UV_THREADPOOL_SIZE=8 node app.js
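
One rough sizing sketch (the concurrentCalls figure and the core cap are assumptions, not package requirements): match the pool to the number of async extraction calls you expect to run at once, bounded by the cores actually available.

```typescript
import { availableParallelism } from 'os'

// Assumed workload: up to 6 async extraction calls in flight at once.
const concurrentCalls = 6

// Suggested pool size: workload capped by core count, never below 1.
const poolSize = Math.max(1, Math.min(concurrentCalls, availableParallelism()))
console.log(`suggested UV_THREADPOOL_SIZE=${poolSize}`)
```

Note that UV_THREADPOOL_SIZE must be set in the environment before the Node process starts; it cannot be changed once the pool is in use.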

Supported formats

Filter       Description                  Handling
DCTDecode    JPEG-compressed images       Decoded and re-encoded as PNG
JPXDecode    JPEG 2000 images             Decoded and re-encoded as PNG
FlateDecode  Zlib-compressed raw pixels   Decompressed, reconstructed as PNG
None         Uncompressed raw pixels      Reconstructed as PNG

ColorSpace   Channels
DeviceRGB    3
DeviceGray   1
DeviceCMYK   4 (converted to RGB)
ICCBased     Inferred from /N parameter

8-bit and 16-bit BitsPerComponent are both supported (16-bit downscaled to 8-bit for PNG output).
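
The 16-to-8-bit downscale can be pictured as keeping the high byte of each sample. A sketch of the idea (an assumption about the approach, not the crate's exact code):

```typescript
// Assumed downscale: keep the high byte of each 16-bit sample,
// which maps the full 0..0xffff range linearly onto 0..255.
const to8Bit = (sample16: number) => sample16 >> 8
```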

How it works

Built with lopdf (pure Rust PDF parser), image (PNG/JPEG encoding), hayro-jpeg2000 (JPEG 2000 decoding), and rayon (data parallelism). Compiled to a native Node.js addon via napi-rs with prebuilt binaries for:

  • macOS (x64, ARM64)
  • Windows (x64, ia32, ARM64)
  • Linux glibc (x64, ARM64, ARMv7)
  • Linux musl (x64, ARM64)
  • FreeBSD (x64)
  • Android (ARM64, ARMv7)
  • WASI

License

MIT