
Integrate Modal-hosted Marker and add env-based switch to offload layout extraction from local RQ #157

Closed
priyankeshh wants to merge 8 commits into develop from marker-deploy

@priyankeshh
Contributor

Summary

This PR adds a clean integration path to run Marker on Modal instead of locally via RQ. It introduces an environment-based switch (MARKER_RUN_MODE) and an HTTP client that forwards PDFs to a Modal-hosted Marker service. This lets us use proper GPU/compute and avoid HF Space resource limits. Layout correctness improvements can be iterated later; this PR focuses on wiring and compute offload.

Key changes

  • ocr_jobs.py
    • Loads .env early to pick up configuration.
    • Adds a modal execution branch that calls the Modal endpoint for output_format=json and parses the response.
    • Logs the selected run mode and endpoint for visibility.
  • extralit_server/integrations/marker_modal_client.py
    • New lightweight client to POST PDFs to Modal /convert, with timeout and basic error handling.
    • Reads MARKER_MODAL_BASE_URL and optional MARKER_MODAL_TIMEOUT_SECS from env.
  • .env.example
    • Documents MARKER_RUN_MODE=modal and MARKER_MODAL_BASE_URL required to enable Modal integration.
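A minimal sketch of what such a client could look like, using only the standard library. This is not the PR's actual `marker_modal_client.py`: the names `MarkerModalError`, `load_modal_config`, and `convert_pdf`, the 300-second default timeout, the raw `application/pdf` request body, and passing `output_format` as a query parameter are all assumptions made for illustration. Only the env variable names and the `/convert` path come from the description above.

```python
import json
import os
import urllib.error
import urllib.request


class MarkerModalError(RuntimeError):
    """Hypothetical error type for any failure talking to the Modal service."""


def load_modal_config(env=os.environ):
    """Read the base URL and timeout from the environment."""
    base_url = env.get("MARKER_MODAL_BASE_URL", "").rstrip("/")
    if not base_url:
        raise MarkerModalError("MARKER_MODAL_BASE_URL is not set")
    # The 300 s fallback is an assumption, not a value taken from the PR.
    timeout = float(env.get("MARKER_MODAL_TIMEOUT_SECS", "300"))
    return base_url, timeout


def convert_pdf(pdf_bytes, env=os.environ, urlopen=urllib.request.urlopen):
    """POST a PDF to <base_url>/convert and return the decoded JSON body."""
    base_url, timeout = load_modal_config(env)
    request = urllib.request.Request(
        f"{base_url}/convert?output_format=json",  # query parameter is assumed
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf"},  # body encoding is assumed
        method="POST",
    )
    try:
        with urlopen(request, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as exc:  # must precede URLError (its parent)
        raise MarkerModalError(f"Marker service returned HTTP {exc.code}") from exc
    except urllib.error.URLError as exc:
        raise MarkerModalError(f"Could not reach Marker service: {exc.reason}") from exc
    except TimeoutError as exc:
        raise MarkerModalError("Marker service timed out") from exc
    except ValueError as exc:  # covers invalid JSON and bad text encoding
        raise MarkerModalError("Marker service returned invalid JSON") from exc
```

Passing `env` and `urlopen` as parameters keeps the function testable without a live Modal deployment; a real client may well use multipart uploads or an HTTP library instead.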

Configuration

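  • Set the following variables (local .env for dev; real environment in the prod/worker runtime): MARKER_RUN_MODE=modal, MARKER_MODAL_BASE_URL, and optionally MARKER_MODAL_TIMEOUT_SECS.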
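As a rough illustration of how an env-based switch like this might be resolved at startup, the sketch below validates `MARKER_RUN_MODE` before any job runs. The function name `resolve_run_mode` and the mode name `"local"` for the default RQ path are assumptions; the PR description only names the `modal` value.

```python
import os

# "local" is an assumed name for the existing RQ path; only "modal" is
# named in the PR description.
VALID_RUN_MODES = ("local", "modal")


def resolve_run_mode(env=os.environ):
    """Return the selected run mode, failing fast on bad configuration."""
    mode = env.get("MARKER_RUN_MODE", "local").strip().lower()
    if mode not in VALID_RUN_MODES:
        raise ValueError(
            f"MARKER_RUN_MODE must be one of {VALID_RUN_MODES}, got {mode!r}"
        )
    if mode == "modal" and not env.get("MARKER_MODAL_BASE_URL"):
        raise ValueError("MARKER_RUN_MODE=modal requires MARKER_MODAL_BASE_URL")
    return mode
```

Failing at configuration time rather than on the first PDF makes a misconfigured worker visible immediately instead of after a job has been queued.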

How to test

  1. Ensure a Modal Marker service is deployed and healthy
  2. Local CLI test (bypassing RQ)
  • In repo root (so .env is picked up):
    • python extralit-server/src/extralit_server/jobs/ocr_jobs.py "/path/to/sample.pdf" --extract-text
  • Expected logs:
  • Expected output: JSON with tables, figures, text_blocks, metadata.

Checklist

  • Env-based switch to Modal
  • New Modal client and wiring in ocr_jobs
  • .env.example updated
  • Local CLI test path documented
  • Optional: add Modal auth header support
  • Optional: retries/backoff and timeouts tuning
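The optional retries/backoff item could be approached with a small generic helper like the one below. This is only a sketch of exponential backoff; the name `call_with_retries`, its parameters, and the default values are hypothetical and not part of the PR.

```python
import time


def call_with_retries(func, attempts=3, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying on the given exceptions with exponential backoff.

    Delays between attempts are base_delay, 2 * base_delay, 4 * base_delay, ...
    The last failure is re-raised unchanged once attempts are exhausted.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

In practice `retry_on` would be narrowed to transient failures such as timeouts or connection errors, since retrying on a rejected request only adds latency.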

Comment on lines 174 to 216
Member

Hey @priyankeshh, this code handles nested dicts for both function arguments and return types, so it would be better to define our own data types with pydantic.BaseModel or import an existing type from marker. We want to avoid this kind of nested-dict parsing because it is hard to maintain without type hints. Can you define the models in extralit-server/src/extralit_server/api/schemas/v1/document/layout.py?

You can see here for an example: https://github.com/Extralit/extralit/blob/e919e0453c808c89a4e7bfa331f1542fde5c2674/extralit-server/src/extralit_server/api/schemas/v1/document/metadata.py
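To make the suggestion concrete, typed models for the layout response might look roughly like the sketch below. The top-level field names `tables`, `figures`, `text_blocks`, and `metadata` come from the expected output in the PR description; the class names, the per-block fields `page`, `bbox`, and `text`, and the helper `parse_layout` are assumptions for illustration and not the contents of the real `layout.py` or of marker's own types.

```python
from typing import Any, Dict, List

from pydantic import BaseModel, Field


class LayoutBlock(BaseModel):
    """One detected element on a page (block fields are assumed)."""

    page: int
    bbox: List[float] = Field(default_factory=list)
    text: str = ""


class LayoutResult(BaseModel):
    """Typed view of the JSON returned by the layout extraction step."""

    tables: List[LayoutBlock] = Field(default_factory=list)
    figures: List[LayoutBlock] = Field(default_factory=list)
    text_blocks: List[LayoutBlock] = Field(default_factory=list)
    metadata: Dict[str, Any] = Field(default_factory=dict)


def parse_layout(payload: Dict[str, Any]) -> LayoutResult:
    """Validate a raw response dict once, at the boundary."""
    return LayoutResult(**payload)
```

With this shape, downstream code reads `result.tables[0].bbox` instead of chaining `dict.get` calls, and a malformed response fails validation at the point where it enters the system.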

Contributor Author

Refactored the code to use Pydantic models for layout parsing and defined them in layout.py, as suggested.

JonnyTran and others added 4 commits September 29, 2025 23:18
… OCR settings and client

- Deleted the .env.example file as it is no longer needed.
- Added new layout.py for PDF OCR settings using Pydantic.
- Created marker_client.py to handle interactions with the Modal-hosted Marker service.
- Updated ocr_jobs.py to import the new Modal client for document conversion.
@JonnyTran JonnyTran marked this pull request as ready for review October 7, 2025 05:11
@JonnyTran JonnyTran requested a review from a team as a code owner October 7, 2025 05:11
@JonnyTran
Member

Saving this for later: deciding how to handle the external Modal call is deferred because of its complexity.

@JonnyTran JonnyTran closed this Jan 17, 2026