
Translated from Spanish by 🌐💬 Aphra.

Announcement - #1

Published Apr 26, 2025

A few weeks ago, I was creating a chatbot demo that would respond to potential customers using information from a company's website. I extracted all the text from it and added it to a RAG index, to enable the retrieval of relevant information for any request to the chatbot.

Everything was going well until I came across a series of PDF catalogs on the website. What now? Lots of useful information, relevant images... and all scanned pages, so text extraction yielded nothing. I tried running OCR on them, but besides being full of inaccuracies, it returned all the text jumbled and out of order.

Ideally, if a customer asked about something that appears in the catalog, the language model could say: "you can see an image of the product and other similar ones on page 12 of the catalog," and link to it. For the moment, I set the problem aside and omitted that information from the RAG index.

A couple of weeks ago, I looked into this again. I knew there were quite a few innovative solutions for converting PDFs to Markdown using new models, with very promising results, so I explored options like MinerU, MarkItDown, Nougat, and Vision Parse. They use a combination of LLMs for text and vision-language models (VLMs) for document images, trying to reproduce the document's content in Markdown as faithfully as possible. Vision Parse was especially good at this. But the result was nothing like what I was looking for, because the catalog wasn't a simple text document with interspersed images: each page was a full-sheet image, with different photographic compositions of the products and some accompanying product data.
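For reference, this is roughly what that conversion approach looks like in code; a minimal sketch using MarkItDown, where the file name is just a placeholder:

```python
# Minimal sketch of the PDF-to-Markdown conversion approach, using
# MarkItDown as an example. "catalog.pdf" is a placeholder file name.
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("catalog.pdf")

# For a text-based PDF this yields usable Markdown; for a catalog made
# of full-page scanned images there is little or no text to extract.
print(result.text_content)
```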

What I needed was a detailed, contextualized description, page by page, of what was in the catalog. That way, the LLM would have access to the same information I would perceive when looking at the page: a description that would let someone who isn't seeing that page form the most accurate mental image possible. I knew there were vision models capable enough for this, so I could query them page by page, but doing that by hand was very slow. Since I couldn't find any open-source application that did exactly that, I created and published one: it's called DescribePDF.

DescribePDF is not only useful for describing documents like catalogs that don't lend themselves to text conversion by other methods; it can also explain what a PDF contains to a person with a visual impairment.

Although I've created a version for local execution, if you don't have a computer with significant graphics processing power, I recommend the version that works with OpenRouter. For a limited time, you can try it for free, without adding your own API key, at https://huggingface.co/spaces/davidlms/describepdf.
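If you'd rather script it than use the web interface, any public Gradio Space can also be called from Python with gradio_client. The endpoint name and parameters below are assumptions, not DescribePDF's actual signature; the Space's "Use via API" page documents the real ones:

```python
# Hypothetical sketch of calling the DescribePDF Space programmatically.
# The api_name and argument list are assumptions; check the Space's
# "Use via API" page for the real signature.
from gradio_client import Client, handle_file

client = Client("davidlms/describepdf")
result = client.predict(
    handle_file("catalog.pdf"),  # placeholder local file
    api_name="/describe",        # assumed endpoint name
)
print(result)
```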

I hope you like it! Any suggestions for improvement or recommendations of other applications that are more convenient for this use case are more than welcome.

Source: https://www.linkedin.com/posts/david-romero-santos_describepdf-a-hugging-face-space-by-davidlms-activity-7321815000854020097-zCZs

MCP Server and page selector - #2

Published May 2, 2025

Updates to DescribePDF: MCP Server and Page Selection!

The team behind Gradio keeps adding interesting features. Since last Wednesday, any application can become an MCP server with a single-line code change or by setting an environment variable (see the sketch after this list). This enables three-in-one applications:

  • Web interface designed for human users
  • API for other applications
  • MCP for Large Language Models (LLMs)
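As a sketch of how small the change is, here is a toy Gradio app standing in for a real one, assuming a recent Gradio version with MCP support:

```python
# Toy Gradio app standing in for a real one. With mcp_server=True,
# Gradio exposes the same function as an MCP tool alongside the web
# UI and the API. Alternatively, set the environment variable
# GRADIO_MCP_SERVER=True instead of changing any code.
import gradio as gr

def letter_count(word: str, letter: str) -> int:
    """Count how many times a letter appears in a word."""
    return word.lower().count(letter.lower())

demo = gr.Interface(fn=letter_count, inputs=["text", "text"], outputs="number")
demo.launch(mcp_server=True)  # the single-line change
```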

What is MCP? The Model Context Protocol (MCP) is a standard protocol that allows language models to interact directly with external tools. This means that AI assistants like Claude can understand your application's capabilities and use it as a tool, without the need for additional programming.

For this reason, DescribePDF now also functions as an MCP server, including the version available in the Hugging Face test environment (https://huggingface.co/spaces/davidlms/describepdf). In the images below, you can see a comparison of an MCP client, the Claude desktop app, before and after connecting it to DescribePDF's MCP server (a client-side sketch follows the images). As a result, DescribePDF becomes a utility that autonomous AI agent systems can easily use.

[Image: Claude before connecting to the DescribePDF MCP server]

[Image: Claude after connecting to the DescribePDF MCP server]
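To get a feel for what an MCP client like Claude does when it connects, here is a hedged sketch using the official mcp Python SDK. The SSE endpoint path follows Gradio's convention for MCP-enabled apps, and the Space URL is an assumption; verify both before relying on them:

```python
# Sketch of an MCP client discovering DescribePDF's tools over SSE,
# using the official `mcp` Python SDK. The URL below is an assumption
# based on the usual Hugging Face Space subdomain and Gradio's MCP
# endpoint convention; verify it for the actual Space.
import asyncio
from mcp import ClientSession
from mcp.client.sse import sse_client

URL = "https://davidlms-describepdf.hf.space/gradio_api/mcp/sse"  # assumed

async def main():
    async with sse_client(URL) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()          # MCP handshake
            tools = await session.list_tools()  # discover available tools
            for tool in tools.tools:
                print(tool.name, "-", tool.description)

asyncio.run(main())
```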

I thought it would be interesting to take this opportunity to add an option to process only a specific set of PDF pages. This should be useful for very long documents, or for mixed documents containing both text-only pages (which can be processed more efficiently by other methods) and pages that require visual interpretation by AI.
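For mixed documents, you can also pre-split the pages yourself before uploading; a minimal sketch using pypdf, where the file names and page range are placeholders:

```python
# Minimal sketch: extract only the pages that need visual
# interpretation (here, pages 10-15 in human numbering) into a new
# PDF before describing them. File names are placeholders.
from pypdf import PdfReader, PdfWriter

reader = PdfReader("catalog.pdf")
writer = PdfWriter()

for page in reader.pages[9:15]:  # zero-indexed slice for pages 10-15
    writer.add_page(page)

with open("catalog_pages_10_15.pdf", "wb") as f:
    writer.write(f)
```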

What do you think would be the most interesting next step in the development of DescribePDF?

Source: https://www.linkedin.com/posts/david-romero-santos_novedades-en-describepdf-servidor-mcp-y-activity-7323971759068590080-NNYd
