
chore(deps): update dependency unstructured to ^0.14.0 [security]#29

Open
renovate[bot] wants to merge 1 commit into dev from
renovate/pypi-unstructured-vulnerability

Conversation


@renovate renovate bot commented Dec 26, 2025

Note: This PR body was truncated due to platform limits.

This PR contains the following updates:

Package: unstructured
Change: ^0.5.11 → ^0.14.0

GitHub Vulnerability Alerts

CVE-2024-46455

unstructured v0.14.2 and earlier is vulnerable to XML External Entity (XXE) injection via the XMLParser.
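The fix (shipped in v0.14.3 below) disables entity resolution when parsing XML with lxml. A minimal sketch of that hardening pattern, assuming plain lxml usage; this is not the library's actual parsing code:

```python
# Minimal sketch of the XXE hardening pattern (resolve_entities=False),
# not unstructured's actual code.
from lxml import etree

untrusted_xml = b"""<?xml version="1.0"?>
<!DOCTYPE doc [<!ENTITY leak SYSTEM "file:///etc/hostname">]>
<doc>&leak;</doc>"""

# With entity resolution disabled, the &leak; reference is not expanded,
# so external content cannot be injected into the parsed text.
parser = etree.XMLParser(resolve_entities=False)
root = etree.fromstring(untrusted_xml, parser=parser)
print(root.text)  # the entity stays unresolved rather than pulling in a file
```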


Release Notes

Unstructured-IO/unstructured (unstructured)

v0.14.3

Compare Source

Enhancements
  • Move category field from Text class to Element class.
  • partition_docx() now supports pluggable picture sub-partitioners. A subpartitioner that accepts a DOCX Paragraph and generates elements is now supported. This allows adding a custom sub-partitioner that extracts images and applies OCR or summarization for the image.
  • Add VoyageAI embedder Adds VoyageAI embeddings to support embedding via Voyage AI.
Features
Fixes
  • Fix partition_pdf() to keep spaces in the text. The control character \t is now replaced with a space instead of being removed when merging inferred elements with embedded elements.
  • Turn off XML resolve entities Sets resolve_entities=False for XML parsing with lxml
    to avoid text being dynamically injected into the XML document.
  • Add backward compatibility for the deprecated pdf_infer_table_structure parameter.
  • Add the missing form_extraction_skip_tables argument to the partition_pdf_or_image call.
  • Chromadb: change from add to upsert using element_id to make writes idempotent
  • Disable table_as_cells output by default to reduce overhead in partition; table_as_cells is now only produced when the env variable EXTRACT_TABLE_AS_CELLS is true
  • Reduce excessive logging Change per-page OCR logging from info level to trace level
  • Replace try block in document_to_element_list for handling HTMLDocument Use getattr(element, "type", "") to get the type attribute of an element when it exists. This is a more explicit way to handle the special case for HTML documents and prevents other types of attribute error from being silenced by the try block

v0.14.2

Compare Source

Enhancements
  • Bump unstructured-inference==0.7.33.
Features
  • Add attribution to the pinecone connector.
Fixes

v0.14.0

Compare Source

BREAKING CHANGES
  • Turn table extraction for PDFs and images off by default. Reverting the default behavior for table extraction to "off" for PDFs and images. A number of users didn't realize we made the change and were impacted by slower processing times due to the extra model call for table extraction.
Enhancements
  • Skip unnecessary element sorting in partition_pdf(). Skip element sorting when determining whether embedded text can be extracted.
  • Faster evaluation Support for concurrent processing of documents during evaluation
  • Add strategy parameter to partition_docx(). Behavior of future enhancements may be sensitive to the partitioning strategy. Add this parameter so partition_docx() is aware of the requested strategy.
  • Add GLOBAL_WORKING_DIR and GLOBAL_WORKING_PROCESS_DIR configuration parameters to control temporary storage.
Features
  • Add form extraction basics (document elements and placeholder code in partition). This is to lay the groundwork for the future. Form extraction models are not currently available in the library. An attempt to use this functionality will end in a NotImplementedError.
Fixes
  • Add missing starting_page_num param to partition_image
  • Make the filename and file params for partition_image and partition_pdf match the other partitioners
  • Fix include_slide_notes and include_page_breaks params in partition_ppt
  • Re-apply: skip accuracy calculation feature Overwritten by mistake
  • Fix type hint for paragraph_grouper param paragraph_grouper can be set to False, but the type hint did not reflect this previously.
  • Remove links param from partition_pdf links is extracted during partitioning and is not needed as a parameter in partition_pdf.
  • Improve CSV delimiter detection. partition_csv() would raise on CSV files with very long lines.
  • Fix disk-space leak in partition_doc(). Remove temporary file created but not removed when file argument is passed to partition_doc().
  • Fix possible SyntaxError or SyntaxWarning on regex patterns. Change regex patterns to raw strings to avoid these warnings/errors in Python 3.11+.
  • Fix disk-space leak in partition_odt(). Remove temporary file created but not removed when file argument is passed to partition_odt().
  • AstraDB: option to prevent indexing metadata
  • Fix Missing py.typed

v0.13.7

Compare Source

Enhancements
  • Remove page_number metadata fields for HTML partition until we have a better strategy to decide page counting.
  • Extract OCRAgent.get_agent(). Generalize access to the configured OCRAgent instance beyond its use for PDFs.
  • Add calculation of table related metrics which take into account colspans and rowspans
  • Evaluation: skip accuracy calculation for files for which output and ground truth sizes differ greatly
Features
  • add ability to get ratio of cid characters in embedded text extracted by pdfminer.
Fixes
  • partition_docx() handles short table rows. The DOCX format allows a table row to start late and/or end early, meaning cells at the beginning or end of a row can be omitted. While there are legitimate uses for this capability, using it in practice is relatively rare. However, it can happen unintentionally when adjusting cell borders with the mouse. Accommodate this case and generate accurate .text and .metadata.text_as_html for these tables.
  • Remedy macOS test failure not triggered by CI. Generalize temp-file detection beyond hard-coded Linux-specific prefix.
  • Remove unnecessary warning log for using default layout model.
  • Add chunking to partition_tsv Even though partition_tsv() produces a single Table element, chunking is made available because the Table element is often larger than the desired chunk size and must be divided into smaller chunks.

v0.13.6

Compare Source

Enhancements
Features
Fixes
  • Fix ValueError: Invalid file (FileType.UNK) when parsing a Content-Type header with a charset directive. URL response Content-Type headers are now parsed according to RFC 9110 (a stdlib sketch follows below).
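For context, a hedged sketch of RFC 9110-style Content-Type parsing using only the standard library; the library's actual implementation may differ:

```python
# Hedged sketch: split a Content-Type header into media type and charset,
# so a charset directive does not confuse file-type detection.
from email.message import Message

def media_type_and_charset(content_type: str):
    """Split e.g. 'text/html; charset=UTF-8' into ('text/html', 'utf-8')."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_type(), msg.get_content_charset()

print(media_type_and_charset("text/html; charset=UTF-8"))
# ('text/html', 'utf-8') -- the charset directive no longer masks the media type
```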

v0.13.5

Compare Source

Enhancements
Features
Fixes
  • KeyError raised when updating parent_id In the past, combining ListItem elements could result in reusing the same memory location which then led to unexpected side effects when updating element IDs.
  • Bump unstructured-inference==0.7.29: table transformer predictions are now removed if confidence is below threshold

v0.13.4

Compare Source

Enhancements
  • Unique and deterministic hash IDs for elements Element IDs produced by any partitioning
    function are now deterministic and unique at the document level by default. Before, hashes were
    based only on text; however, they now also take into account the element's sequence number on a
    page, the page's number in the document, and the document's file name (an illustrative sketch follows this list).
  • Enable remote chunking via unstructured-ingest Chunking using unstructured-ingest was
    previously limited to local chunking using the strategies basic and by_title. Remote chunking
    options via the API are now accessible.
  • Save table in cells format. UnstructuredTableTransformerModel is able to return predicted table in cells format
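For illustration, a hedged sketch of a deterministic ID scheme that combines the inputs described above (text, on-page sequence number, page number, file name); this is not unstructured's actual hashing code:

```python
# Illustrative only: a deterministic, document-unique ID built from the
# element's text, its sequence number on the page, the page number, and the
# file name. The real library's algorithm may differ.
import hashlib

def deterministic_element_id(text: str, seq_on_page: int, page_number: int, filename: str) -> str:
    key = f"{filename}|{page_number}|{seq_on_page}|{text}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:32]

# The same inputs always produce the same ID; changing any component changes it.
print(deterministic_element_id("Introduction", 0, 1, "report.pdf"))
```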
Features
  • Add a PDF_ANNOTATION_THRESHOLD environment variable to control the capture of embedded links in partition_pdf() for fast strategy (a usage sketch follows this list).
  • Add integration with the Google Cloud Vision API. Adds a third OCR provider, alongside Tesseract and Paddle: the Google Cloud Vision API.
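A hedged usage sketch of the PDF_ANNOTATION_THRESHOLD variable named above; the value, file name, and exact threshold semantics are assumptions:

```python
# Hedged sketch: value and file name are placeholders; check the library docs
# for the variable's exact semantics.
import os

os.environ["PDF_ANNOTATION_THRESHOLD"] = "0.9"  # assumed numeric threshold

from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(filename="example.pdf", strategy="fast")
for el in elements:
    if el.metadata.links:  # embedded links captured under the fast strategy
        print(el.metadata.links)
```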
Fixes
  • Remove ElementMetadata.section field. This field was unused, not populated by any partitioners.

v0.13.3

Compare Source

Enhancements
  • Remove duplicate image elements. Remove image elements identified by PDFMiner that have similar bounding boxes and the same text.
  • Add support for start_index in html links extraction
  • Add strategy arg value to _PptxPartitionerOptions. This makes this partitioning option available for sub-partitioners to come that may optionally use inference or other expensive operations to improve the partitioning.
  • Support pluggable sub-partitioner for PPTX Picture shapes. Use a distinct sub-partitioner for partitioning PPTX Picture (image) shapes and allow the default picture sub-partitioner to be replaced at run-time by one of the user's choosing.
  • Introduce starting_page_number parameter to partitioning functions It applies to those partitioners which support page_number in element's metadata: PDF, TIFF, XLSX, DOC, DOCX, PPT, PPTX.
  • Redesign the internal mechanism of assigning element IDs This allows for further enhancements related to element IDs such as deterministic and document-unique hashes. The way partitioning functions operate hasn't changed, which means unique_element_ids continues to be False by default, utilizing text hashes.
Features
Fixes
  • Add support for extracting text from tag tails in HTML. This fix adds ability to generate separate elements using tag tails.
  • Add support for extracting text from <b> tags in HTML Now partition_html() can extract text from <b> tags inside container tags (like <div>, <pre>).
  • Fix pip-compile make target Adds the missing base.in dependency to the requirements make file.

v0.13.2

Compare Source

Enhancements
Features
Fixes
  • Brings back missing word list files that caused partition failures in 0.13.1.

v0.13.1

Compare Source

Enhancements
  • Drop constraint on pydantic, supporting later versions All dependencies had pydantic pinned at an old version. This explicit pin was removed, allowing the latest version to be pulled in when requirements are compiled.
Features
  • Add a set of new ElementTypes to extend future element types
Fixes
  • Fix partition_html() swallowing some paragraphs. The partition_html() only considers elements with limited depth to avoid becoming the text representation of a giant div. This fix increases the limit value.
  • Fix SFTP Adds flag options to SFTP connector on whether to use ssh keys / agent, with flag values defaulting to False. This is to prevent looking for ssh files when using username and password. Currently, username and password are required, making that always the case.

v0.13.0

Compare Source

Enhancements
  • Add .metadata.is_continuation to text-split chunks. .metadata.is_continuation=True is added to second-and-later chunks formed by text-splitting an oversized Table element but not to their counterpart Text element splits. Add this indicator for CompositeElement to allow text-split continuation chunks to be identified for downstream processes that may wish to skip intentionally redundant metadata values in continuation chunks.
  • Add compound_structure_acc metric to table eval. Add a new property to unstructured.metrics.table_eval.TableEvaluation: composite_structure_acc, which is computed from the element level row and column index and content accuracy scores
  • Add .metadata.orig_elements to chunks. .metadata.orig_elements: list[Element] is added to chunks during the chunking process (when requested) to allow access to information from the elements each chunk was formed from. This is useful for example to recover metadata fields that cannot be consolidated to a single value for a chunk, like page_number, coordinates, and image_base64.
  • Add --include_orig_elements option to Ingest CLI. By default, when chunking, the original elements used to form each chunk are added to chunk.metadata.orig_elements for each chunk. The include_orig_elements parameter allows the user to turn off this behavior to produce a smaller payload when they don't need this metadata.
  • Add Google VertexAI embedder Adds VertexAI embeddings to support embedding via Google Vertex AI.
Features
  • Chunking populates .metadata.orig_elements for each chunk. This behavior allows the text and metadata of the elements combined to make each chunk to be accessed. This can be important for example to recover metadata such as .coordinates that cannot be consolidated across elements and so is dropped from chunks. This option is controlled by the include_orig_elements parameter to partition_*() or to the chunking functions. This option defaults to True so original elements are preserved by default. This behavior is not yet supported via the REST APIs or SDKs but will be in a closely subsequent PR to other unstructured repositories. The original elements will also not serialize or deserialize yet; this will also be added in a closely subsequent PR. (A usage sketch follows this list.)
  • Add Clarifai destination connector Adds support for writing partitioned and chunked documents into Clarifai.
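A hedged usage sketch of chunking with original-element metadata; the import paths and the include_orig_elements parameter are taken from the notes above and may differ in your installed version, and the file name is a placeholder:

```python
# Hedged sketch based on the changelog above; names may differ per version.
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

elements = partition(filename="example.pdf")  # placeholder file name
chunks = chunk_by_title(elements, include_orig_elements=True)

for chunk in chunks:
    # Elements each chunk was formed from, carrying per-element metadata
    # (e.g. page_number, coordinates) that cannot be consolidated per chunk.
    originals = chunk.metadata.orig_elements or []
    print(chunk.category, [el.metadata.page_number for el in originals])
```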
Fixes
  • Fix clean_pdfminer_inner_elements() to remove only pdfminer (embedded) elements merged with inferred elements. Previously, some embedded elements were removed even if they were not merged with inferred elements. Now, only embedded elements that are already merged with inferred elements are removed.
  • Clarify IAM Role Requirement for GCS Platform Connectors. The GCS Source Connector requires Storage Object Viewer and GCS Destination Connector requires Storage Object Creator IAM roles.
  • Change table extraction defaults Change table extraction defaults in favor of using skip_infer_table_types parameter and reflect these changes in documentation.
  • Fix OneDrive dates with inconsistent formatting Adds logic to conditionally support dates returned by office365 that may vary in date formatting or may be a datetime rather than a string. See previous fix for SharePoint
  • Adds tracking for AstraDB Adds tracking info so AstraDB can see what source called their api.
  • Support AWS Bedrock Embeddings in ingest CLI The configs required to instantiate the bedrock embedding class are now exposed in the api and the version of boto being used meets the minimum requirement to introduce the bedrock runtime required to hit the service.
  • Change MongoDB redacting Original redact secrets solution is causing issues in platform. This fix uses our standard logging redact solution.

v0.12.6

Compare Source

Enhancements
  • Improve ability to capture embedded links in partition_pdf() for fast strategy Previously, a threshold value that affects the capture of embedded links was set to a fixed value by default. Users can now specify the threshold value to better capture embedded links.
  • Refactor add_chunking_strategy decorator to dispatch by name. Add chunk() function to be used by the add_chunking_strategy decorator to dispatch chunking call based on a chunking-strategy name (that can be dynamic at runtime). This decouples chunking dispatch from only those chunkers known at "compile" time and enables runtime registration of custom chunkers.
  • Redefine table_level_acc metric for table evaluation. table_level_acc now is an average of individual predicted table's accuracy. A predicted table's accuracy is defined as the sequence matching ratio between itself and its corresponding ground truth table.
Features
  • Added Unstructured Platform Documentation The Unstructured Platform is currently in beta. The documentation provides how-to guides for setting up workflow automation, job scheduling, and configuring source and destination connectors.
Fixes
  • Partitioning raises on file-like object with .name not a local file path. When partitioning a file using the file= argument, and file is a file-like object (e.g. io.BytesIO) having a .name attribute, and the value of file.name is not a valid path to a file present on the local filesystem, FileNotFoundError is raised. This prevents use of the file.name attribute for downstream purposes to, for example, describe the source of a document retrieved from a network location via HTTP.
  • Fix SharePoint dates with inconsistent formatting Adds logic to conditionally support dates returned by office365 that may vary in date formatting or may be a datetime rather than a string.
  • Include warnings about the potential risk of installing a version of pandoc which does not support RTF files + instructions that will help resolve that issue.
  • Incorporate the install-pandoc Makefile recipe into relevant stages of CI workflow, ensuring it is a version that supports RTF input files.
  • Fix Google Drive source key Allow passing string for source connector key.
  • Fix table structure evaluations calculations Replaced special value -1.0 with np.nan and corrected row filtering of file metrics based on that.
  • Fix Sharepoint-with-permissions test Ignore permissions metadata, update test.
  • Fix table structure evaluations for edge case Fixes the issue when the prediction does not contain any table - no longer errors in such case.

v0.12.5

Compare Source

Enhancements
Features
  • Add date_from_file_object parameter to partition. If True and if file is provided via the file parameter, partition will infer the last modified date from the file's content. If False, last modified metadata will be None. (A usage sketch follows this list.)
  • Header and footer detection for fast strategy partition_pdf with fast strategy now
    detects elements that are in the top or bottom 5 percent of the page as headers and footers.
  • Add parent_element to overlapping case output Adds parent_element to the output for identify_overlapping_or_nesting_case and catch_overlapping_and_nested_bboxes functions.
  • Add table structure evaluation Adds a new function to evaluate the structure of a table and return a metric that represents the quality of the table structure. This function is used to evaluate the quality of the table structure and the table contents.
  • Add AstraDB destination connector Adds support for writing embedded documents into an AstraDB vector database.
  • Add OctoAI embedder Adds support for embeddings via OctoAI.
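A hedged usage sketch of the date_from_file_object parameter described above; the file name is a placeholder:

```python
# Hedged sketch; file name is a placeholder.
from unstructured.partition.auto import partition

with open("example.docx", "rb") as f:
    elements = partition(file=f, date_from_file_object=True)

# With date_from_file_object=True the last-modified date is inferred from the
# file's content; with False (and only `file` provided) it would be None.
print(elements[0].metadata.last_modified)
```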
Fixes
  • Fix passing list type parameters when calling unstructured API via partition_via_api() Update partition_via_api() to convert all list type parameters to JSON formatted strings before calling the unstructured client SDK. This will support image block extraction via partition_via_api().
  • Fix check_connection in opensearch, databricks, postgres, azure connectors
  • Fix don't treat plain text files with double quotes as JSON If a file can be deserialized as JSON but it deserializes as a string, treat it as plain text even though it's valid JSON.
  • Fix cluster of bugs in partition_xlsx() that dropped content. Algorithm for detecting "subtables" within a worksheet dropped table elements for certain patterns of populated cells such as when a trailing single-cell row appeared in a contiguous block of populated cells.
  • Improved documentation. Fixed broken links and improved readability on Key Concepts page.
  • Rename OpenAiEmbeddingConfig to OpenAIEmbeddingConfig.
  • Fix partition_json() doesn't chunk. The @add_chunking_strategy decorator was missing from partition_json() such that pre-partitioned documents serialized to JSON did not chunk when a chunking-strategy was specified.

v0.12.4

Compare Source

Enhancements
  • Apply New Version of black formatting The black library recently introduced a new major version that introduces new formatting conventions. This change brings code in the unstructured repo into compliance with the new conventions.
  • Move ingest imports to local scopes Moved ingest dependencies into local scopes to be able to import ingest connector classes without the need of installing imported external dependencies. This allows lightweight use of the classes (not the instances; to use the instances as intended you'll still need the dependencies).
  • Add support for .p7s files partition_email can now process .p7s files. The signature for the signed message is extracted and added to metadata.
  • Fallback to valid content types for emails If the user-selected content type does not exist on the email message, partition_email now falls back to another valid content type if it's available.
Features
  • Add .heic file partitioning .heic image files were previously unsupported and are now supported through partition_image()
  • Add the ability to specify an alternate OCR implementation by implementing an OCRAgent interface and specify it using OCR_AGENT environment variable.
  • Add Vectara destination connector Adds support for writing partitioned documents into a Vectara index.
  • Add ability to detect text in .docx inline shapes Extends DOCX partitioning to extract text from inline shapes and include it in the paragraph's text.
Fixes
  • Fix partition_pdf() not working when using chipper model with file
  • Handle common incorrect arguments for languages and ocr_languages Users are regularly receiving errors on the API because they are defining ocr_languages or languages with additional quotation marks, brackets, and similar mistakes. This update handles common incorrect arguments and raises an appropriate warning.
  • Default hi_res_model_name now relies on unstructured-inference When no explicit hi_res_model_name is passed into partition or partition_pdf_or_image, the default model is picked by unstructured-inference's settings or the os env variable UNSTRUCTURED_HI_RES_MODEL_NAME; it now returns the same model name regardless of infer_table_structure's value. This function will be deprecated in the future and the default model name will then rely solely on unstructured-inference, without considering the os env variable.
  • Fix remove Vectara requirements from setup.py - there are no dependencies
  • Add missing dependency files to package manifest. Updates the file path for the ingest
    dependencies and adds missing extra dependencies.
  • Add title to Vectara upload - was not separated out from initial connector
  • Fix change OpenSearch port to fix potential conflict with Elasticsearch in ingest test

v0.12.3

Compare Source

Enhancements
  • Driver for MongoDB connector. Adds a driver with unstructured version information to the
    MongoDB connector.
Features
  • Add Databricks Volumes destination connector Databricks Volumes connector added to ingest CLI. Users may now use unstructured-ingest to write partitioned data to a Databricks Volumes storage service.
Fixes
  • Fix support for different Chipper versions and prevent running PDFMiner with Chipper
  • Treat YAML files as text. Adds YAML MIME types to the file detection code and treats those
    files as text.
  • Fix FSSpec destination connectors check_connection. FSSpec destination connectors did not use check_connection. There was an error when trying to ls destination directory - it may not exist at the moment of connector creation. Now check_connection calls ls on bucket root and this method is called on initialize of destination connector.
  • Fix databricks-volumes extra location. setup.py is currently pointing to the wrong location for the databricks-volumes extra requirements. This results in errors when trying to build the wheel for unstructured. This change updates to point to the correct path.
  • Fix uploading None values to Chroma and Pinecone. Removes keys with None values with Pinecone and Chroma destinations. Pins Pinecone dependency
  • Update documentation. (i) best practice for table extraction by using the 'skip_infer_table_types' param instead of 'pdf_infer_table_structure', and (ii) fixed CSS, RST issues and a typo in the documentation.
  • Fix postgres storage of link_texts. Formatting of link_texts was breaking metadata storage.

v0.12.2

Compare Source

Enhancements
Features
Fixes
  • Fix index error in table processing. Bumps the unstructured-inference version to address an
    index error that occurs on some tables in the table transformer object.

v0.12.0

Compare Source

Enhancements
  • Drop support for python3.8 All dependencies are now built against a minimum Python version of 3.10

v0.11.8

Compare Source

Enhancements
  • Add SaaS API User Guide. This documentation serves as a guide for Unstructured SaaS API users to register, receive an API key and URL, and manage their account and billing information.
  • Add inter-chunk overlap capability. Implement overlap between chunks. This applies to all chunks prior to any text-splitting of oversized chunks so is a distinct behavior; overlap at text-splits of oversized chunks is independent of inter-chunk overlap (distinct chunk boundaries) and can be requested separately. Note this capability is not yet available from the API but will shortly be made accessible using a new overlap_all kwarg on partition functions.
Features
Fixes

v0.11.7

Compare Source

Enhancements
  • Add intra-chunk overlap capability. Implement overlap for split-chunks where text-splitting is used to divide an oversized chunk into two or more chunks that fit in the chunking window. Note this capability is not yet available from the API but will shortly be made accessible using a new overlap kwarg on partition functions.
  • Update encoders to leverage dataclasses All encoders now follow a class-based approach annotated with the dataclass decorator. Similar to the connectors, it uses a nested dataclass for the configs required to configure a client as well as a field/property approach to cache the client. This makes sure any variable associated with the class exists as a dataclass field.
Features
  • Add Qdrant destination connector. Adds support for writing documents and embeddings into a Qdrant collection.
  • Store base64 encoded image data in metadata fields. Rather than saving to file, stores base64 encoded data of the image bytes and the mimetype for the image in metadata fields: image_base64 and image_mime_type (if that is what the user specifies by some other param like pdf_extract_to_payload). This would allow the API to have parity with the library.
Fixes
  • Fix table structure metric script Update the call to table agent to now provide OCR tokens as required
  • Fix element extraction not working when using "auto" strategy for pdf and image If element extraction is specified, the "auto" strategy falls back to the "hi_res" strategy.
  • Fix a bug passing a custom url to partition_via_api Users that self host the api were not able to pass their custom url to partition_via_api.

v0.11.6

Compare Source

Enhancements
  • Update the layout analysis script. The previous script only supported annotating final elements. The updated script also supports annotating inferred and extracted elements.
  • AWS Marketplace API documentation: Added the user guide, including setting up VPC and CloudFormation, to deploy Unstructured API on AWS platform.
  • Azure Marketplace API documentation: Improved the user guide to deploy Azure Marketplace API by adding references to Azure documentation.
  • Integration documentation: Updated URLs for the staging_for bricks
Features
  • Partition emails with base64-encoded text. Automatically handles and decodes base64 encoded text in emails with content type text/plain and text/html.
  • Add Chroma destination connector Chroma database connector added to ingest CLI. Users may now use unstructured-ingest to write partitioned/embedded data to a Chroma vector database.
  • Add Elasticsearch destination connector. Problem: After ingesting data from a source, users might want to move their data into a destination. Elasticsearch is a popular storage solution for various functionality such as search, or providing intermediary caches within data pipelines. Feature: Added Elasticsearch destination connector to be able to ingest documents from any supported source, embed them and write the embeddings / documents into Elasticsearch.
Fixes
  • Enable --fields argument omission for elasticsearch connector Solves two bugs where removing the optional parameter --fields broke the connector due to an integer processing error and using an elasticsearch config for a destination connector resulted in a serialization issue when optional parameter --fields was not provided.
  • Add hi_res_model_name Adds kwarg to relevant functions and add comments that model_name is to be deprecated.

v0.11.5

Compare Source

Enhancements
Features
Fixes
  • Fix partition_pdf() and partition_image() importation issue. Reorganize pdf.py and image.py modules to be consistent with other types of document import code.

v0.11.4

Compare Source

Enhancements
  • Refactor image extraction code. The image extraction code is moved from unstructured-inference to unstructured.
  • Refactor pdfminer code. The pdfminer code is moved from unstructured-inference to unstructured.
  • Improve handling of auth data for fsspec connectors. Leverage an extension of the dataclass paradigm to support a sensitive annotation for fields related to auth (i.e. passwords, tokens). Refactor all fsspec connectors to use explicit access configs rather than a generic dictionary.
  • Add glob support for fsspec connectors Similar to the glob support in the ingest local source connector, similar filters are now enabled on all fsspec based source connectors to limit files being partitioned.
  • Define a constant for the splitter "+" used in tesseract ocr languages.
Features
  • Save tables in PDF's separately as images. The "table" elements are saved as table-<pageN>-<tableN>.jpg. This filename is presented in the image_path metadata field for the Table element. The default would be to not do this.
  • Add Weaviate destination connector Weaviate connector added to ingest CLI. Users may now use unstructured-ingest to write partitioned data from over 20 data sources (so far) to a Weaviate object collection.
  • Sftp Source Connector. New source connector added to support downloading/partitioning files from Sftp.
Fixes
  • Fix pdf hi_res partitioning failure when pdfminer fails. Implemented logic to fall back to the "inferred_layout + OCR" if pdfminer fails in the hi_res strategy.
  • Fix a bug where image can be scaled too large for tesseract Adds a limit to prevent auto-scaling an image beyond the maximum size tesseract can handle for ocr layout detection
  • Update partition_csv to handle different delimiters CSV files containing both non-comma delimiters and commas in the data were throwing an error in Pandas. partition_csv now identifies the correct delimiter before the file is processed (a stdlib sketch of the technique follows this list).
  • partition returning cid code in hi_res Occasionally pdfminer can fail to decode the text in a pdf file and returns cid codes as text. Now when this happens the text from OCR is used.
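For context, the general delimiter-detection technique can be illustrated with the standard library's csv.Sniffer; this is not partition_csv's actual implementation:

```python
# Stdlib illustration of delimiter detection, not unstructured's code.
import csv
import io

sample = "name;city;note\nAda;London;likes commas, semicolons\n"
dialect = csv.Sniffer().sniff(sample, delimiters=";,\t")
print(dialect.delimiter)  # ';' even though commas appear inside a field

rows = list(csv.reader(io.StringIO(sample), dialect=dialect))
print(rows[1])  # ['Ada', 'London', 'likes commas, semicolons']
```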

v0.11.2

Compare Source

Enhancements
  • Updated Documentation: (i) Added examples, and (ii) API Documentation, including Usage, SDKs, Azure Marketplace, and parameters and validation errors.
Features
  • Add Pinecone destination connector. Problem: After ingesting data from a source, users might want to produce embeddings for their data and write these into a vector DB. Pinecone is an option among these vector databases. Feature: Added Pinecone destination connector to be able to ingest documents from any supported source, embed them and write the embeddings / documents into Pinecone.
Fixes
  • Process chunking parameter names in ingest correctly Solves a bug where chunking parameters weren't being processed and used by ingest cli by renaming faulty parameter names and prepends; adds relevant parameters to ingest pinecone test to verify that the parameters are functional.

v0.11.1

Compare Source

Enhancements
  • Use pikepdf to repair invalid PDF structure for PDFminer when we see error PSSyntaxError when PDFminer opens the document and creates the PDFminer pages object or processes a single PDF page.
  • Batch Source Connector support For instances where it is more optimal to read content from a source connector in batches, a new batch ingest doc is added which creates multiple ingest docs after reading them in batches per process.
Features
  • Staging Brick for Coco Format Staging brick which converts a list of Elements into Coco Format.
  • Adds HubSpot connector Adds connector to retrieve calls, communications, emails, notes, products and tickets from HubSpot
Fixes
  • Do not extract text of <style> tags in HTML. <style> tags containing CSS in invalid positions previously contributed to element text. Do not consider text node of a <style> element as textual content.
  • Fix DOCX merged table cell repeats cell text. Only include text for a merged cell, not for each underlying cell spanned by the merge.
  • Fix tables not extracted from DOCX header/footers. Headers and footers in DOCX documents skip tables defined in the header and commonly used for layout/alignment purposes. Extract text from tables as a string and include in the Header and Footer document elements.
  • Fix output filepath for fsspec-based source connectors. Previously the base directory was being included in the output filepath unnecessarily.

v0.11.0

Compare Source

Enhancements
  • Add a class for the strategy constants. Add a class PartitionStrategy for the strategy constants and use the constants to replace strategy strings.
  • Temporary Support for paddle language parameter. Users can specify a default language code for paddle with the ENV DEFAULT_PADDLE_LANG before we have the language mapping for paddle.
  • Improve DOCX page-break fidelity. Improve page-break fidelity such that a paragraph containing a page-break is split into two elements, one containing the text before the page-break and the other the text after. Emit the PageBreak element between these two and assign the correct page-number (n and n+1 respectively) to the two textual elements.
Features
  • Add ad-hoc fields to ElementMetadata instance. End-users can now add their own metadata fields simply by assigning to an element-metadata attribute-name of their choice, like element.metadata.coefficient = 0.58. These fields will round-trip through JSON and can be accessed with dotted notation (a usage sketch follows this list).
  • MongoDB Destination Connector. New destination connector added to all CLI ingest commands to support writing partitioned json output to mongodb.
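A hedged usage sketch of ad-hoc metadata fields; the coefficient field is the changelog's own example, the file name is a placeholder, and the serialization helpers are assumed to live in unstructured.staging.base:

```python
# Hedged sketch; `coefficient` is the changelog's example of an ad-hoc field.
from unstructured.partition.auto import partition
from unstructured.staging.base import elements_to_json, elements_from_json

elements = partition(filename="example.pdf")
elements[0].metadata.coefficient = 0.58  # user-defined, ad-hoc field

json_str = elements_to_json(elements)            # ad-hoc field is serialized
round_tripped = elements_from_json(text=json_str)
print(round_tripped[0].metadata.coefficient)     # 0.58
```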
Fixes
  • Fix TYPE_TO_TEXT_ELEMENT_MAP. Updated Figure mapping from FigureCaption to Image.
  • Handle errors when extracting PDF text Certain pdfs throw unexpected errors when being opened by pdfminer, causing partition_pdf() to fail. We expect to be able to partition smoothly using an alternative strategy if text extraction doesn't work. Added exception handling to handle unexpected errors when extracting pdf text and to help determine pdf strategy.
  • Fix fast strategy fall back to ocr_only The fast strategy should not fall back to a more expensive strategy.
  • Remove default user ./ssh folder The default notebook user during image build would create the known_hosts file with incorrect ownership; this is legacy and no longer needed, so it was removed.
  • Include languages in metadata when partitioning strategy=hi_res or fast User-defined languages were previously used for text detection, but not included in the resulting element metadata for some strategies. languages will now be included in the metadata regardless of partition strategy for pdfs and images.
  • Handle a case where Paddle returns a list item in ocr_data as None In partition, while parsing PaddleOCR data, it was assumed that PaddleOCR does not return None for any list item in ocr_data. Removed the assumption by skipping the text region whenever this happens.
  • Fix some pdfs returning KeyError: 'N' Certain pdfs were throwing this error when being opened by pdfminer. Added a wrapper function for pdfminer that allows these documents to be partitioned.
  • Fix mis-splits on Table chunks. Remedies repeated appearance of full .text_as_html on metadata of each TableChunk split from a Table element too large to fit in the chunking window.
  • Import tables_agent from inference so that we don't have to initialize a global table agent in unstructured OCR again
  • Fix empty table is identified as bulleted-table. A table with no text content was mistakenly identified as a bulleted-table and processed by the wrong branch of the initial HTML partitioner.
  • Fix partition_html() emits empty (no text) tables. A table with cells nested below a <thead> or <tfoot> element was emitted as a table element having no text and unparseable HTML in element.metadata.text_as_html. Do not emit empty tables to the element stream.
  • Fix HTML element.metadata.text_as_html contains spurious <br> elements in invalid locations. The HTML generated for the text_as_html metadata for HTML tables contained <br> elements in invalid locations, like between <table> and <tr>. Change the HTML generator such that these do not appear.
  • Fix HTML table cells enclosed in <thead> and <tfoot> elements are dropped. HTML table cells nested in a <thead> or <tfoot> element were not detected and the text in those cells was omitted from the table element text and .text_as_html. Detect table rows regardless of the semantic tag they may be nested in.
  • Remove whitespace padding from .text_as_html. tabulate inserts padding spaces to achieve visual alignment of columns in HTML tables it generates. Add our own HTML generator to do this simple job and omit that padding as well as newlines ("\n") used for human readability.
  • Fix local connector with absolute input path When passed an absolute filepath for the input document path, the local connector incorrectly writes the output file to the input file directory. This fixes such that the output in this case is written to output-dir/input-filename.json

v0.10.30

Compare Source

Enhancements
  • Support nested DOCX tables. In DOCX, like HTML, a table cell can itself contain a table. In this case, create nested HTML tables to reflect that structure and create a plain-text table that captures all the text in nested tables, formatting it as a reasonable facsimile of a table.
  • Add connection check to ingest connectors Each source and destination connector now supports a check_connection() method which makes sure a valid connection can be established with the source/destination given any authentication credentials in a lightweight request.
Features
  • Add functionality to do a second OCR on cropped table images. Changes to the values for scaling ENVs affect entire page OCR output (OCR regression) so we now do a second OCR for tables.
  • Adds ability to pass timeout for a request when partitioning via a url. partition now accepts a new optional parameter request_timeout which if set will prevent any requests.get from hanging indefinitely and instead will raise a timeout error. This is useful when partitioning a url that may be slow to respond or may not respond at all.
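A hedged usage sketch of the request_timeout parameter described above; the URL is a placeholder:

```python
# Hedged sketch; the URL is a placeholder.
from unstructured.partition.auto import partition

# Raises a timeout error instead of hanging if the server never responds.
elements = partition(url="https://example.com/some-document.html", request_timeout=10)
print(len(elements))
```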
Fixes
  • Fix logic that determines pdf auto strategy. Previously, _determine_pdf_auto_strategy returned hi_res strategy only if infer_table_structure was true. It now returns the hi_res strategy if either infer_table_structure or extract_images_in_pdf is true.
  • Fix invalid coordinates when parsing tesseract ocr data. Previously, when parsing tesseract ocr data, the ocr data had invalid bboxes if zoom was set to 0. A logical check is now added to avoid such error.
  • Fix ingest partition parameters not being passed to the api. When using the --partition-by-api flag via unstructured-ingest, none of the partition arguments are forwarded, meaning that these options are disregarded. With this change, we now pass through all of the relevant partition arguments to the api. This allows a user to specify all of the same partition arguments they would locally and have them respected when specifying --partition-by-api.
  • Support tables in section-less DOCX. Generalize solution for MS Chat Transcripts exported as DOCX by including tables in the partitioned output when present.
  • Support tables that contain only numbers when partitioning via ocr_only Tables that contain only numbers are returned as floats in a pandas.DataFrame when the image is converted from .image_to_data(). An AttributeError was raised downstream when trying to .strip() the floats.
  • Improve DOCX page-break detection. DOCX page breaks are reliably indicated by w:lastRenderedPageBreak elements present in the document XML. Page breaks are NOT reliably indicated by "hard" page-breaks inserted by the author and when present are redundant to a w:lastRenderedPageBreak element so cause over-counting if used. Use rendered page-breaks only.

v0.10.29

Compare Source

Enhancements
  • Adds include_header argument for partition_csv and partition_tsv Header rows can now be retained when partitioning CSV and TSV documents into elements.
  • Add retry logic for all source connectors All http calls being made by the ingest source connectors have been isolated and wrapped by the SourceConnectionNetworkError custom error, which triggers the retry logic, if enabled, in the ingest pipeline.
  • Google Drive source connector supports credentials from memory Originally, the connector expected a filepath to pull the credentials from when creating the client. This was expanded to support passing that information from memory as a dict if access to the file system might not be available.
  • Add support for generic partition configs in ingest cli Along with the explicit partition options supported by the cli, an additional_partition_args arg was added to allow users to pass in any other arguments that should be added when calling partition(). This helps keep any changes to the input parameters of the partition() exposed in the CLI.
  • Map full output schema for table-based destination connectors A full schema was introduced to map the type of all output content from the json partition output and mapped to a flattened table structure to leverage table-based destination connectors. The delta table destination connector was updated at the moment to take advantage of this.
  • Incorporate multiple embedding model options into ingest, add diff test embeddings Problem: Ingest pipeline already supported embedding functionality, however users might want to use different types of embedding providers. Enhancement: Extend ingest pipeline so that users can specify and embed via a particular embedding provider from a range of options. Also adds a diff test to compare output from an embedding module with the expected output
Features
  • Allow setting table crop parameter In certain circumstances, adjusting the table crop padding may improve table extraction.
Fixes
  • Fixes partition_text to prevent empty elements Adds a check to filter out empty bullets.
  • Handle empty string for ocr_languages with values for languages Some API users ran into an issue with sending languages params because the API defaulted to also using an empty string for ocr_languages. This update handles situations where languages is defined and ocr_languages is an empty string.
  • Fix PDF tried to loop through None Previously the PDF annotation extraction tried to loop through annots that resolved out as None. A logical check was added to avoid this error.
  • Ingest session handler not being shared correctly All ingest docs that leverage the session handler should only need to set it once per process. It was recreating it each time because the right values weren't being set nor available given how dataclasses work in python.
  • Ingest download-only fix. Previously the download only flag was being checked after the doc factory pipeline step, which occurs before the files are actually downloaded by the source node. This check was moved after the source node to allow for the files to be downloaded first before exiting the pipeline.
  • Fix flaky chunk-metadata. Prior implementation was sensitive to element order in the section resulting in metadata values sometimes being dropped. Also, not all metadata items can be consolidated across multiple elements (e.g. coordinates) and so are now dropped from consolidated metadata.
  • Fix tesseract error Estimating resolution as X caused by invalid language parameter input. Proceed with the default language eng when lang.py fails to find a valid language code for tesseract, so that we don't pass an empty string to the tesseract CLI and raise an exception downstream.

v0.10.28

Compare Source

Enhancements
  • Add table structure evaluation helpers Adds functions to evaluate the similarity between predicted table structure and actual table structure.
  • Use yolox by default for table extraction when partitioning pdf/image yolox model provides higher recall of the table regions than the quantized version and it is now the default element detection model when infer_table_structure=True for partitioning pdf/image files
  • Remove pdfminer elements from inside tables Previously, when using hi_res some elements were extracted using pdfminer too, so we removed pdfminer from the tables pipeline to avoid duplicated elements.
  • Fsspec downstream connectors New destination connector added to ingest CLI, users may now use unstructured-ingest to write to any of the following:
    • Azure
    • Box
    • Dropbox
    • Google Cloud Service
Features
  • Update ocr_only strategy in partition_pdf() Adds the functionality to get accurate coordinate data when partitioning PDFs and Images with the ocr_only strategy.
Fixes
  • Fixed SharePoint permissions for the fetching to be opt-in Problem: Sharepoint permissions were trying to be fetched even when no related cli params were provided, and this gave an error due to values for those keys not existing. Fix: Updated getting keys to use the .get() method and changed the "skip-check" to check individual cli params rather than checking the existence of a config object.
  • Fixes issue where tables from markdown documents were being treated as text Problem: Tables from markdown documents were being treated as text, and not being extracted as tables. Solution: Enable the tables extension when instantiating the python-markdown object. Importance: This will allow users to extract structured data from tables in markdown documents.
  • Fix wrong logger for paddle info Replace the logger from unstructured-inference with the logger from unstructured for paddle_ocr.py module.
  • Fix ingest pipeline to be able to use chunking and embedding together Problem: When ingest pipeline was using chunking and embedding together, embedding outputs were empty and the outputs of chunking couldn't be re-read into memory and be forwarded to embeddings. Fix: Added CompositeElement type to TYPE_TO_TEXT_ELEMENT_MAP to be able to process CompositeElements with unstructured.staging.base.isd_to_elements
  • Fix unnecessary mid-text chunk-splitting. The "pre-chunker" did not consider separator blank-line ("\n\n") length when grouping elements for a single chunk. As a result, sections were frequently over-populated, producing an over-sized chunk that required mid-text splitting.
  • Fix frequent dissociation of title from chunk. The sectioning algorithm included the title of the next section with the prior section whenever it would fit, frequently producing association of a section title with the prior section and dissociating it from its actual section. Fix this by performing combination of whole sections only.
  • Fix PDF attempt to get dict value from string. Fixes a rare edge case that prevented some PDF's from being partitioned. The get_uris_from_annots function tried to access the dictionary value of a string instance variable. Assign None to the annotation variable if the instance type is not dictionary to avoid the erroneous attempt.

v0.10.27

Compare Source

Enhancements
  • Leverage dict to share content across ingest pipeline To share the ingest doc content across steps in the ingest pipeline, this was updated to use a multiprocessing-safe dictionary so changes get persisted and each step has the option to modify the ingest docs in place.
Features
Fixes
  • Removed ebooklib as a dependency ebooklib is licensed under AGPL3, which is incompatible with the Apache 2.0 license. Thus it is being removed.
  • Caching fixes in ingest pipeline Previously, steps like the source node were not leveraging parameters such as re_download to dictate if files should be forced to redownload rather than use what might already exist locally.

v0.10.26

Compare Source

Enhancements
  • Add text CCT CI evaluation workflow Adds cct text extraction evaluation metrics to the current ingest workflow to measure the performance of each file extracted as well as aggregated-level performance.
Features
  • Functionality to catch and classify overlapping/nested elements Method to identify overlapping-bboxes cases within detected elements in a document. It returns two values: a boolean defining if there are overlapping elements present, and a list reporting them with relevant metadata. The output includes information about the overlapping_elements, overlapping_case, overlapping_percentage, largest_ngram_percentage, overlap_percentage_total, max_area, min_area, and total_area.
  • Add Local connector source metadata Python's os module is used to pull stats from the local file when processing via the local connector, populating fields such as last modified time and created time.
Fixes
  • Fixes elements partitioned from an image fil…

Configuration

📅 Schedule: Branch creation - "" (UTC), Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

This PR was generated by Mend Renovate. View the repository job log.


renovate bot commented Dec 26, 2025

⚠️ Artifact update problem

Renovate failed to update an artifact related to this branch. You probably do not want to merge this PR as-is.

♻ Renovate will retry this branch, including artifacts, only when one of the following happens:

  • any of the package files in this branch needs updating, or
  • the branch becomes conflicted, or
  • you click the rebase/retry checkbox if found above, or
  • you rename this PR's title to start with "rebase!" to trigger it manually

The artifact failure details are included below:

File name: poetry.lock
Updating dependencies
Resolving dependencies...


The current project's Python requirement (>=3.9,<4.0) is not compatible with some of the required packages Python requirement:
  - unstructured requires Python <3.12,>=3.9.0, so it will not be satisfied for Python >=3.12,<4.0
  - unstructured requires Python <3.13,>=3.9.0, so it will not be satisfied for Python >=3.13,<4.0
  - unstructured requires Python <3.13,>=3.9.0, so it will not be satisfied for Python >=3.13,<4.0
  - unstructured requires Python <3.13,>=3.9.0, so it will not be satisfied for Python >=3.13,<4.0
  - unstructured requires Python <3.13,>=3.9.0, so it will not be satisfied for Python >=3.13,<4.0
  - unstructured requires Python <3.13,>=3.9.0, so it will not be satisfied for Python >=3.13,<4.0
  - unstructured requires Python <3.13,>=3.9.0, so it will not be satisfied for Python >=3.13,<4.0
  - unstructured requires Python <3.13,>=3.9.0, so it will not be satisfied for Python >=3.13,<4.0
  - unstructured requires Python <3.13,>=3.9.0, so it will not be satisfied for Python >=3.13,<4.0
  - unstructured requires Python <3.13,>=3.9.0, so it will not be satisfied for Python >=3.13,<4.0

Because no versions of unstructured match >0.14.0,<0.14.2 || >0.14.2,<0.14.3 || >0.14.3,<0.14.4 || >0.14.4,<0.14.5 || >0.14.5,<0.14.6 || >0.14.6,<0.14.7 || >0.14.7,<0.14.8 || >0.14.8,<0.14.9 || >0.14.9,<0.14.10 || >0.14.10,<0.15.0
 and unstructured (0.14.0) requires Python <3.12,>=3.9.0, unstructured is forbidden.
And because unstructured (0.14.2) requires Python <3.13,>=3.9.0
 and unstructured (0.14.3) requires Python <3.13,>=3.9.0, unstructured is forbidden.
And because unstructured (0.14.4) requires Python <3.13,>=3.9.0
 and unstructured (0.14.5) requires Python <3.13,>=3.9.0, unstructured is forbidden.
And because unstructured (0.14.6) requires Python <3.13,>=3.9.0
 and unstructured (0.14.7) requires Python <3.13,>=3.9.0, unstructured is forbidden.
And because unstructured (0.14.8) requires Python <3.13,>=3.9.0
 and unstructured (0.14.9) requires Python <3.13,>=3.9.0, unstructured is forbidden.
So, because unstructured (0.14.10) requires Python <3.13,>=3.9.0
 and langflow depends on unstructured (^0.14.0), version solving failed.

  • Check your dependencies Python requirement: The Python requirement can be specified via the `python` or `markers` properties
    
    For unstructured, a possible solution would be to set the `python` property to ">=3.9,<3.12"
    For unstructured, a possible solution would be to set the `python` property to ">=3.9,<3.13"
    For unstructured, a possible solution would be to set the `python` property to ">=3.9,<3.13"
    For unstructured, a possible solution would be to set the `python` property to ">=3.9,<3.13"
    For unstructured, a possible solution would be to set the `python` property to ">=3.9,<3.13"
    For unstructured, a possible solution would be to set the `python` property to ">=3.9,<3.13"
    For unstructured, a possible solution would be to set the `python` property to ">=3.9,<3.13"
    For unstructured, a possible solution would be to set the `python` property to ">=3.9,<3.13"
    For unstructured, a possible solution would be to set the `python` property to ">=3.9,<3.13"
    For unstructured, a possible solution would be to set the `python` property to ">=3.9,<3.13"

    https://python-poetry.org/docs/dependency-specification/#python-restricted-dependencies,
    https://python-poetry.org/docs/dependency-specification/#using-environment-markers
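Following Poetry's own hint above, a hedged pyproject.toml sketch of one way to reconcile the ranges; the key names follow standard Poetry conventions and the exact bounds should be adjusted to the project's needs:

```toml
# Sketch only: narrow the project's Python range (or add a python marker on
# the dependency) so it overlaps unstructured's requirement (>=3.9.0,<3.13).
[tool.poetry.dependencies]
python = ">=3.9,<3.13"
unstructured = { version = "^0.14.0", python = ">=3.9,<3.13" }
```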


cr-gpt bot commented Dec 26, 2025

It seems you are using me, but OPENAI_API_KEY is not set in Variables/Secrets for this repo. You can follow the README for more information.
