chore(deps): update dependency unstructured to ^0.14.0 [security] #29
Open
renovate[bot] wants to merge 1 commit into dev
It seems you are using me but have not set OPENAI_API_KEY in Variables/Secrets for this repo. You can follow the README for more information.
This PR contains the following updates:
^0.5.11 → ^0.14.0
GitHub Vulnerability Alerts
CVE-2024-46455
unstructured v0.14.2 and earlier is vulnerable to XML External Entity (XXE) injection via the XMLParser.
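The release notes in this PR describe the fix as setting resolve_entities=False for XML parsing with lxml. A minimal sketch of what that setting changes, assuming lxml is installed; the helper name and the sample document are hypothetical, and this is not the library's actual code:

```python
from lxml import etree

# Hypothetical sample: an internal entity whose value would be injected into the text.
SAMPLE = b'<!DOCTYPE r [<!ENTITY x "injected">]><r>&x;</r>'


def root_text(data: bytes, resolve_entities: bool):
    """Parse data with lxml and return the root element's leading text (or None)."""
    parser = etree.XMLParser(resolve_entities=resolve_entities)
    return etree.fromstring(data, parser).text


# With entity resolution on, the entity value becomes part of the document text.
expanded = root_text(SAMPLE, resolve_entities=True)
# With entity resolution off, the reference is kept as an unexpanded entity node.
untouched = root_text(SAMPLE, resolve_entities=False)
```

With resolution turned off the parser keeps the reference as a node of its own instead of substituting its value, so entity content never ends up in the extracted text.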
Release Notes
Unstructured-IO/unstructured (unstructured)
v0.14.3
Enhancements
category field from Text class to Element class.
partition_docx() now supports pluggable picture sub-partitioners. A subpartitioner that accepts a DOCX Paragraph and generates elements is now supported. This allows adding a custom sub-partitioner that extracts images and applies OCR or summarization for the image.
Features
Fixes
partition_pdf() to keep spaces in the text. The control character \t is now replaced with a space instead of being removed when merging inferred elements with embedded elements.
resolve_entities=False for XML parsing with lxml to avoid text being dynamically injected into the XML document.
form_extraction_skip_tables argument to the partition_pdf_or_image call.
table_as_cells output by default to reduce overhead in partition; now table_as_cells is only produced when the env EXTACT_TABLE_AS_CELLS is true
document_to_element_list for handling HTMLDocument. Use getattr(element, "type", "") to get the type attribute of an element when it exists. This is a more explicit way to handle the special case for HTML documents and prevents other types of attribute error from being silenced by the try block.
v0.14.2
Enhancements
Features
pinecone connector.
Fixes
v0.14.0
BREAKING CHANGES
Enhancements
partition_pdf(). Skip element sorting when determining whether embedded text can be extracted.
partition_docx(). Behavior of future enhancements may be sensitive to the partitioning strategy. Add this parameter so partition_docx() is aware of the requested strategy.
Features
NotImplementedError.
Fixes
paragraph_grouper can be set to False, but the type hint did not reflect this previously.
links is extracted during partitioning and is not needed as a parameter in partition_pdf.
partition_csv() would raise on CSV files with very long lines.
partition_doc(). Remove temporary file created but not removed when file argument is passed to partition_doc().
SyntaxError or SyntaxWarning on regex patterns. Change regex patterns to raw strings to avoid these warnings/errors in Python 3.11+.
partition_odt(). Remove temporary file created but not removed when file argument is passed to partition_odt().
v0.13.7
Enhancements
page_number metadata fields for HTML partition until we have a better strategy to decide page counting.
Features
cid characters in embedded text extracted by pdfminer.
Fixes
partition_docx() handles short table rows. The DOCX format allows a table row to start late and/or end early, meaning cells at the beginning or end of a row can be omitted. While there are legitimate uses for this capability, using it in practice is relatively rare. However, it can happen unintentionally when adjusting cell borders with the mouse. Accommodate this case and generate accurate .text and .metadata.text_as_html for these tables.
v0.13.6
Enhancements
Features
Fixes
v0.13.5
Enhancements
Features
Fixes
ListItem elements could result in reusing the same memory location which then led to unexpected side effects when updating element IDs.
v0.13.4
Enhancements
function are now deterministic and unique at the document level by default. Before, hashes were
based only on text; however, they now also take into account the element's sequence number on a
page, the page's number in the document, and the document's file name.
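A minimal sketch of how such a position-aware hash could be built, assuming the four inputs named above. The function name, field order, separator, and digest length are hypothetical and do not reflect how unstructured actually computes element IDs:

```python
import hashlib


def element_id(text: str, sequence_number: int, page_number: int, filename: str) -> str:
    """Hypothetical ID: hash the text together with its position and source file."""
    payload = f"{filename}|{page_number}|{sequence_number}|{text}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]


# The same text at two different positions on a page yields two different IDs,
# while repeating the same inputs always reproduces the same ID.
first = element_id("Total", 0, 1, "report.pdf")
second = element_id("Total", 3, 1, "report.pdf")
```

A text-only hash would give both "Total" elements the same ID; adding position and file name keeps IDs deterministic while separating repeated text.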
previously limited to local chunking using the strategies basic and by_title. Remote chunking options via the API are now accessible.
UnstructuredTableTransformerModel is able to return predicted table in cells format
Features
PDF_ANNOTATION_THRESHOLD environment variable to control the capture of embedded links in partition_pdf() for fast strategy.
Fixes
v0.13.3
Enhancements
start_index in html links extraction
strategy arg value to _PptxPartitionerOptions. This makes this partitioning option available for sub-partitioners to come that may optionally use inference or other expensive operations to improve the partitioning.
starting_page_number parameter to partitioning functions. It applies to those partitioners which support page_number in element's metadata: PDF, TIFF, XLSX, DOC, DOCX, PPT, PPTX.
unique_element_ids continues to be False by default, utilizing text hashes.
Features
Fixes
<b> tags in HTML. Now partition_html() can extract text from <b> tags inside container tags (like <div>, <pre>).
v0.13.2
Enhancements
Features
Fixes
partition failures in 0.13.1.
v0.13.1
Enhancements
Features
ElementTypes to extend future element types
Fixes
partition_html() swallowing some paragraphs. The partition_html() only considers elements with limited depth to avoid becoming the text representation of a giant div. This fix increases the limit value.
v0.13.0
Enhancements
.metadata.is_continuation to text-split chunks. .metadata.is_continuation=True is added to second-and-later chunks formed by text-splitting an oversized Table element but not to their counterpart Text element splits. Add this indicator for CompositeElement to allow text-split continuation chunks to be identified for downstream processes that may wish to skip intentionally redundant metadata values in continuation chunks.
compound_structure_acc metric to table eval. Add a new property to unstructured.metrics.table_eval.TableEvaluation: composite_structure_acc, which is computed from the element level row and column index and content accuracy scores.
metadata.orig_elements to chunks. .metadata.orig_elements: list[Element] is added to chunks during the chunking process (when requested) to allow access to information from the elements each chunk was formed from. This is useful for example to recover metadata fields that cannot be consolidated to a single value for a chunk, like page_number, coordinates, and image_base64.
--include_orig_elements option to Ingest CLI. By default, when chunking, the original elements used to form each chunk are added to chunk.metadata.orig_elements for each chunk. The include_orig_elements parameter allows the user to turn off this behavior to produce a smaller payload when they don't need this metadata.
Features
.metadata.orig_elements for each chunk. This behavior allows the text and metadata of the elements combined to make each chunk to be accessed. This can be important for example to recover metadata such as .coordinates that cannot be consolidated across elements and so is dropped from chunks. This option is controlled by the include_orig_elements parameter to partition_*() or to the chunking functions. This option defaults to True so original-elements are preserved by default. This behavior is not yet supported via the REST APIs or SDKs but will be in a closely subsequent PR to other unstructured repositories. The original elements will also not serialize or deserialize yet; this will also be added in a closely subsequent PR.
Fixes
clean_pdfminer_inner_elements() to remove only pdfminer (embedded) elements merged with inferred elements. Previously, some embedded elements were removed even if they were not merged with inferred elements. Now, only embedded elements that are already merged with inferred elements are removed.
skip_infer_table_types parameter and reflect these changes in documentation.
v0.12.6
Enhancements
partition_pdf() for fast strategy. Previously, a threshold value that affects the capture of embedded links was set to a fixed value by default. This allows users to specify the threshold value for better capturing.
add_chunking_strategy decorator to dispatch by name. Add chunk() function to be used by the add_chunking_strategy decorator to dispatch chunking call based on a chunking-strategy name (that can be dynamic at runtime). This decouples chunking dispatch from only those chunkers known at "compile" time and enables runtime registration of custom chunkers.
table_level_acc metric for table evaluation. table_level_acc now is an average of individual predicted table's accuracy. A predicted table's accuracy is defined as the sequence matching ratio between itself and its corresponding ground truth table.
Features
Fixes
.name not a local file path. When partitioning a file using the file= argument, and file is a file-like object (e.g. io.BytesIO) having a .name attribute, and the value of file.name is not a valid path to a file present on the local filesystem, FileNotFoundError is raised. This prevents use of the file.name attribute for downstream purposes to, for example, describe the source of a document retrieved from a network location via HTTP.
pandoc which does not support RTF files + instructions that will help resolve that issue.
install-pandoc Makefile recipe into relevant stages of CI workflow, ensuring it is a version that supports RTF input files.
-1.0 with np.nan and corrected rows filtering of files metrics based on that.
v0.12.5
Enhancements
Features
date_from_file_object parameter to partition. If True and if file is provided via file parameter it will cause partition to infer last modified date from file's content. If False, last modified metadata will be None.
partition_pdf with fast strategy now detects elements that are in the top or bottom 5 percent of the page as headers and footers.
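A rough sketch of a position-based rule like the header and footer detection described above, assuming a coordinate system where y grows downward from the top of the page. The function name, parameters, and labels are hypothetical and are not taken from unstructured:

```python
def classify_by_position(top: float, bottom: float, page_height: float,
                         margin: float = 0.05) -> str:
    """Label an element by its vertical extent on the page.

    Assumes y = 0 is the top edge of the page and y grows downward.
    """
    if bottom <= page_height * margin:
        # The element lies entirely within the top 5 percent band.
        return "Header"
    if top >= page_height * (1 - margin):
        # The element lies entirely within the bottom 5 percent band.
        return "Footer"
    return "Body"


header = classify_by_position(top=10, bottom=30, page_height=800)
footer = classify_by_position(top=770, bottom=790, page_height=800)
body = classify_by_position(top=100, bottom=200, page_height=800)
```

Requiring the whole element to sit inside the band is one possible choice; a rule that only checks the element's starting edge would classify more text as header or footer.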
identify_overlapping_or_nesting_case and catch_overlapping_and_nested_bboxes functions.
Fixes
partition_via_api(). Update partition_via_api() to convert all list type parameters to JSON formatted strings before calling the unstructured client SDK. This will support image block extraction via partition_via_api().
check_connection in opensearch, databricks, postgres, azure connectors
partition_xlsx() that dropped content. Algorithm for detecting "subtables" within a worksheet dropped table elements for certain patterns of populated cells such as when a trailing single-cell row appeared in a contiguous block of populated cells.
Key Concepts page.
OpenAiEmbeddingConfig to OpenAIEmbeddingConfig.
@add_chunking_strategy decorator was missing from partition_json() such that pre-partitioned documents serialized to JSON did not chunk when a chunking-strategy was specified.
v0.12.4
Enhancements
black formatting. The black library recently introduced a new major version that introduces new formatting conventions. This change brings code in the unstructured repo into compliance with the new conventions.
.p7s files. partition_email can now process .p7s files. The signature for the signed message is extracted and added to metadata.
partition_email now falls back to another valid content type if it's available.
Features
OCRAgent interface and specify it using OCR_AGENT environment variable.
Fixes
partition_pdf() not working when using chipper model with file
languages and ocr_languages. Users are regularly receiving errors on the API because they are defining ocr_languages or languages with additional quotation marks, brackets, and similar mistakes. This update handles common incorrect arguments and raises an appropriate warning.
hi_res_model_name now relies on unstructured-inference. When no explicit hi_res_model_name is passed into partition or partition_pdf_or_image the default model is picked by unstructured-inference's settings or os env variable UNSTRUCTURED_HI_RES_MODEL_NAME; it now returns the same model name regardless of infer_table_structure's value; this function will be deprecated in the future and the default model name will simply rely on unstructured-inference and will not consider os env in a future release.
dependencies and adds missing extra dependencies.
v0.12.3
Enhancements
unstructured version information to the MongoDB connector.
Features
unstructured-ingest to write partitioned data to a Databricks Volumes storage service.
Fixes
files as text.
check_connection. There was an error when trying to ls destination directory - it may not exist at the moment of connector creation. Now check_connection calls ls on bucket root and this method is called on initialize of destination connector.
setup.py is currently pointing to the wrong location for the databricks-volumes extra requirements. This results in errors when trying to build the wheel for unstructured. This change updates to point to the correct path.
v0.12.2
Enhancements
Features
Fixes
unstructured-inference version to address an index error that occurs on some tables in the table transformer object.
v0.12.0
Enhancements
3.10
v0.11.8
Enhancements
overlap_all kwarg on partition functions.
Features
Fixes
v0.11.7
Enhancements
overlap kwarg on partition functions.
Features
image_base64 and image_mime_type (if that is what the user specifies by some other param like pdf_extract_to_payload). This would allow the API to have parity with the library.
Fixes
partition_via_api. Users that self host the api were not able to pass their custom url to partition_via_api.
v0.11.6
Enhancements
final elements. The updated script also supports annotating inferred and extracted elements.
staging_for bricks
Features
text/plain and text/html.
unstructured-ingest to write partitioned/embedded data to a Chroma vector database.
Fixes
v0.11.5
Enhancements
Features
Fixes
partition_pdf() and partition_image() importation issue. Reorganize pdf.py and image.py modules to be consistent with other types of document import code.
v0.11.4
Enhancements
unstructured-inference to unstructured.
unstructured-inference to unstructured.
sensitive annotation for fields related to auth (i.e. passwords, tokens). Refactor all fsspec connectors to use explicit access configs rather than a generic dictionary.
Features
table-<pageN>-<tableN>.jpg. This filename is presented in the image_path metadata field for the Table element. The default would be to not do this.
unstructured-ingest to write partitioned data from over 20 data sources (so far) to a Weaviate object collection.
Fixes
hi_res partitioning failure when pdfminer fails. Implemented logic to fall back to the "inferred_layout + OCR" if pdfminer fails in the hi_res strategy.
tesseract can handle for ocr layout detection
partition_csv now identifies the correct delimiter before the file is processed.
hi_res occasionally pdfminer can fail to decode the text in a pdf file and return cid code as text. Now when this happens the text from OCR is used.
v0.11.2
Enhancements
Features
Fixes
v0.11.1
Enhancements
pikepdf to repair invalid PDF structure for PDFminer when we see error PSSyntaxError when PDFminer opens the document and creates the PDFminer pages object or processes a single PDF page.
Features
Fixes
<style> tags in HTML. <style> tags containing CSS in invalid positions previously contributed to element text. Do not consider text node of a <style> element as textual content.
Header and Footer document elements.
v0.11.0
Enhancements
PartitionStrategy for the strategy constants and use the constants to replace strategy strings.
DEFAULT_PADDLE_LANG before we have the language mapping for paddle.
Features
ElementMetadata instance. End-users can now add their own metadata fields simply by assigning to an element-metadata attribute-name of their choice, like element.metadata.coefficient = 0.58. These fields will round-trip through JSON and can be accessed with dotted notation.
Fixes
TYPE_TO_TEXT_ELEMENT_MAP. Updated Figure mapping from FigureCaption to Image.
pdfminer, causing partition_pdf() to fail. We expect to be able to partition smoothly using an alternative strategy if text extraction doesn't work. Added exception handling to handle unexpected errors when extracting pdf text and to help determine pdf strategy.
fast strategy fall back to ocr_only. The fast strategy should not fall back to a more expensive strategy.
languages in metadata when partitioning strategy=hi_res or fast. User defined languages was previously used for text detection, but not included in the resulting element metadata for some strategies. languages will now be included in the metadata regardless of partition strategy for pdfs and images.
KeyError: 'N'. Certain pdfs were throwing this error when being opened by pdfminer. Added a wrapper function for pdfminer that allows these documents to be partitioned.
Table chunks. Remedies repeated appearance of full .text_as_html on metadata of each TableChunk split from a Table element too large to fit in the chunking window.
<thead> or <tfoot> element was emitted as a table element having no text and unparseable HTML in element.metadata.text_as_html. Do not emit empty tables to the element stream.
element.metadata.text_as_html contains spurious <br> elements in invalid locations. The HTML generated for the text_as_html metadata for HTML tables contained <br> elements in invalid locations like between <table> and <tr>. Change the HTML generator such that these do not appear.
<thead> and <tfoot> elements are dropped. HTML table cells nested in a <thead> or <tfoot> element were not detected and the text in those cells was omitted from the table element text and .text_as_html. Detect table rows regardless of the semantic tag they may be nested in.
.text_as_html. tabulate inserts padding spaces to achieve visual alignment of columns in HTML tables it generates. Add our own HTML generator to do this simple job and omit that padding as well as newlines ("\n") used for human readability.
output-dir/input-filename.json
v0.10.30
Enhancements
check_connection() method which makes sure a valid connection can be established with the source/destination given any authentication credentials in a lightweight request.
Features
url. partition now accepts a new optional parameter request_timeout which if set will prevent any requests.get from hanging indefinitely and instead will raise a timeout error. This is useful when partitioning a url that may be slow to respond or may not respond at all.
Fixes
_determine_pdf_auto_strategy returned hi_res strategy only if infer_table_structure was true. It now returns the hi_res strategy if either infer_table_structure or extract_images_in_pdf is true.
0. A logical check is now added to avoid such error.
ocr_only. Tables that contain only numbers are returned as floats in a pandas.DataFrame when the image is converted from .image_to_data(). An AttributeError was raised downstream when trying to .strip() the floats.
w:lastRenderedPageBreak elements present in the document XML. Page breaks are NOT reliably indicated by "hard" page-breaks inserted by the author and when present are redundant to a w:lastRenderedPageBreak element so cause over-counting if used. Use rendered page-breaks only.
v0.10.29
Enhancements
SourceConnectionNetworkError custom error, which triggers the retry logic, if enabled, in the ingest pipeline.
additional_partition_args arg was added to allow users to pass in any other arguments that should be added when calling partition(). This helps keep any changes to the input parameters of the partition() exposed in the CLI.
Features
Fixes
partition_text to prevent empty elements. Adds a check to filter out empty bullets.
ocr_languages with values for languages. Some API users ran into an issue with sending languages params because the API defaulted to also using an empty string for ocr_languages. This update handles situations where languages is defined and ocr_languages is an empty string.
annots that resolved out as None. A logical check added to avoid such error.
Estimating resolution as X caused by invalid language parameters input. Proceed with default language eng when lang.py fails to find valid language code for tesseract, so that we don't pass an empty string to tesseract CLI and raise an exception downstream.
v0.10.28
Enhancements
yolox by default for table extraction when partitioning pdf/image. yolox model provides higher recall of the table regions than the quantized version and it is now the default element detection model when infer_table_structure=True for partitioning pdf/image files
hi_res some elements were extracted using pdfminer too, so we removed pdfminer from the tables pipeline to avoid duplicated elements.
unstructured-ingest to write to any of the following:
Features
ocr_only strategy in partition_pdf(). Adds the functionality to get accurate coordinate data when partitioning PDFs and Images with the ocr_only strategy.
Fixes
tables extension when instantiating the python-markdown object. Importance: This will allow users to extract structured data from tables in markdown documents.
get_uris_from_annots function tried to access the dictionary value of a string instance variable. Assign None to the annotation variable if the instance type is not dictionary to avoid the erroneous attempt.
v0.10.27
Enhancements
Features
Fixes
ebooklib as a dependency. ebooklib is licensed under AGPL3, which is incompatible with the Apache 2.0 license. Thus it is being removed.
re_download to dictate if files should be forced to redownload rather than use what might already exist locally.
v0.10.26
Enhancements
Features
overlapping_elements, overlapping_case, overlapping_percentage, largest_ngram_percentage, overlap_percentage_total, max_area, min_area, and total_area.
Fixes
Configuration
📅 Schedule: Branch creation - "" (UTC), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.