Pull Request Overview
This PR refactors the client to use a new backend API structure, introduces two new parsing modes (parse_pro and parse_textract), and updates example notebooks and the package version.
- Updated API endpoints and default URLs in sync and async parsers
- Added `ParseProSyncParser` and `ParseTextractSyncParser` support
- Revamped async job submission/fetch logic and bumped version to 0.0.25
Reviewed Changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| pyproject.toml | Bump package version to 0.0.25 |
| examples/*.ipynb | Updated execution counts, environment var names, removed outdated output cells |
| any_parser/constants.py | Added default base URLs, timeout constant |
| any_parser/sync_parser.py | Changed payload handling, updated sync endpoints, added new parser classes |
| any_parser/async_parser.py | Simplified async request logic, introduced job status API |
| any_parser/any_parser.py | Added high-level parse_pro/parse_textract; updated decorator logic and response handling |
| any_parser/__init__.py | Bump `__version__` to 0.0.25 |
Comments suppressed due to low confidence (3)
any_parser/any_parser.py:108
- The docstring mentions a `batch_url` parameter that isn't in the `__init__` signature. Update the signature or remove the doc entry to keep them in sync.
batch_url: Batch API endpoint URL, defaults to public batch endpoint
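A mismatch like this one can be caught mechanically. The following is a minimal sketch, assuming a Google-style `Args:` section in the docstring; `ExampleParser` and `documented_but_missing` are hypothetical names for illustration and are not taken from the library.

```python
import inspect
import re


class ExampleParser:
    """Hypothetical stand-in for the reviewed class; not the library's code."""

    def __init__(self, api_key, base_url=None):
        """Initialize the parser.

        Args:
            api_key: API key used for authentication
            base_url: API endpoint URL
            batch_url: Batch API endpoint URL
        """
        self.api_key = api_key
        self.base_url = base_url


def documented_but_missing(func):
    """Return names listed under 'Args:' that are absent from the signature."""
    doc = inspect.getdoc(func) or ""
    # Capture the text after the 'Args:' header, up to a blank line or the end.
    match = re.search(r"^Args:\n(.*?)(?:\n\s*\n|\Z)", doc, flags=re.S | re.M)
    if not match:
        return []
    # Each documented parameter starts a line as 'name:'.
    documented = re.findall(r"^\s*(\w+):", match.group(1), flags=re.M)
    actual = set(inspect.signature(func).parameters)
    return [name for name in documented if name not in actual]


stale = documented_but_missing(ExampleParser.__init__)
```

A check of this kind could run in a unit test so that a stale docstring entry fails the build instead of waiting for review.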
any_parser/any_parser.py:260
- [nitpick] The variables `extracted_result` and `extracted_html` are used interchangeably, which can be confusing. Consider using a single, consistent name for the parsed tables payload.
extracted_result, time_elapsed = self._sync_extract_tables.extract(
any_parser/sync_parser.py:28
- Flattening `extract_args` directly into `payload` changes the JSON structure expected by the backend. To preserve nesting, use `payload["extract_args"] = extract_args`.
payload.update(extract_args)
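To make the difference concrete, here is a small sketch of the two payload shapes. The function names and the keys `file_content` and `mode` are assumptions for illustration. Which shape the backend actually expects is not stated in this thread.

```python
def build_payload_flat(file_content, extract_args=None):
    """Merge extract_args into the top level of the payload."""
    payload = {"file_content": file_content}
    if extract_args:
        # Keys in extract_args can silently overwrite existing payload keys.
        payload.update(extract_args)
    return payload


def build_payload_nested(file_content, extract_args=None):
    """Keep extract_args under its own key, preserving the nesting."""
    payload = {"file_content": file_content}
    if extract_args:
        payload["extract_args"] = extract_args
    return payload


# Hypothetical arguments, including a key that collides with the payload.
args = {"mode": "example", "file_content": "collides"}

flat = build_payload_flat("base64-data", args)
nested = build_payload_nested("base64-data", args)
```

In the flat version the colliding `file_content` key replaces the original value, while the nested version keeps both intact. That collision risk is one reason to choose the shape deliberately.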
any_parser/constants.py (outdated)
# Default URLs for AnyParser
PUBLIC_SHARED_BASE_URL = "https://anyparser.cambioml.com/api/v1"
PUBLIC_BATCH_BASE_URL = "http://batch-api.cambioml.com"  # TODO: Fix Later
Using HTTP for PUBLIC_BATCH_BASE_URL may expose API credentials or data in transit. Consider switching to HTTPS for secure communications.
Suggested change:
- PUBLIC_BATCH_BASE_URL = "http://batch-api.cambioml.com"  # TODO: Fix Later
+ PUBLIC_BATCH_BASE_URL = "https://batch-api.cambioml.com"  # Updated to use HTTPS
We still need to decide whether to keep the batch API. The new backend does not support it for now, so it is archived here. cc @lingjiekong
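If the batch endpoint is kept, one way to guard against a plain-HTTP default is to validate the scheme when the URL is configured. This is only a sketch using the standard library. `require_https` is a hypothetical helper, not part of `any_parser`.

```python
from urllib.parse import urlparse


def require_https(url: str) -> str:
    """Return the URL unchanged if it uses HTTPS, otherwise raise ValueError."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError(f"Expected an https:// URL, got: {url!r}")
    return url


# The HTTPS form from the suggested change passes the check.
checked = require_https("https://batch-api.cambioml.com")
```

Calling `require_https` on the current `http://` value would raise `ValueError`, so credentials are never sent before the misconfiguration is noticed.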
Description
This pull request introduces several enhancements and refactorings to the `any_parser` library, including the addition of new parsing methods, improved handling of extraction instructions, and updates to asynchronous job management. The changes aim to expand functionality, improve code clarity, and address edge cases in data processing.
New Features and Enhancements
- Added `ParseProSyncParser` and `ParseTextractSyncParser`, enabling multi-language support and AWS Textract integration.
- Added `async_parse_pro` and `async_parse_textract` for handling advanced models and AWS Textract asynchronously.
Refactoring and Improvements
- Updated the `extract_key_value` method to support both dictionary and list formats for extraction instructions, improving flexibility and usability.
- Updated the `extract_tables` method to handle new result formats, such as dictionaries with a `markdown` key, and added fallback mechanisms for edge cases like missing tables.
Asynchronous Job Management
- Replaced the `async_fetch` method with `get_job_status` for better handling of async job statuses, including support for presigned URLs and inline results.
Miscellaneous Updates
- Bumped the version to `0.0.25` in `__init__.py`.
- [...] `requests` and `json`.
Related Issue
Type of Change
How Has This Been Tested?
Screenshots (if applicable)
Checklist
Additional Notes
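The description mentions that `extract_tables` now handles dictionaries with a `markdown` key and falls back gracefully when tables are missing. A minimal sketch of that kind of normalization follows. The function name and the accepted result shapes other than the `markdown` dictionary are assumptions, not the library's actual code.

```python
def normalize_tables_result(result):
    """Reduce several possible result shapes to a single string.

    Assumed shapes (hypothetical): a dict with a 'markdown' key,
    a list of table strings, a plain string, or nothing at all.
    """
    if isinstance(result, dict):
        value = result.get("markdown")
        return value if isinstance(value, str) else ""
    if isinstance(result, list):
        # Skip empty entries and join the remaining tables with blank lines.
        return "\n\n".join(str(item) for item in result if item)
    if isinstance(result, str):
        return result
    # Fallback for missing tables: return an empty string instead of raising.
    return ""


from_dict = normalize_tables_result({"markdown": "| a | b |"})
from_list = normalize_tables_result(["<table>1</table>", "", "<table>2</table>"])
from_none = normalize_tables_result(None)
```

Returning an empty string for the missing-table case keeps callers simple, at the cost of hiding the difference between "no tables found" and "unexpected response". A caller that needs that distinction could return `None` instead.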