This repository was archived by the owner on Dec 15, 2025. It is now read-only.

refactor with new backend #86

Merged
boqiny merged 7 commits into main from refactor-0616
Jun 17, 2025

@boqiny boqiny commented Jun 17, 2025

Description

This pull request introduces several enhancements and refactorings to the any_parser library, including the addition of new parsing methods, improved handling of extraction instructions, and updates to asynchronous job management. The changes aim to expand functionality, improve code clarity, and address edge cases in data processing.

New Features and Enhancements

  • Added support for synchronous parsing using ParseProSyncParser and ParseTextractSyncParser, enabling multi-language support and AWS Textract integration.
  • Introduced asynchronous parsing methods async_parse_pro and async_parse_textract for handling advanced models and AWS Textract asynchronously.

Refactoring and Improvements

  • Enhanced extract_key_value method to support both dictionary and list formats for extraction instructions, improving flexibility and usability.
  • Updated extract_tables method to handle new result formats, such as dictionaries with a markdown key, and added fallback mechanisms for edge cases like missing tables.
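
To make the two refactoring items above concrete, here is a minimal sketch of the kind of normalization they describe. It is not the library's code: the helper names `normalize_instructions` and `tables_from_result`, the `{"key": ..., "description": ...}` shape of list entries, and the empty-string fallback are all assumptions made for illustration. Only the dictionary/list input formats and the `markdown` key come from the description.

```python
from typing import Any, Dict, List, Union

# Hypothetical type alias: instructions arrive either as {key: description}
# or as a list of dicts. The list entry schema below is an assumption.
Instruction = Union[Dict[str, str], List[Dict[str, str]]]


def normalize_instructions(instruction: Instruction) -> List[Dict[str, str]]:
    """Return instructions as a list of {"key": ..., "description": ...} dicts."""
    if isinstance(instruction, dict):
        return [{"key": k, "description": v} for k, v in instruction.items()]
    if isinstance(instruction, list):
        # Copy each entry so the caller's objects are not mutated later.
        return [dict(item) for item in instruction]
    raise TypeError("extraction instructions must be a dict or a list of dicts")


def tables_from_result(result: Any) -> str:
    """Pull table text out of a result, falling back to an empty string."""
    if isinstance(result, dict):
        # New-style result: a dictionary carrying a "markdown" key.
        return result.get("markdown") or ""
    if isinstance(result, str):
        # Older-style result: the table text itself.
        return result
    # Missing tables or an unexpected shape: fall back instead of raising.
    return ""
```

Normalizing at the boundary keeps the rest of the client working with one shape, which is the usual motivation for accepting several input formats.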

Asynchronous Job Management

  • Replaced the legacy async_fetch method with get_job_status for better handling of async job statuses, including support for presigned URLs and inline results.
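
As a rough illustration of what handling "presigned URLs and inline results" can look like, the sketch below resolves a job-status payload into a result. The payload field names (`status`, `result`, `result_url`), the `"completed"` value, and the helper `resolve_job_result` are assumptions, not the actual get_job_status contract. The download step is passed in as a function so the logic can be exercised without network access.

```python
from typing import Any, Callable, Dict, Optional


def resolve_job_result(
    status: Dict[str, Any],
    fetch: Callable[[str], str],
) -> Optional[str]:
    """Return the job result if the job has finished, else None.

    `status` is a hypothetical job-status payload; its field names are
    assumptions for illustration. `fetch` downloads the body behind a URL.
    """
    if status.get("status") != "completed":
        # Still pending or failed; the caller decides whether to poll again.
        return None
    if "result" in status:
        # Inline result: the payload already carries the parsed content.
        return status["result"]
    if "result_url" in status:
        # Presigned URL: the content has to be downloaded separately.
        return fetch(status["result_url"])
    return None
```

A caller would typically invoke this in a polling loop with a delay between attempts, stopping once a non-None value comes back.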

Miscellaneous Updates

  • Updated the library version to 0.0.25 in __init__.py.
  • Simplified imports and removed unused dependencies like requests and json.

Related Issue

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code refactoring
  • Performance improvement

How Has This Been Tested?

Screenshots (if applicable)

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Additional Notes

Copilot AI review requested due to automatic review settings June 17, 2025 04:59
@github-actions

Failed to generate code suggestions for PR

Copilot AI left a comment

Pull Request Overview

This PR refactors the client to use a new backend API structure, introduces two new parsing modes (parse_pro and parse_textract), and updates example notebooks and the package version.

  • Updated API endpoints and default URLs in sync and async parsers
  • Added ParseProSyncParser and ParseTextractSyncParser support
  • Revamped async job submission/fetch logic and bumped version to 0.0.25

Reviewed Changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| pyproject.toml | Bump package version to 0.0.25 |
| examples/*.ipynb | Updated execution counts, environment var names, removed outdated output cells |
| any_parser/constants.py | Added default base URLs, timeout constant |
| any_parser/sync_parser.py | Changed payload handling, updated sync endpoints, added new parser classes |
| any_parser/async_parser.py | Simplified async request logic, introduced job status API |
| any_parser/any_parser.py | Added high-level parse_pro/parse_textract; updated decorator logic and response handling |
| any_parser/__init__.py | Bump __version__ to 0.0.25 |
Comments suppressed due to low confidence (3)

any_parser/any_parser.py:108

  • The docstring mentions a batch_url parameter that isn’t in the __init__ signature. Update the signature or remove the doc entry to keep them in sync.
        batch_url: Batch API endpoint URL, defaults to public batch endpoint

any_parser/any_parser.py:260

  • [nitpick] The variables extracted_result and extracted_html are used interchangeably, which can be confusing. Consider using a single, consistent name for the parsed tables payload.
        extracted_result, time_elapsed = self._sync_extract_tables.extract(

any_parser/sync_parser.py:28

  • Flattening extract_args directly into payload changes the JSON structure expected by the backend. To preserve nesting, use payload["extract_args"] = extract_args.
payload.update(extract_args)
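
The difference between the two forms in the last comment can be seen in a few lines. The payload keys and the `example_flag` option below are made up for illustration. Which structure the backend actually expects is not stated in this thread.

```python
# Hypothetical base payload and options; the key names are assumptions.
base = {"file_content": "<base64>", "file_type": "pdf"}
extract_args = {"example_flag": True}

# Flattened: every option becomes a top-level key of the payload.
flat = dict(base)
flat.update(extract_args)

# Nested: options stay grouped under a single "extract_args" key.
nested = dict(base)
nested["extract_args"] = extract_args

print(sorted(flat))    # top-level keys of the flattened payload
print(sorted(nested))  # top-level keys of the nested payload
```

Note that `update` silently overwrites any existing top-level key that shares a name with an option, which the nested form avoids.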


# Default URLs for AnyParser
PUBLIC_SHARED_BASE_URL = "https://anyparser.cambioml.com/api/v1"
PUBLIC_BATCH_BASE_URL = "http://batch-api.cambioml.com" # TODO: Fix Later

Copilot AI Jun 17, 2025

Using HTTP for PUBLIC_BATCH_BASE_URL may expose API credentials or data in transit. Consider switching to HTTPS for secure communications.

Suggested change
- PUBLIC_BATCH_BASE_URL = "http://batch-api.cambioml.com" # TODO: Fix Later
+ PUBLIC_BATCH_BASE_URL = "https://batch-api.cambioml.com" # Updated to use HTTPS
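
One way to guard against a plain-HTTP base URL slipping in is a small scheme check, sketched here with the standard library. The helper name `is_https` is hypothetical and nothing in this PR indicates the library performs such a check.

```python
from urllib.parse import urlparse


def is_https(url: str) -> bool:
    """Return True only when the URL uses the https scheme."""
    return urlparse(url).scheme == "https"


# The two URLs from the suggestion above.
print(is_https("http://batch-api.cambioml.com"))
print(is_https("https://batch-api.cambioml.com"))
```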

Contributor Author

We need to decide whether or not to keep the batch API. The new backend does not support it for now, so it is archived here. cc @lingjiekong

Member

@lingjiekong lingjiekong left a comment

LGTM.

@boqiny boqiny merged commit ab8d754 into main Jun 17, 2025
1 check passed