Update the DI to CU Labeled data migration code to work with CU GA #128
Open
jfilcik wants to merge 10 commits into Azure-Samples:main
… scripts with "knowledge source" property from the GA API
aainav269
approved these changes
Dec 17, 2025
aainav269
reviewed
Dec 18, 2025
Contributor
Do we need this file? I saw that this was from an October commit
aainav269
reviewed
Dec 18, 2025
# Document Intelligence to Content Understanding Migration Tool (Python)

- Welcome! This tool helps convert your Document Intelligence (DI) datasets to the Content Understanding (CU) **Preview.2** 2025-05-01-preview format, as used in AI Foundry. The following DI versions are supported:
+ Welcome! This tool helps convert your Document Intelligence (DI) datasets to the Content Understanding (CU) **GA** 2025-11-01 format, as used in AI Foundry. The following DI versions are supported:
Contributor
I think it would be best to support both CU Preview and GA conversions?
3. Configure permissions and expiry for your SAS URL as follows:
   - For the **DI source dataset**, please select permissions: _**Read & List**_
| https://jfilcikditestdata.blob.core.windows.net/didata?sv=2025-07-05&spr=https&st=2025-12-16T22%3A17%3A06Z&se=2025-12-17T22%3A17%3A06Z&sr=c&sp=rl&sig=nvUIelZQ9yWEJx3jA%2FjUOIdHn6OVnp5gvKSJ3zgzwvE%3D |
Member
Need to remove this secret SAS URL.
   - For the **CU target dataset**, please select permissions: _**Read, Add, Create, & Write**_
| https://jfilcikditestdata.blob.core.windows.net/cudata?sv=2025-07-05&spr=https&st=2025-12-16T22%3A19%3A39Z&se=2025-12-17T22%3A19%3A39Z&sr=c&sp=racwl&sig=K82dxEFNpYhuf5JRq3xJ4vc5SYE8A7FfsBnTJbB1VJY%3D |
Member
We won't want to check in the secret blob SAS URL.
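One way to keep a usable example in the README without exposing the token is to redact the `sig` query parameter before committing. The following is a minimal sketch using only the Python standard library; the helper name `redact_sas_signature` and the example URL are hypothetical and not part of this PR.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def redact_sas_signature(url, placeholder="REDACTED"):
    """Return the URL with the value of its `sig` query parameter replaced."""
    parts = urlsplit(url)
    # Decode the query string, swap the signature value, and keep every other pair.
    query = [
        (key, placeholder if key == "sig" else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical URL used only to illustrate the helper.
example = "https://example.blob.core.windows.net/data?sp=rl&sr=c&sig=abc%2F123%3D"
redacted = redact_sas_signature(example)
```

Redacting only hides the value in the text that gets committed. A signature that has already been pushed should be treated as exposed.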
- "baseAnalyzerId": "prebuilt-documentAnalyzer",
+ "baseAnalyzerId": "prebuilt-document",
+ "models": {
+   "completion": "gpt-4.1",
Member
We will need to set completion and embedding models in multiple converters. It will be good to have these default values in constants.py to be reused. Another possible option could be allowing the users to put these as arguments when running the converter.
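The two options in this comment could be combined roughly as follows. This is a sketch only: the constant names, the command-line flag names, and the embedding model value are assumptions for illustration, and only the `gpt-4.1` completion default comes from the diff.

```python
import argparse

# Hypothetical contents for constants.py. "gpt-4.1" is the completion model
# shown in the diff; the embedding model name is a placeholder assumption.
DEFAULT_COMPLETION_MODEL = "gpt-4.1"
DEFAULT_EMBEDDING_MODEL = "text-embedding-3-large"


def build_models_section(completion=DEFAULT_COMPLETION_MODEL,
                         embedding=DEFAULT_EMBEDDING_MODEL):
    """Return the "models" object that a converter writes into an analyzer."""
    return {"completion": completion, "embedding": embedding}


def parse_model_args(argv=None):
    """Parse optional model overrides; the flag names are illustrative only."""
    parser = argparse.ArgumentParser(description="DI to CU converter (sketch)")
    parser.add_argument("--completion-model", default=DEFAULT_COMPLETION_MODEL)
    parser.add_argument("--embedding-model", default=DEFAULT_EMBEDDING_MODEL)
    return parser.parse_args(argv)


# A user overrides only the completion model; the embedding default is reused.
args = parse_model_args(["--completion-model", "my-deployment"])
models = build_models_section(args.completion_model, args.embedding_model)
```

With this shape, each converter would call `build_models_section` instead of hard-coding model names, and the defaults would live in one place.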
…ers. - Created a Jupyter notebook to demonstrate processing legal documents with Azure Content Understanding. - Implemented functionality to reflow output to include inline line numbers for better citation and reference. - Included detailed markdown explanations and code for loading, analyzing, and reflowing legal transcripts.
Updated the DI to CU labeled data migration tool from Azure AI Content Understanding Preview API (2025-05-01-preview) to General Availability API (2025-11-01).
Key Changes Implemented
- Changed API version from `2025-05-01-preview` to `2025-11-01` across all configuration files
- Updated `.sample_env` template with new GA version
- Updated 18 analyzer template JSON files with GA API requirements:
  - Added `models` section: required object specifying completion and embedding models
  - Updated `baseAnalyzerId` naming: changed from preview naming (e.g., `prebuilt-documentAnalyzer`) to GA naming (e.g., `prebuilt-document`)
  - Removed deprecated properties: eliminated `scenario` property and pro mode configurations not supported in GA
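The template changes listed above can be sketched as one transformation over a template dictionary. This is an illustration under stated assumptions, not the code in this PR: the function name `migrate_template`, the `embedding` key name, the embedding placeholder value, and the sample `scenario` value are hypothetical, and pro mode configuration is not handled because its property names are not given here.

```python
import copy

# Rename taken from the PR description; other preview IDs would need their own entries.
BASE_ANALYZER_RENAMES = {"prebuilt-documentAnalyzer": "prebuilt-document"}
# Top-level properties the description says were removed for GA.
DEPRECATED_KEYS = ("scenario",)


def migrate_template(template, completion="gpt-4.1",
                     embedding="<embedding-deployment>"):
    """Return a GA-style copy of a preview analyzer template (sketch)."""
    migrated = copy.deepcopy(template)  # leave the caller's template untouched
    base_id = migrated.get("baseAnalyzerId")
    if base_id in BASE_ANALYZER_RENAMES:
        migrated["baseAnalyzerId"] = BASE_ANALYZER_RENAMES[base_id]
    for key in DEPRECATED_KEYS:
        migrated.pop(key, None)
    # Add the models object only when the template does not already define one.
    migrated.setdefault("models", {"completion": completion, "embedding": embedding})
    return migrated


# Hypothetical preview template; the "scenario" value is made up for the example.
preview = {
    "baseAnalyzerId": "prebuilt-documentAnalyzer",
    "scenario": "document",
    "fieldSchema": {},
}
ga = migrate_template(preview)
```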
3. OCR Extraction Improvements
- Modified `get_ocr.py` to use the `prebuilt-read` analyzer directly instead of creating temporary analyzers
- Streamlined layout result generation by calling the built-in analyzer API endpoint
4. Training Data Integration
- Updated converter files (`cu_converter_generative.py`, `cu_converter_neural.py`) to add a training data reference during conversion
- Implemented the correct `knowledgeSources` array format per the GA API specification
- Added optional parameters for target container SAS URL and blob folder to support training data linking
5. Type Safety
- Added missing `Optional` import to `di_to_cu_converter.py` for proper type hints
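For context, optional parameters like the target container SAS URL and blob folder need `Optional` from `typing` when annotated in this style. The sketch below shows the pattern only; the function name and its behavior are hypothetical and are not taken from `di_to_cu_converter.py`.

```python
from typing import Optional


def describe_training_data_link(
    target_container_sas_url: Optional[str] = None,
    target_blob_folder: Optional[str] = None,
) -> Optional[str]:
    """Return the training data location without its SAS token, or None."""
    if target_container_sas_url is None:
        return None  # no container given, so no training data is linked
    base = target_container_sas_url.split("?", 1)[0]  # drop the SAS query string
    folder = (target_blob_folder or "").strip("/")
    return f"{base}/{folder}" if folder else base
```

Without the import, a module-level annotation such as `Optional[str]` raises `NameError` when the module is loaded, unless annotation evaluation is postponed.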
6. Documentation Updates
- Updated `README.md` to reflect GA API version and documentation links
- Added prerequisite warning about configuring default model deployments before migration
- Clarified analyzer creation requirements and setup steps