-
Notifications
You must be signed in to change notification settings - Fork 7
Replace parquet.js with hyparquet #105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: next
Are you sure you want to change the base?
Conversation
|
@Nutlope? Anyone? |
|
This should be straightforward to review. |
|
Here's a video showing that instructions on together.ai website fail due to parquetjs parsing error. This PR fixes this issue: together-upload.mp4@Nutlope @samselikoff anything I can do to help move this along? |
|
@platypii thanks so much for reporting and for the PR! We're in the process of fixing some things with the upload. @yogishbaliga, mind taking at this PR when you do your other PR on the upload functionality too? |
|
@yogishbaliga thoughts? happy to contribute if there's more work that needs done |
|
@platypii apologies for the delay on this. I'm actively maintaining these codebases now. I switched the base branch to |
2a5b13e to
6f1787b
Compare
|
@blainekasten thanks, I just rebased onto FYI there seems to be an issue with installing packages in this repo. When I run Thanks! |
10df0df to
7e75fa7
Compare
|
@platypii - apologies again.. seems there is a bunch of conflicts again |
|
Based on the current diff, I think you had a bad merge at some point. The diff looks much too big :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is being reviewed by Cursor Bugbot
Details
Your team is on the Bugbot Free tier. On this plan, Bugbot will review limited PRs each billing cycle for each member of your team.
To receive Bugbot reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial.
Bug: Missing error handling for uninstalled hyparquet dependency
The old code had a separate try-catch around the parquetjs import that threw a helpful error message when the library wasn't installed. The new code removes this handling and places the hyparquet import inside the general error-catching block. Since hyparquet is a devDependency (not a regular dependency in package.json), production users won't have it installed. When they try to use parquet functionality, the import will fail and they'll see a misleading error about "file corruption" instead of being told to install hyparquet. The error handling should distinguish between missing library and actual file processing errors.
src/lib/check-file.ts#L652-L693
together-typescript/src/lib/check-file.ts
Lines 652 to 693 in 91931b5
| try { | |
| const { asyncBufferFromFile, parquetMetadataAsync, parquetSchema } = await import('hyparquet'); | |
| const asyncBuffer = await asyncBufferFromFile(file); | |
| const metadata = await parquetMetadataAsync(asyncBuffer); | |
| const { children } = parquetSchema(metadata); | |
| const column_names = children.map((child: any) => child.element.name); | |
| if (!column_names.includes('input_ids')) { | |
| report_dict.load_parquet = `Parquet file ${file} does not contain the \`input_ids\` column.`; | |
| report_dict.is_check_passed = false; | |
| return report_dict; | |
| } | |
| for (const column_name of column_names) { | |
| if (!PARQUET_EXPECTED_COLUMNS.includes(column_name)) { | |
| report_dict.load_parquet = `Parquet file ${file} contains an unexpected column ${column_name}. Only columns ${PARQUET_EXPECTED_COLUMNS.join( | |
| ', ', | |
| )} are supported.`; | |
| report_dict.is_check_passed = false; | |
| return report_dict; | |
| } | |
| } | |
| const num_samples = Number(metadata.num_rows); | |
| if (num_samples < MIN_SAMPLES) { | |
| report_dict.has_min_samples = false; | |
| report_dict.message = `Processing ${file} resulted in only ${num_samples} samples. Our minimum is ${MIN_SAMPLES} samples. `; | |
| report_dict.is_check_passed = false; | |
| return report_dict; | |
| } else { | |
| report_dict.num_samples = num_samples; | |
| } | |
| report_dict.is_check_passed = true; | |
| } catch (e) { | |
| const errorMessage = e instanceof Error ? e.message : String(e); | |
| const stack = e instanceof Error ? e.stack : ''; | |
| report_dict.load_parquet = `An exception has occurred when loading the Parquet file ${file}. Please check the file for corruption. Exception trace:\n${errorMessage}\n${stack}`; | |
| report_dict.is_check_passed = false; | |
| } |
| throw new Error( | ||
| 'hyparquet is not installed and is required to use parquet files. Please install it via `npm install hyparquet`', | ||
| ); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Overly broad catch masks file parsing errors
The inner try-catch block catches all errors from asyncBufferFromFile, parquetMetadataAsync, parquetSchema, and the schema mapping—not just import failures. When a parquet file is corrupt, doesn't exist, or has an unexpected schema structure, the catch block throws a misleading "hyparquet is not installed" error instead of letting the actual error propagate to the outer catch, which would correctly report it as a file corruption issue. The original code only wrapped the import in the "not installed" try-catch.
|
@blainekasten I rebased and pushed up my changes, should be good now |
94f6844 to
09e42af
Compare
Uses hyparquet for javascript parquet parsing. It is a small, pure js implementation of parquet parsing with no dependencies. Parquet.js that this replaces is unmaintained and has not been updated in 5+ years.
Fixes #102 and #104 by using a well-maintained parquet library that supports modern parquet files.
I tested this with the parquet file generated by
together-pythonand confirmed that upload works and fixes issue #104.Let me know if I can help with anything!
Note
Replace ParquetJS with hyparquet and update parquet validation to read schema and row count via hyparquet.
src/lib/check-file.ts_check_parquetto usehyparquetfor reading metadata/schema and row count (asyncBufferFromFile,parquetMetadataAsync,parquetSchema).parquetjsreader usage and related close calls; adjust error message to referencehyparquet.PARQUET_EXPECTED_COLUMNSand minimum sample checks usingmetadata.num_rows.hyparquet@1.14.0; removeparquetjsand@types/parquetjs.Written by Cursor Bugbot for commit 72d4550. This will update automatically on new commits. Configure here.