Skip to content

Conversation

@platypii
Copy link

@platypii platypii commented Dec 9, 2024

Uses hyparquet for javascript parquet parsing. It is a small, pure js implementation of parquet parsing with no dependencies. Parquet.js that this replaces is unmaintained and has not been updated in 5+ years.

Fixes #102 and #104 by using a well-maintained parquet library that supports modern parquet files.

I tested this with the parquet file generated by together-python and confirmed that upload works and fixes issue #104.

Let me know if I can help with anything!


Note

Replace ParquetJS with hyparquet and update parquet validation to read schema and row count via hyparquet.

  • Parquet handling:
    • Update src/lib/check-file.ts _check_parquet to use hyparquet for reading metadata/schema and row count (asyncBufferFromFile, parquetMetadataAsync, parquetSchema).
    • Remove parquetjs reader usage and related close calls; adjust error message to reference hyparquet.
    • Keep column validation against PARQUET_EXPECTED_COLUMNS and minimum sample checks using metadata.num_rows.
  • Dependencies:
    • Add hyparquet@1.14.0; remove parquetjs and @types/parquetjs.

Written by Cursor Bugbot for commit 72d4550. This will update automatically on new commits. Configure here.

@platypii
Copy link
Author

@Nutlope? Anyone?

@nicolasembleton
Copy link

This should be straightforward to review.

@platypii
Copy link
Author

platypii commented Jan 2, 2025

@samselikoff?

@platypii
Copy link
Author

Here's a video showing that instructions on together.ai website fail due to parquetjs parsing error. This PR fixes this issue:

together-upload.mp4

@Nutlope @samselikoff anything I can do to help move this along?

@Nutlope
Copy link
Collaborator

Nutlope commented Jan 24, 2025

@platypii thanks so much for reporting and for the PR! We're in the process of fixing some things with the upload. @yogishbaliga, mind taking at this PR when you do your other PR on the upload functionality too?

@platypii
Copy link
Author

@yogishbaliga thoughts? happy to contribute if there's more work that needs done

@blainekasten
Copy link
Contributor

@platypii apologies for the delay on this. I'm actively maintaining these codebases now. I switched the base branch to next - could you address the conflicts and let me know when this is stable again. I would love to land this

@platypii platypii force-pushed the main branch 3 times, most recently from 2a5b13e to 6f1787b Compare November 28, 2025 19:20
@platypii
Copy link
Author

platypii commented Nov 28, 2025

@blainekasten thanks, I just rebased onto next branch, this should fix the Tokenized Data walkthrough on the together docs.

FYI there seems to be an issue with installing packages in this repo. When I run yarn install it calls the npm prepare script which includes a call to git-swap.sh which... deletes your entire local git repo and git history 😬 I THINK what you want is to move it from prepare to prepublish? I'm happy to put up another PR for that but it's not my repo so I didn't want to mess with packaging stuff. LMK.

Thanks!

@blainekasten
Copy link
Contributor

@platypii - apologies again.. seems there is a bunch of conflicts again

@blainekasten
Copy link
Contributor

Based on the current diff, I think you had a bad merge at some point. The diff looks much too big :)

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is being reviewed by Cursor Bugbot

Details

Your team is on the Bugbot Free tier. On this plan, Bugbot will review limited PRs each billing cycle for each member of your team.

To receive Bugbot reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial.


Bug: Missing error handling for uninstalled hyparquet dependency

The old code had a separate try-catch around the parquetjs import that threw a helpful error message when the library wasn't installed. The new code removes this handling and places the hyparquet import inside the general error-catching block. Since hyparquet is a devDependency (not a regular dependency in package.json), production users won't have it installed. When they try to use parquet functionality, the import will fail and they'll see a misleading error about "file corruption" instead of being told to install hyparquet. The error handling should distinguish between missing library and actual file processing errors.

src/lib/check-file.ts#L652-L693

try {
const { asyncBufferFromFile, parquetMetadataAsync, parquetSchema } = await import('hyparquet');
const asyncBuffer = await asyncBufferFromFile(file);
const metadata = await parquetMetadataAsync(asyncBuffer);
const { children } = parquetSchema(metadata);
const column_names = children.map((child: any) => child.element.name);
if (!column_names.includes('input_ids')) {
report_dict.load_parquet = `Parquet file ${file} does not contain the \`input_ids\` column.`;
report_dict.is_check_passed = false;
return report_dict;
}
for (const column_name of column_names) {
if (!PARQUET_EXPECTED_COLUMNS.includes(column_name)) {
report_dict.load_parquet = `Parquet file ${file} contains an unexpected column ${column_name}. Only columns ${PARQUET_EXPECTED_COLUMNS.join(
', ',
)} are supported.`;
report_dict.is_check_passed = false;
return report_dict;
}
}
const num_samples = Number(metadata.num_rows);
if (num_samples < MIN_SAMPLES) {
report_dict.has_min_samples = false;
report_dict.message = `Processing ${file} resulted in only ${num_samples} samples. Our minimum is ${MIN_SAMPLES} samples. `;
report_dict.is_check_passed = false;
return report_dict;
} else {
report_dict.num_samples = num_samples;
}
report_dict.is_check_passed = true;
} catch (e) {
const errorMessage = e instanceof Error ? e.message : String(e);
const stack = e instanceof Error ? e.stack : '';
report_dict.load_parquet = `An exception has occurred when loading the Parquet file ${file}. Please check the file for corruption. Exception trace:\n${errorMessage}\n${stack}`;
report_dict.is_check_passed = false;
}

Fix in Cursor Fix in Web


throw new Error(
'hyparquet is not installed and is required to use parquet files. Please install it via `npm install hyparquet`',
);
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Overly broad catch masks file parsing errors

The inner try-catch block catches all errors from asyncBufferFromFile, parquetMetadataAsync, parquetSchema, and the schema mapping—not just import failures. When a parquet file is corrupt, doesn't exist, or has an unexpected schema structure, the catch block throws a misleading "hyparquet is not installed" error instead of letting the actual error propagate to the outer catch, which would correctly report it as a file corruption issue. The original code only wrapped the import in the "not installed" try-catch.

Fix in Cursor Fix in Web

@platypii
Copy link
Author

@blainekasten I rebased and pushed up my changes, should be good now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Replace parquetjs for better deno compatibility?

4 participants