@philrhc philrhc commented Jan 22, 2026

Merges the HTTP example with the train example

Author

philrhc commented Jan 22, 2026

Hi @IAvecilla, I saw this comment: #506 (comment)
It would help me if you could give an example URL or expand on your comment in the context of this file. I've tried the HTTP template example with these parameters and it seems to work. I suppose this is not pre-processed data from Hugging Face?

cargo run --example train -- \
    --model emozilla/llama2-20m-init \
    --total-batch 2 \
    --micro-batch 1 \
    http-template \
        --template "https://huggingface.co/datasets/emozilla/fineweb-10bt-tokenized-datatrove-llama2/resolve/main/00000_{}_shuffled.ds" \
        --start 0 \
        --end 1 \
        --left-pad-zeros 5
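
As a rough illustration of what the `--template`, `--start`, `--end`, and `--left-pad-zeros` parameters above imply, here is a minimal sketch of expanding a `{}` template into zero-padded URLs. The function name `expand_template` is hypothetical and not taken from the PR, and treating `--end` as inclusive is an assumption; the thread does not say how the real provider interprets it.

```rust
/// Expand a URL template containing a single `{}` placeholder into one URL
/// per index, left-padding each index with zeros to `pad_width` digits.
///
/// Assumption: `end` is treated as inclusive here; the thread does not say
/// whether the real `--end` flag is inclusive or exclusive.
fn expand_template(template: &str, start: u32, end: u32, pad_width: usize) -> Vec<String> {
    (start..=end)
        .map(|i| {
            // Zero-pad the index, e.g. 1 -> "00001" when pad_width is 5.
            let index = format!("{:0width$}", i, width = pad_width);
            // Substitute only the first `{}` occurrence in the template.
            template.replacen("{}", &index, 1)
        })
        .collect()
}
```

Under these assumptions, calling `expand_template` with start 0, end 1, and a pad width of 5 on a template ending in `00000_{}_shuffled.ds` would yield two URLs, ending in `00000_00000_shuffled.ds` and `00000_00001_shuffled.ds`.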

philrhc changed the title from "Adds HTTP to examples train" to "Adds HTTP to train.rs example" on Jan 22, 2026
}
Err(err) => {
println!(
"Failed to load with local data provider. {err:?} Trying preprocessed data provider instead"
Author

In main, train.rs falls back to pre-processed data if loading local data fails. This change keeps that behaviour, but it might be clearer with an explicit option.

Contributor

Yeah, I think it’s a good idea to separate the possibility of using Preprocessed data. We might even end up splitting it further into Preprocessed Local (this behavior) and HTTP Preprocessed, which is the one I mentioned in the issue and whose implementation is here: #506. That should cover all the different data provider possibilities we have at the moment.
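
To make the idea of explicit provider options concrete, here is a minimal std-only sketch of selecting a data source by name instead of relying on a fallback. The `local` and `http-template` names appear in this thread; the enum, the function `parse_data_source`, and the `preprocessed-local` and `http-preprocessed` names are hypothetical and only illustrate the split suggested above.

```rust
/// Hypothetical names for the data sources discussed in this thread.
#[derive(Debug, PartialEq)]
enum DataSource {
    Local,
    PreprocessedLocal,
    HttpPreprocessed,
    HttpTemplate,
}

/// Map a subcommand name to a data source. Unknown names are an error
/// rather than a trigger for a silent fallback.
fn parse_data_source(name: &str) -> Result<DataSource, String> {
    match name {
        "local" => Ok(DataSource::Local),
        // Assumed name, not confirmed by the PR.
        "preprocessed-local" => Ok(DataSource::PreprocessedLocal),
        // Assumed name, not confirmed by the PR.
        "http-preprocessed" => Ok(DataSource::HttpPreprocessed),
        "http-template" => Ok(DataSource::HttpTemplate),
        other => Err(format!("unknown data source '{other}'")),
    }
}
```

With one variant per provider, a failed local load can report its own error instead of being retried as a different provider.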

Contributor

dsocolobsky commented Jan 27, 2026

Hey sorry for the delay! I've tested some cases and it seems to be working alright, good PR.

I was wondering whether, now that we have more options for the tool, it should show the --help text when the user gives no arguments, to let them know what is available.

So instead of

$ cargo run --example train
Caused by:
    Failed to open data directory "data": No such file or directory (os error 2) Trying preprocessed data provider instead
Error: Failed to load preprocessed data

Caused by:
    Failed to open data directory "data": No such file or directory (os error 2)

Perhaps show

$ cargo run --example train
Error: Data directory 'data' does not exist.

Usage: psyche-modeling [OPTIONS] [COMMAND]

Commands:
  local            Local directory (default behavior)
...

I think we can do it with something like

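use clap::CommandFactory; // brings `CliArgs::command()` into scope

// Bail out early with the full help text when the data directory is missing.
if !std::path::Path::new(data_path).exists() {
    eprintln!("Error: Data directory '{}' does not exist.\n", data_path);
    CliArgs::command().print_long_help()?;
    std::process::exit(1);
}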
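
One possible variation on the snippet above, sketched with the standard library only: keep the existence check in a small function that returns the error message, so the caller decides when to print the help text and exit. The function name `check_data_dir` is hypothetical and not part of the PR.

```rust
use std::path::Path;

/// Return an error message when `data_path` does not exist, so the caller
/// can print it together with the help text and choose the exit code.
fn check_data_dir(data_path: &str) -> Result<(), String> {
    if Path::new(data_path).exists() {
        Ok(())
    } else {
        Err(format!("Error: Data directory '{data_path}' does not exist."))
    }
}
```

Separating the check from `std::process::exit` keeps it easy to unit test; the caller would still print the help text and exit with a non-zero status on `Err`.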
