feat: Add MLflow artifact upload for traces and logs #440
- Add mlflow_artifacts.py with functions to collect and upload trace/log files
- Add upload_mlflow_artifacts() wrapper in global_vars.py
- Integrate artifact upload in trainer.py before MLflow run ends
- Add mlflow_upload_traces and mlflow_upload_logs config options
- Add unique timestamp-based output directories for multi-node consistency
- Pass MLflow environment variables through Docker container
Pull request overview
This PR adds functionality to automatically upload PyTorch profiler trace files and training log files to MLflow as artifacts when MLflow tracking is enabled. The implementation introduces a new module for artifact collection and upload, integrates it into the training lifecycle, and updates example scripts to support consistent output directories across multi-node training runs.
Key changes:
- New artifact upload module with functions to collect and upload trace/log files to MLflow
- Integration of artifact uploads before MLflow run completion in the trainer
- Configuration options to control trace and log uploads (defaulting to enabled)
- Shell script improvements for timestamp-based output directories with multi-node consistency
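The collect-then-upload flow summarized above can be sketched roughly as follows. This is a minimal illustration, not the code in mlflow_artifacts.py: the helper names `collect_files` and `upload_files` and the file patterns are assumptions, and the uploader is passed in as a callable so the sketch runs without an MLflow server.

```python
import glob
import os
from typing import Callable, List


def collect_files(root: str, patterns: List[str]) -> List[str]:
    """Return sorted, de-duplicated files under root matching any pattern."""
    if not root or not os.path.isdir(root):
        return []
    found = set()
    for pattern in patterns:
        # Escape the directory part so special characters in it stay literal.
        query = os.path.join(glob.escape(root), "**", pattern)
        found.update(p for p in glob.glob(query, recursive=True) if os.path.isfile(p))
    return sorted(found)


def upload_files(files: List[str], artifact_path: str,
                 log_artifact: Callable[..., None]) -> int:
    """Upload each file via log_artifact; return how many succeeded."""
    uploaded = 0
    for path in files:
        try:
            log_artifact(path, artifact_path=artifact_path)
            uploaded += 1
        except Exception as exc:  # one bad file should not stop the rest
            print(f"[warn] failed to upload {path}: {exc}")
    return uploaded
```

With MLflow installed and a run active, `mlflow.log_artifact` could be passed as the uploader, for example `upload_files(collect_files(tensorboard_dir, ["*.pt.trace.json"]), "artifacts/traces", mlflow.log_artifact)`. The `*.pt.trace.json` pattern is an assumption about how the profiler names its trace files.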
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 16 comments.
| File | Description |
|---|---|
| primus/backends/megatron/training/mlflow_artifacts.py | New module implementing trace/log file discovery and MLflow artifact upload functionality |
| primus/backends/megatron/training/global_vars.py | Adds global variable for exp_root_path and wrapper function for artifact uploads |
| primus/modules/trainer/megatron/trainer.py | Integrates artifact upload calls before MLflow run termination in two exit paths |
| primus/configs/modules/megatron/primus_megatron_module.yaml | Adds mlflow_upload_traces and mlflow_upload_logs config options (both default to true) |
| examples/run_slurm_pretrain.sh | Implements timestamp-based output directory naming and exports timestamp for multi-node consistency |
| examples/run_pretrain.sh | Adds conditional timestamp generation to support both single-node and multi-node scenarios, fixes typo in log message |
| examples/run_local_pretrain.sh | Adds MLflow environment variables and Primus path variables to Docker container environment |
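The timestamp handling described for the launch scripts (generate a timestamp once, export it, and reuse it on every node) is done in shell in this PR. The same idea is sketched here in Python for illustration only; the environment variable name `RUN_TIMESTAMP` and the timestamp format are hypothetical, not taken from the scripts.

```python
import os
import time
from typing import MutableMapping, Optional


def resolve_run_timestamp(env: Optional[MutableMapping[str, str]] = None,
                          var: str = "RUN_TIMESTAMP") -> str:
    """Reuse an exported timestamp if present, otherwise create and export one."""
    env = os.environ if env is None else env
    timestamp = env.get(var)
    if not timestamp:
        # First caller (e.g. the launcher) generates the value once.
        timestamp = time.strftime("%Y%m%d-%H%M%S")
        env[var] = timestamp
    return timestamp
```

Because later callers read the exported value instead of generating their own, every node derives the same output directory name even if the processes start a few seconds apart.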
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Force-pushed from 3c149be to 13dfa81.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
The experiment name contains square brackets, e.g. [deepseek_v2_lite-pretrain_...]-rank[0], which glob interprets as character classes, so glob.glob returned no results even though the files exist. Fixed by applying glob.escape() to directory paths before passing them to glob.glob().
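The failure mode is easy to reproduce in isolation. The sketch below uses a made-up directory name with the same bracket structure (not the real experiment name) and compares a naive glob against one whose directory part is escaped.

```python
import glob
import os
import tempfile


def compare_glob():
    """Return (naive_matches, escaped_matches) for a bracketed directory."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical name that mimics "[...]-rank[0]".
        exp_dir = os.path.join(tmp, "[demo-pretrain]-rank[0]")
        os.makedirs(exp_dir)
        open(os.path.join(exp_dir, "train.log"), "w").close()

        # "[...]" is parsed as a character class, so the real directory is missed.
        naive = glob.glob(os.path.join(exp_dir, "*.log"))
        # Escaping only the directory keeps "*.log" active as a wildcard.
        escaped = glob.glob(os.path.join(glob.escape(exp_dir), "*.log"))
        return len(naive), len(escaped)


print(compare_glob())  # prints (0, 1)
```

Only the directory portion is escaped; escaping the whole pattern would also neutralize the `*` wildcard.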
Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.
When mlflow_upload_traces or mlflow_upload_logs is True:
- Auto-enable mlflow (set disable_mlflow=False)
- Auto-enable profiling if trace upload is requested
This removes the need to explicitly set:
- --disable_mlflow=False
- --profile=True
- --use_pytorch_profiler=True
The profiler saves traces to tensorboard_dir, which is None when tensorboard is disabled. This caused a TypeError during trace save. Moved auto-enable logic before tensorboard section and added tensorboard auto-enable when mlflow_upload_traces is True.
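A rough sketch of the auto-enable logic described in these two commits, assuming the parsed arguments behave like a simple namespace. The function name `apply_mlflow_upload_defaults` and the `disable_tensorboard` attribute are hypothetical; the other flag names come from the commit messages above.

```python
from types import SimpleNamespace


def apply_mlflow_upload_defaults(args):
    """Turn on the features that trace/log upload depends on."""
    # Missing flags are treated as "not requested" rather than forced on.
    upload_traces = bool(getattr(args, "mlflow_upload_traces", False))
    upload_logs = bool(getattr(args, "mlflow_upload_logs", False))

    if upload_traces or upload_logs:
        args.disable_mlflow = False
    if upload_traces:
        # Traces only exist if the profiler actually runs.
        args.profile = True
        args.use_pytorch_profiler = True
        # Hypothetical flag: tensorboard must be on so tensorboard_dir is not None.
        args.disable_tensorboard = False
    return args


args = apply_mlflow_upload_defaults(
    SimpleNamespace(mlflow_upload_traces=True, disable_mlflow=True,
                    profile=False, use_pytorch_profiler=False,
                    disable_tensorboard=True)
)
print(args.disable_mlflow, args.profile, args.disable_tensorboard)
```

Running this step before the tensorboard setup matters because the profiler writes into tensorboard_dir, which would otherwise still be None when the trace is saved.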
Co-authored-by: Cursor <cursoragent@cursor.com>
…anks participate Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…ame process Co-authored-by: Cursor <cursoragent@cursor.com>
@wenxie-amd @Xiaoming-AMD @limou102 Could you please review?
Scope MLflow artifact imports to call sites, add exception detail and tracebacks, and avoid forcing default upload flags when args omit them. Co-authored-by: Cursor <cursoragent@cursor.com>
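One way to read "scope imports to call sites" with exception detail and tracebacks is sketched below. It is a generic illustration under stated assumptions: `try_upload` is a hypothetical helper, and the module and function to call are passed as strings so the example runs without MLflow.

```python
import importlib
import traceback


def try_upload(module_name: str, func_name: str, *args, **kwargs) -> bool:
    """Import the upload module only when needed; never raise to the caller."""
    try:
        # Importing here keeps the dependency off the module import path.
        module = importlib.import_module(module_name)
        getattr(module, func_name)(*args, **kwargs)
        return True
    except Exception as exc:
        # Report the exception type and message, then the full traceback.
        print(f"[warn] artifact upload failed: {type(exc).__name__}: {exc}")
        traceback.print_exc()
        return False
```

Returning a boolean instead of re-raising lets training shut down cleanly even when the upload step fails, while the printed traceback keeps the failure diagnosable.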
Align Slurm EXP defaults with local training, ensure finalize barriers wait for uploads, and update tests to match handled exceptions. Co-authored-by: Cursor <cursoragent@cursor.com>
feat: Add MLflow artifact upload for traces and logs
Adds functionality to automatically upload profiler trace files and training log files
to MLflow as artifacts when MLflow tracking is enabled.
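Features
- Profiler trace files are uploaded to artifacts/traces/
- Training log files are uploaded to artifacts/logs/
Config Options
- mlflow_upload_traces
- mlflow_upload_logs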
Usage
When MLflow is enabled, artifacts are automatically uploaded at the end of training:
- tensorboard_dir → MLflow artifacts/traces/
- exp_root_path/logs/ → MLflow artifacts/logs/
Example
Code
```bash
export CONFIG_NAME="deepseek_v2_lite-FP8-pretrain"
export EXP="examples/megatron/configs/MI300X/${CONFIG_NAME}.yaml"
export MLFLOW_RUN_NAME="${CONFIG_NAME}.single-node.baseline"

# Short profiled run that uploads traces and logs to MLflow
bash ./examples/run_pretrain.sh \
  --train_iters=5 \
  --profile_step_start=2 \
  --profile_step_end=4 \
  --profile_ranks=ALL \
  --mlflow_run_name="${MLFLOW_RUN_NAME}" \
  --mlflow_experiment_name=/Performance-data/Megatron-LM/primus-test \
  --mlflow_upload_performance_metrics=True \
  --mlflow_upload_traces=True \
  --mlflow_upload_logs=True
```
Output