Skip to content

feat: Add MLflow artifact upload for traces and logs#440

Open
gphuang wants to merge 42 commits intomainfrom
feat/6-enable-mlflow-uploading
Open

feat: Add MLflow artifact upload for traces and logs#440
gphuang wants to merge 42 commits intomainfrom
feat/6-enable-mlflow-uploading

Conversation

@gphuang
Copy link

@gphuang gphuang commented Dec 18, 2025

feat: Add MLflow artifact upload for traces and logs

Adds functionality to automatically upload profiler trace files and training log files
to MLflow as artifacts when MLflow tracking is enabled.

Features

  • Upload PyTorch profiler trace files to MLflow artifacts/traces/
  • Upload training log files to MLflow artifacts/logs/
  • Unique timestamp-based output directories for multi-node consistency
  • Pass MLflow environment variables through Docker container

Config Options

  • mlflow_upload_traces: true # Upload profiler trace files to MLflow
  • mlflow_upload_logs: true # Upload training log files to MLflow

Usage

When MLflow is enabled, artifacts are automatically uploaded at the end of training:

  • Trace files from tensorboard_dir → MLflow artifacts/traces/
  • Log files from exp_root_path/logs/ → MLflow artifacts/logs/

Example

Code
`
export CONFIG_NAME="deepseek_v2_lite-FP8-pretrain"

export EXP="examples/megatron/configs/MI300X/${CONFIG_NAME}.yaml"

export MLFLOW_RUN_NAME="${CONFIG_NAME}.single-node.baseline"

bash ./examples/run_pretrain.sh
--train_iters=5
--profile_step_start=2
--profile_step_end=4
--profile_ranks=ALL
--mlflow_run_name=${MLFLOW_RUN_NAME}
--mlflow_experiment_name=/Performance-data/Megatron-LM/primus-test
--mlflow_upload_performance_metrics=True
--mlflow_upload_traces=True
--mlflow_upload_logs=True
`

Output

- Add mlflow_artifacts.py with functions to collect and upload trace/log files
- Add upload_mlflow_artifacts() wrapper in global_vars.py
- Integrate artifact upload in trainer.py before MLflow run ends
- Add mlflow_upload_traces and mlflow_upload_logs config options
- Add unique timestamp-based output directories for multi-node consistency
- Pass MLflow environment variables through Docker container
Copilot AI review requested due to automatic review settings December 18, 2025 09:10
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds functionality to automatically upload PyTorch profiler trace files and training log files to MLflow as artifacts when MLflow tracking is enabled. The implementation introduces a new module for artifact collection and upload, integrates it into the training lifecycle, and updates example scripts to support consistent output directories across multi-node training runs.

Key changes:

  • New artifact upload module with functions to collect and upload trace/log files to MLflow
  • Integration of artifact uploads before MLflow run completion in the trainer
  • Configuration options to control trace and log uploads (defaulting to enabled)
  • Shell script improvements for timestamp-based output directories with multi-node consistency

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 16 comments.

Show a summary per file
File Description
primus/backends/megatron/training/mlflow_artifacts.py New module implementing trace/log file discovery and MLflow artifact upload functionality
primus/backends/megatron/training/global_vars.py Adds global variable for exp_root_path and wrapper function for artifact uploads
primus/modules/trainer/megatron/trainer.py Integrates artifact upload calls before MLflow run termination in two exit paths
primus/configs/modules/megatron/primus_megatron_module.yaml Adds mlflow_upload_traces and mlflow_upload_logs config options (both default to true)
examples/run_slurm_pretrain.sh Implements timestamp-based output directory naming and exports timestamp for multi-node consistency
examples/run_pretrain.sh Adds conditional timestamp generation to support both single-node and multi-node scenarios, fixes typo in log message
examples/run_local_pretrain.sh Adds MLflow environment variables and Primus path variables to Docker container environment

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link

Copilot AI commented Dec 18, 2025

@gphuang I've opened a new pull request, #441, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI review requested due to automatic review settings December 18, 2025 10:30
@gphuang gphuang force-pushed the feat/6-enable-mlflow-uploading branch from 3c149be to 13dfa81 Compare December 18, 2025 10:33
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings December 18, 2025 10:37
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

gphuang and others added 2 commits December 18, 2025 15:15
The experiment name contains square brackets like [deepseek_v2_lite-pretrain_...]-rank[0]
which are interpreted as glob pattern character classes, causing glob.glob to
return empty results even though files exist.

Fixed by using glob.escape() on directory paths before using them with glob.glob().
Copilot AI review requested due to automatic review settings December 19, 2025 08:26
@gphuang gphuang marked this pull request as ready for review December 19, 2025 08:26
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

When mlflow_upload_traces or mlflow_upload_logs is True:
- Auto-enable mlflow (set disable_mlflow=False)
- Auto-enable profiling if trace upload is requested

This removes the need to explicitly set:
- --disable_mlflow=False
- --profile=True
- --use_pytorch_profiler=True
The profiler saves traces to tensorboard_dir, which is None when
tensorboard is disabled. This caused a TypeError during trace save.

Moved auto-enable logic before tensorboard section and added
tensorboard auto-enable when mlflow_upload_traces is True.
Copilot AI review requested due to automatic review settings February 5, 2026 14:16
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

Copilot AI review requested due to automatic review settings February 9, 2026 08:21
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Copilot AI review requested due to automatic review settings February 9, 2026 12:26
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

gphuang and others added 3 commits February 9, 2026 14:47
…anks participate

Co-authored-by: Cursor <cursoragent@cursor.com>
…ame process

Co-authored-by: Cursor <cursoragent@cursor.com>
@gphuang
Copy link
Author

gphuang commented Feb 9, 2026

@wenxie-amd @Xiaoming-AMD @limou102 Could you please review?

Copilot AI review requested due to automatic review settings February 10, 2026 08:01
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

gphuang and others added 2 commits February 10, 2026 10:30
Scope MLflow artifact imports to call sites, add exception detail and tracebacks,
and avoid forcing default upload flags when args omit them.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings February 11, 2026 09:02
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Align Slurm EXP defaults with local training, ensure finalize barriers
wait for uploads, and update tests to match handled exceptions.

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants