feat: Add TraceLens integration for trace analysis with MLflow upload#439
Conversation
- Add TraceLens trace analysis report generation (XLSX, CSV formats)
- Add mlflow_upload_tracelens_report config option (default: false)
- Add mlflow_tracelens_ranks, mlflow_tracelens_max_reports options
- Add mlflow_tracelens_output_format option (all, xlsx, csv)
- Auto-install TraceLens from GitHub if not present
- Upload analysis reports to MLflow artifacts/trace_analysis/
Pull request overview
This PR adds TraceLens integration to automatically generate performance analysis reports from PyTorch profiler traces and upload them to MLflow. TraceLens is auto-installed from GitHub if not present, and users can configure rank filtering, report limits, and output formats (XLSX/CSV).
Key changes:
- New module for MLflow artifact management with TraceLens integration
- Automatic TraceLens installation from GitHub with fallback CSV generation
- Configuration options to control trace analysis (ranks, max reports, output formats)
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| primus/backends/megatron/training/mlflow_artifacts.py | New 725-line module implementing trace/log file uploads and TraceLens report generation with fallback CSV summary |
| primus/backends/megatron/training/global_vars.py | Adds import and wrapper function upload_mlflow_artifacts to expose artifact upload functionality |
| primus/modules/trainer/megatron/trainer.py | Calls upload_mlflow_artifacts before ending the MLflow run, with configuration parameters from args |
| primus/configs/modules/megatron/primus_megatron_module.yaml | Adds 6 new configuration options for controlling trace/log uploads and TraceLens analysis |
Comments suppressed due to low confidence (2)
primus/backends/megatron/training/mlflow_artifacts.py:382
- Variable dfs is not used.
dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)
primus/backends/megatron/training/mlflow_artifacts.py:370
- This assignment to 'dfs' is unnecessary as it is redefined before this value is used.
dfs = generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)
Copilot reviewed 4 out of 4 changed files in this pull request and generated 20 comments.
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
Force-pushed df2e40a to 2861bdf
Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.
…config parser

Addresses Copilot review comment: if mlflow_tracelens_ranks is configured as a string in YAML (e.g., '[0,8]' instead of [0, 8]), the code would receive a string instead of a list, causing _filter_traces_by_rank to silently filter out all trace files.

Added ast.literal_eval() conversion in:
- generate_tracelens_reports()
- upload_tracelens_reports_to_mlflow()

Falls back to None (process all ranks) with a warning if parsing fails.
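The normalization described above can be sketched as follows; `parse_ranks` is a hypothetical name standing in for the conversion logic added to the two functions named in the commit:

```python
import ast

def parse_ranks(ranks):
    """Normalize a ranks option that may arrive as a list or a string.

    YAML such as mlflow_tracelens_ranks: '[0,8]' yields a string rather
    than a list; ast.literal_eval recovers the list. On any parse
    failure, fall back to None, meaning "process all ranks".
    """
    if ranks is None or isinstance(ranks, list):
        return ranks
    if isinstance(ranks, str):
        try:
            parsed = ast.literal_eval(ranks)
            if isinstance(parsed, (list, tuple)):
                return list(parsed)
        except (ValueError, SyntaxError):
            pass  # the real code logs a warning here
    return None  # process all ranks
```

Returning None instead of an empty list is what prevents the silent filter-everything-out failure mode.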
When output_format='all', the trace file was previously parsed twice:
- Once for XLSX generation
- Once for CSV generation

Now when the format is 'all', generate_perf_report_pytorch is called once with both the output_xlsx_path and output_csvs_dir parameters, parsing the trace file only once and generating both formats from the same data. This significantly improves performance for the common case of generating both report formats.
After TraceLens reports are successfully uploaded to MLflow, the local tracelens_reports directory is automatically cleaned up to save disk space. This addresses the issue of temporary directories not being cleaned up after artifact upload; the reports remain accessible in MLflow while local storage is freed.

Other directories checked:
- tensorboard_dir: contains original trace files, NOT temporary
- exp_root_path/logs: contains original log files, NOT temporary
- tracelens_reports: processed reports uploaded to MLflow, safe to clean up
Added mlflow_tracelens_cleanup_after_upload parameter to control whether local TraceLens reports are removed after upload to MLflow.

Default: true (clean up to save disk space); set to false to keep reports locally for inspection/debugging.

Changes:
- Added cleanup_after_upload parameter to upload_tracelens_reports_to_mlflow()
- Added tracelens_cleanup_after_upload to upload_artifacts_to_mlflow()
- Added mlflow_tracelens_cleanup_after_upload config in YAML (default: true)
- Updated trainer to pass through the parameter

Use cases:
- true (default): production runs, save disk space
- false: development/debugging, keep local copies for inspection
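A sketch of the guarded cleanup step, assuming the flag is threaded through as described (`cleanup_reports` is an illustrative helper name, not the PR's actual function):

```python
import shutil
from pathlib import Path

def cleanup_reports(reports_dir, cleanup_after_upload=True):
    """Remove the local tracelens_reports directory after a successful
    MLflow upload.

    Guarded by a flag so that debugging runs can keep local copies
    (mlflow_tracelens_cleanup_after_upload: false). Returns True if the
    directory was removed.
    """
    if cleanup_after_upload and Path(reports_dir).is_dir():
        shutil.rmtree(reports_dir)
        return True
    return False
```

Removing only after a successful upload keeps the reports recoverable if the MLflow call fails.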
Normalize and validate TraceLens rank filters, warn on invalid values, and clarify where XLSX/CSV outputs land for all output formats. Co-authored-by: Cursor <cursoragent@cursor.com>
Rename perf/{mem_collector}_peak_mem_gb to current_mem_gb to reflect
instantaneous memory usage rather than a peak value.
Co-authored-by: Cursor <cursoragent@cursor.com>
Parse rocm-smi GPU utilization using labeled/percentage values to avoid misreading device indices, and clarify TraceLens report item logging. Co-authored-by: Cursor <cursoragent@cursor.com>
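The labeled-value parsing described above could look like the sketch below; the exact line format of rocm-smi output ("GPU[0] : GPU use (%): 87") is an assumption and varies by version, which is why matching the label rather than a positional column is the point:

```python
import re

def parse_gpu_utilization(smi_output):
    """Extract per-GPU utilization from rocm-smi text output.

    Matches the labeled percentage field ("GPU use (%)") instead of a
    positional column, so a device index is never misread as a
    utilization value. Returns {gpu_index: percent}.
    """
    utilization = {}
    pattern = re.compile(r"GPU\[(\d+)\].*GPU use \(%\):\s*(\d+)")
    for line in smi_output.splitlines():
        m = pattern.search(line)
        if m:
            utilization[int(m.group(1))] = int(m.group(2))
    return utilization
```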
Only run the openpyxl check after TraceLens imports, so CSV-only fallback paths avoid unnecessary runtime installs. Co-authored-by: Cursor <cursoragent@cursor.com>
Add timeout and stderr logging for TraceLens installs, skip install when rank validation fails, and clarify MLflow artifact call behavior. Co-authored-by: Cursor <cursoragent@cursor.com>
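A sketch of an install runner with the timeout and stderr logging this commit adds; `run_install` is an illustrative name, and the caller is assumed to build the actual pip-from-GitHub command (omitted here):

```python
import subprocess
import sys

def run_install(cmd, timeout=300):
    """Run an install command with a timeout, logging stderr on failure.

    cmd is whatever the caller builds, e.g.
    [sys.executable, "-m", "pip", "install", <git url>].
    Returns True on success, False on timeout or nonzero exit.
    """
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        print(f"install timed out after {timeout}s", file=sys.stderr)
        return False
    if result.returncode != 0:
        # Surface stderr so a failed GitHub install is diagnosable.
        print(f"install failed: {result.stderr.strip()}", file=sys.stderr)
        return False
    return True
```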
Normalize TraceLens output formats, improve CSV handling, add auto-install controls and timeouts, and harden MLflow/ROCm metric handling and docs. Co-authored-by: Cursor <cursoragent@cursor.com>
Avoid duplicate local generation in distributed runs, align default output_format with xlsx, and downgrade to CSV when openpyxl is unavailable. Co-authored-by: Cursor <cursoragent@cursor.com>
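The openpyxl downgrade can be sketched as a small format resolver (`resolve_output_format` is an illustrative name); checking for the module without importing or installing it is what keeps CSV-only paths free of runtime installs:

```python
import importlib.util

def resolve_output_format(requested="xlsx"):
    """Downgrade XLSX output to CSV when openpyxl is unavailable,
    instead of installing it at runtime or failing the run."""
    needs_xlsx = requested in ("xlsx", "all")
    if needs_xlsx and importlib.util.find_spec("openpyxl") is None:
        return "csv"  # the real code logs the downgrade
    return requested
```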
Update docs to state the last rank (writer) performs TraceLens artifact uploads in distributed runs. Co-authored-by: Cursor <cursoragent@cursor.com>
Align mlflow_setup.py docstring with the actual default of 'xlsx'. Co-authored-by: Cursor <cursoragent@cursor.com>
feat: Add TraceLens integration for trace analysis with MLflow upload
Adds TraceLens trace analysis capability to automatically generate performance
reports from PyTorch profiler traces and upload them to MLflow.
Addresses review feedback and adds tests for TraceLens report generation and MLflow artifact upload. Keeps MLflow opt-in, makes local-only TraceLens work without MLflow, and tightens safety/docs.
Features

TraceLens analysis reports are uploaded to MLflow under artifacts/trace_analysis/.

Config Options
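A sketch of the new block in primus_megatron_module.yaml; defaults reflect the values stated in this PR, the max_reports and ranks values are illustrative, and exact key spellings are assumed to match the CLI flags shown below:

```yaml
mlflow_upload_tracelens_report: false        # master switch for TraceLens reports
mlflow_tracelens_ranks: [0, 8]               # ranks to analyze; null = all ranks
mlflow_tracelens_max_reports: 4              # cap on generated reports (illustrative)
mlflow_tracelens_output_format: xlsx         # one of: all, xlsx, csv
mlflow_tracelens_cleanup_after_upload: true  # remove local reports after upload
```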
Example
Code

```shell
export CONFIG_NAME="deepseek_v2_lite-FP8-pretrain"
export EXP="examples/megatron/configs/MI300X/${CONFIG_NAME}.yaml"
export MLFLOW_RUN_NAME="${CONFIG_NAME}.single-node.baseline"

bash ./examples/run_pretrain.sh \
    --train_iters=5 \
    --profile_step_start=2 \
    --profile_step_end=4 \
    --profile_ranks=ALL \
    --mlflow_run_name=${MLFLOW_RUN_NAME} \
    --mlflow_experiment_name=/Performance-data/Megatron-LM/primus-test \
    --mlflow_upload_performance_metrics=True \
    --mlflow_upload_tracelens_report=True
```
Output
Single-node run
Multi-node (2 nodes) run