Conversation

pefontana (Contributor) commented Jan 23, 2026

Adds a --parallelism-auto flag that automatically selects dp, tp, and micro_batch_size based on the model and the GPU hardware.

  1. Detects the GPU type via nvidia-smi (falling back to /proc/driver/nvidia/gpus/*/information for containers without nvidia-smi)
  2. Looks up the config from JSON: model repo → GPU type → number of GPUs → {dp, tp, micro_batch_size}

Lookup order:

  1. Fetch parallelism_data.json from the model's HuggingFace repo (preferred)
  2. Fall back to the compiled-in parallelism_data.json, logging a warning

Example parallelism_data.json schema:

```json
{
  "your-org/your-model": {
    "H100": {
      "1": { "dp": 1, "tp": 1, "micro_batch_size": 4 },
      "8": { "dp": 4, "tp": 2, "micro_batch_size": 4 }
    }
  }
}
```
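The nested lookup described above (model repo → GPU type → number of GPUs) boils down to three map hops. A minimal std-only sketch, not the PR's actual implementation — the names `ParallelismConfig` and `lookup_parallelism` are invented here for illustration:

```rust
use std::collections::HashMap;

// Sketch of the per-configuration record; field names follow the JSON schema.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ParallelismConfig {
    dp: u32,
    tp: u32,
    micro_batch_size: u32,
}

/// model repo -> GPU type -> number of GPUs -> config
type ParallelismTable = HashMap<String, HashMap<String, HashMap<u32, ParallelismConfig>>>;

/// Walk the three nesting levels; any missing key yields None so the
/// caller can fall back to defaults (hypothetical helper, not the PR's code).
fn lookup_parallelism(
    table: &ParallelismTable,
    model: &str,
    gpu_type: &str,
    num_gpus: u32,
) -> Option<ParallelismConfig> {
    table.get(model)?.get(gpu_type)?.get(&num_gpus).copied()
}

fn main() {
    let mut table = ParallelismTable::new();
    table
        .entry("your-org/your-model".to_string())
        .or_default()
        .entry("H100".to_string())
        .or_default()
        .insert(8, ParallelismConfig { dp: 4, tp: 2, micro_batch_size: 4 });

    // Present entry resolves; any missing level returns None.
    let cfg = lookup_parallelism(&table, "your-org/your-model", "H100", 8);
    assert_eq!(cfg, Some(ParallelismConfig { dp: 4, tp: 2, micro_batch_size: 4 }));
    assert_eq!(lookup_parallelism(&table, "your-org/your-model", "A100", 8), None);
    println!("ok");
}
```

In practice the table would be deserialized from the fetched parallelism_data.json rather than built by hand as above.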

pefontana changed the title from "Hardcode parallelism data" to "Add parallelism_auto flag to automatically set dp, tp and micro batch size" on Jan 26, 2026
pefontana marked this pull request as ready for review on January 26, 2026 at 20:40

```rust
fn get_gpu_type() -> String {
    // Try nvidia-smi first
    let raw = Command::new("nvidia-smi")
```
A Contributor commented:

small nit: maybe we can rename this variable to something better?
Comment on lines +21 to +57
```rust
fn get_gpu_type() -> String {
    // Try nvidia-smi first
    let raw_gpu_name = Command::new("nvidia-smi")
        .args(["--query-gpu=name", "--format=csv,noheader"])
        .output()
        .ok()
        .and_then(|o| String::from_utf8(o.stdout).ok())
        .and_then(|s| s.lines().next().map(|l| l.trim().to_string()))
        .filter(|s| !s.is_empty())
        // Fallback: read from /proc/driver/nvidia (works in containers without nvidia-smi)
        .or_else(|| {
            std::fs::read_dir("/proc/driver/nvidia/gpus")
                .ok()?
                .filter_map(|e| e.ok())
                .next()
                .and_then(|entry| {
                    let info_path = entry.path().join("information");
                    std::fs::read_to_string(info_path).ok()
                })
                .and_then(|content| {
                    content
                        .lines()
                        .find(|line| line.starts_with("Model:"))
                        .map(|line| line.trim_start_matches("Model:").trim().to_string())
                })
        })
        .unwrap_or_default();

    // Normalize GPU name to match table keys
    if raw_gpu_name.to_uppercase().contains("H200") {
        "H200".to_string()
    } else if raw_gpu_name.to_uppercase().contains("H100") {
        "H100".to_string()
    } else {
        raw_gpu_name
    }
}
```
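The /proc fallback in the function above comes down to extracting the `Model:` line from the `information` file. That step could be pulled into a standalone, unit-testable helper; a sketch (the name `parse_model_line` and the sample file contents are made up here, mirroring the parsing logic above):

```rust
/// Extract the GPU model from the contents of a
/// /proc/driver/nvidia/gpus/*/information file (hypothetical helper,
/// same parsing as the fallback branch in get_gpu_type).
fn parse_model_line(information: &str) -> Option<String> {
    information
        .lines()
        .find(|line| line.starts_with("Model:"))
        .map(|line| line.trim_start_matches("Model:").trim().to_string())
}

fn main() {
    // Sample content in the style of the /proc information file (assumed format).
    let sample = "Model: \t NVIDIA H100 80GB HBM3\nIRQ:   130\n";
    assert_eq!(parse_model_line(sample).as_deref(), Some("NVIDIA H100 80GB HBM3"));
    // No Model: line -> None, so get_gpu_type falls through to the default.
    assert_eq!(parse_model_line("GPU UUID: abc\n"), None);
    println!("ok");
}
```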
A Collaborator commented:
have you considered using the nvml_wrapper crate instead of shelling out / reading /proc/fs stuff? we can grab gpu count from there too 🤷 and assert that they're all the same GPU for sanity checking :D

we use this in some of the metrics stuff already; it would be something like:

```rust
use nvml_wrapper::Nvml;

#[derive(Debug)]
struct GpuInfo {
    name: String,
    device_count: u32,
}

fn get_gpu_info() -> anyhow::Result<GpuInfo> {
    let nvml = Nvml::init()?;
    let device_count = nvml.device_count()?;

    if device_count == 0 {
        anyhow::bail!("No GPUs found!");
    }

    let mut gpu_names = Vec::new();

    for i in 0..device_count {
        let device = nvml.device_by_index(i)?;
        gpu_names.push(device.name()?);
    }

    let first_name = &gpu_names[0];
    if !gpu_names.iter().all(|name| name == first_name) {
        anyhow::bail!(
            "All GPUs must be of the same type, but we have mismatching names: {:?}",
            gpu_names
        );
    }

    Ok(GpuInfo {
        name: gpu_names.pop().unwrap(),
        device_count,
    })
}
```
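Whichever detection path is used (nvidia-smi, /proc, or NVML), the raw device name, e.g. "NVIDIA H100 80GB HBM3", still has to be mapped to the lookup-table keys. A minimal sketch that factors out the normalization branch already present in get_gpu_type (the helper name `normalize_gpu_name` is invented here):

```rust
/// Map a raw device name to a parallelism-table key. Mirrors the
/// normalization in get_gpu_type; hypothetical standalone helper.
fn normalize_gpu_name(raw: &str) -> String {
    let upper = raw.to_uppercase();
    if upper.contains("H200") {
        "H200".to_string()
    } else if upper.contains("H100") {
        "H100".to_string()
    } else {
        // Unrecognized GPUs pass through unchanged, so the table lookup
        // simply misses and the caller can fall back to defaults.
        raw.to_string()
    }
}

fn main() {
    assert_eq!(normalize_gpu_name("NVIDIA H100 80GB HBM3"), "H100");
    assert_eq!(normalize_gpu_name("NVIDIA H200"), "H200");
    assert_eq!(normalize_gpu_name("Tesla T4"), "Tesla T4");
    println!("ok");
}
```

Keeping this as a pure function would also make the reviewer's "all GPUs must match" sanity check testable against the normalized names rather than the raw ones.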
