Add parallelism_auto flag to automatically set dp, tp and micro batch size
#516
base: main
Conversation
…hardcode-parallelism-data
```rust
fn get_gpu_type() -> String {
    // Try nvidia-smi first
    let raw = Command::new("nvidia-smi")
```
small nit: maybe we can rename this variable to something better?
```rust
use std::process::Command;

fn get_gpu_type() -> String {
    // Try nvidia-smi first
    let raw_gpu_name = Command::new("nvidia-smi")
        .args(["--query-gpu=name", "--format=csv,noheader"])
        .output()
        .ok()
        .and_then(|o| String::from_utf8(o.stdout).ok())
        .and_then(|s| s.lines().next().map(|l| l.trim().to_string()))
        .filter(|s| !s.is_empty())
        // Fallback: read from /proc/driver/nvidia (works in containers without nvidia-smi)
        .or_else(|| {
            std::fs::read_dir("/proc/driver/nvidia/gpus")
                .ok()?
                .filter_map(|e| e.ok())
                .next()
                .and_then(|entry| {
                    let info_path = entry.path().join("information");
                    std::fs::read_to_string(info_path).ok()
                })
                .and_then(|content| {
                    content
                        .lines()
                        .find(|line| line.starts_with("Model:"))
                        .map(|line| line.trim_start_matches("Model:").trim().to_string())
                })
        })
        .unwrap_or_default();

    // Normalize GPU name to match table keys
    if raw_gpu_name.to_uppercase().contains("H200") {
        "H200".to_string()
    } else if raw_gpu_name.to_uppercase().contains("H100") {
        "H100".to_string()
    } else {
        raw_gpu_name
    }
}
```
have you considered using the nvml_wrapper crate instead of shelling out / reading /proc/fs stuff? we can grab gpu count from there too 🤷 and assert that they're all the same GPU for sanity checking :D
we use this in some of the metrics stuff already.
it would be something like:
```rust
use nvml_wrapper::Nvml;

#[derive(Debug)]
struct GpuInfo {
    name: String,
    device_count: u32,
}

fn get_gpu_info() -> anyhow::Result<GpuInfo> {
    // Query GPUs through NVML instead of shelling out to nvidia-smi
    let nvml = Nvml::init()?;
    let device_count = nvml.device_count()?;
    if device_count == 0 {
        anyhow::bail!("No GPUs found!");
    }
    let mut gpu_names = Vec::new();
    for i in 0..device_count {
        let device = nvml.device_by_index(i)?;
        gpu_names.push(device.name()?);
    }
    // Sanity check: a heterogeneous mix of GPUs would make the lookup ambiguous
    let first_name = &gpu_names[0];
    if !gpu_names.iter().all(|name| name == first_name) {
        anyhow::bail!(
            "All GPUs must be of the same type, but we have mismatching names: {:?}",
            gpu_names
        );
    }
    Ok(GpuInfo {
        name: gpu_names.pop().unwrap(),
        device_count,
    })
}
```
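For illustration, the table-key normalization from `get_gpu_type` could then sit on top of this; a sketch only, where `normalize_gpu_name` is a hypothetical helper and not part of this diff:

```rust
// Sketch: reuse the H100/H200 normalization from get_gpu_type on the NVML name.
// `normalize_gpu_name` is hypothetical and not part of the PR.
fn normalize_gpu_name(raw: &str) -> String {
    let upper = raw.to_uppercase();
    if upper.contains("H200") {
        "H200".to_string()
    } else if upper.contains("H100") {
        "H100".to_string()
    } else {
        raw.to_string()
    }
}

fn main() -> anyhow::Result<()> {
    let info = get_gpu_info()?;
    println!(
        "detected {} x {} -> table key {}",
        info.device_count,
        info.name,
        normalize_gpu_name(&info.name)
    );
    Ok(())
}
```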
Adds a --parallelism-auto flag to automatically detect optimal dp, tp, and micro_batch_size based on the model and GPU hardware.
Lookup order:
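The list is truncated here. As a rough sketch of what a hardcoded lookup keyed on model and normalized GPU name might look like (all keys and values below are hypothetical placeholders, not the PR's actual data):

```rust
use std::collections::HashMap;

/// Parallelism settings that --parallelism-auto would resolve.
#[derive(Debug, Clone, Copy)]
struct ParallelismConfig {
    dp: u32,
    tp: u32,
    micro_batch_size: u32,
}

/// Sketch: look up (model, normalized GPU name) in a hardcoded table.
/// Entries are placeholders for illustration only.
fn lookup_parallelism(model: &str, gpu: &str) -> Option<ParallelismConfig> {
    let table: HashMap<(&str, &str), ParallelismConfig> = HashMap::from([
        (("example-7b", "H100"), ParallelismConfig { dp: 8, tp: 1, micro_batch_size: 4 }),
        (("example-70b", "H200"), ParallelismConfig { dp: 2, tp: 4, micro_batch_size: 1 }),
    ]);
    table.get(&(model, gpu)).copied()
}
```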