diff --git a/README.md b/README.md
index 61abd7ce..559b8d28 100644
--- a/README.md
+++ b/README.md
@@ -225,11 +225,14 @@ Formatting is only available for jobs and not for partition and partition_hourly
 Do note: `azslurm cost` relies on slurm's admincomment feature to associate specific vm_size and meter info for jobs.
 
 ### Topology
-`azslurm` in slurm 4.0 project upgrades `azslurm generate_topology` to `azslurm topology` to generate the topology plugin configuration for slurm either using VMSS topology or a fabric manager that has SHARP enabled.
+In the slurm 4.0 project, `azslurm` upgrades `azslurm generate_topology` to `azslurm topology`, which generates the [topology plugin configuration](https://slurm.schedmd.com/topology.html) for slurm from VMSS topology, from a fabric manager that has SHARP enabled, or from the NVLink Domain. `azslurm topology` can generate both tree and block topology plugin configurations for Slurm. Users may use `azslurm topology` to generate the topology file but must get it into `/etc/slurm/topology.conf` themselves, either by giving that path as the output file or by copying the generated file over (see the sketch after the Fabric Manager example below). Additionally, users must specify `topologyType=tree|block` in `slurm.conf` for full functionality.
+
+Note: `azslurm topology` is only useful in manually scaled clusters or clusters of fixed size. Autoscaling does not take topology into account, and topology is not updated on autoscale.
 ```
 usage: azslurm topology [-h] [--config CONFIG] [-p PARTITION] [-o OUTPUT]
-                        [-v | -f]
+                        [-v | -f | -n] [-b | -t] [-s BLOCK_SIZE] [--viz]
+                        [--visual_block_size VISUAL_BLOCK_SIZE]
 
 optional arguments:
   -h, --help            show this help message and exit
@@ -238,13 +241,35 @@ optional arguments:
                         Specify the partition
   -o OUTPUT, --output OUTPUT
                         Specify slurm topology file output
-  -v, --use_vmss        Use VMSS (default: True)
+  -v, --use_vmss        Use VMSS to map Tree or Block topology along VMSS
+                        boundaries without special network consideration
+                        (default: True)
   -f, --use_fabric_manager
-                        Use Fabric Manager (default: False)
+                        Use Fabric Manager to map Tree topology (Block
+                        topology not allowed) according to the SHARP network
+                        topology tool (default: False)
+  -n, --use_nvlink_domain
+                        Use NVLink domain to map Block topology (Tree topology
+                        not allowed) according to NVLink Domain and Partition
+                        for multi-node NVLink (default: False)
+  -b, --block           Generate Block Topology output to use Block topology
+                        plugin (default: False)
+  -t, --tree            Generate Tree Topology output to use Tree topology
+                        plugin (default: False)
+  -s BLOCK_SIZE, --block_size BLOCK_SIZE
+                        Minimum block size required for each block (use with
+                        --block or --use_nvlink_domain, default: 1)
+  --viz                 Generate ASCII visualization for the topology
+                        (default: False)
+  --visual_block_size VISUAL_BLOCK_SIZE
+                        Block size for visualization (default: 18)
 ```
-To generate slurm topology using VMSS:
+To generate slurm topology using VMSS, you may optionally specify the type of topology, which defaults to tree:
 ```
 azslurm topology
+azslurm topology -v -t
 azslurm topology -o topology.conf
 ```
 This will print out the topology in the tree plugin format slurm wants for topology.conf, or create a file based on the output file given in the cli
@@ -253,9 +278,10 @@ This will print out the topology in the tree plugin format slurm wants for top
 SwitchName=htc
 Nodes=cluster-htc-1,cluster-htc-2,cluster-htc-3,cluster-htc-4,cluster-htc-5,cluster-htc-6,cluster-htc-7,cluster-htc-8,cluster-htc-9,cluster-htc-10,cluster-htc-11,cluster-htc-12,cluster-htc-13,cluster-htc-14,cluster-htc-15,cluster-htc-16,cluster-htc-17,cluster-htc-18,cluster-htc-19,cluster-htc-20,cluster-htc-21,cluster-htc-22,cluster-htc-23,cluster-htc-24,cluster-htc-25,cluster-htc-26,cluster-htc-27,cluster-htc-28,cluster-htc-29,cluster-htc-30,cluster-htc-31,cluster-htc-32,cluster-htc-33,cluster-htc-34,cluster-htc-35,cluster-htc-36,cluster-htc-37,cluster-htc-38,cluster-htc-39,cluster-htc-40,cluster-htc-41,cluster-htc-42,cluster-htc-43,cluster-htc-44,cluster-htc-45,cluster-htc-46,cluster-htc-47,cluster-htc-48,cluster-htc-49,cluster-htc-50
 SwitchName=Standard_F2s_v2_pg0
 Nodes=cluster-hpc-1,cluster-hpc-10,cluster-hpc-11,cluster-hpc-12,cluster-hpc-13,cluster-hpc-14,cluster-hpc-15,cluster-hpc-16,cluster-hpc-2,cluster-hpc-3,cluster-hpc-4,cluster-hpc-5,cluster-hpc-6,cluster-hpc-7,cluster-hpc-8,cluster-hpc-9
 ```
-To generate slurm topology using Fabric Manager you need a SHARP enabled cluster and it is required you specify a partition:
+To generate slurm topology using Fabric Manager, you need a SHARP-enabled cluster and must specify a partition; you may optionally specify the tree plugin, which is the default:
 ```
 azslurm topology -f -p gpu
+azslurm topology -f -p gpu -t
 azslurm topology -f -p gpu -o topology.conf
 ```
 ```
@@ -275,7 +301,167 @@ SwitchName=sw02
 Nodes=ccw-gpu-192
 SwitchName=sw03
 Nodes=ccw-gpu-13,ccw-gpu-142,ccw-gpu-26,ccw-gpu-136,ccw-gpu-163,ccw-gpu-138,ccw-gpu-187,ccw-gpu-88
 ```
-This either prints out the topology in slurm topology format or creates an output file with the topology
+This either prints out the topology in slurm topology format or creates an output file with the topology.
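+
+A minimal sketch of wiring the generated file into a running cluster (the plugin name and the reload step come from the Slurm documentation rather than from `azslurm`; adjust paths for your site):
+```
+# write the topology file directly into place, then reload slurmctld
+azslurm topology -f -p gpu -o /etc/slurm/topology.conf
+# slurm.conf must load the matching plugin, e.g. TopologyPlugin=topology/tree
+# (or topology/block when generating block output)
+scontrol reconfigure
+```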
+
+To generate slurm topology using NVLink Domain, you need to specify a partition; you may optionally specify a minimum block size (default: 1) as well as the block option, which is the default:
+```
+azslurm topology -n -p gpu
+azslurm topology -n -p gpu -b -s 5
+azslurm topology -n -p gpu -b -s 5 -o topology.conf
+```
+```
+# Number of Nodes in block1: 18
+# ClusterUUID and CliqueID: b78ed242-7b98-426f-b194-b76b8899f4ec 32766
+BlockName=block1 Nodes=ccw-1-3-gpu-21,ccw-1-3-gpu-407,ccw-1-3-gpu-333,ccw-1-3-gpu-60,ccw-1-3-gpu-387,ccw-1-3-gpu-145,ccw-1-3-gpu-190,ccw-1-3-gpu-205,ccw-1-3-gpu-115,ccw-1-3-gpu-236,ccw-1-3-gpu-164,ccw-1-3-gpu-180,ccw-1-3-gpu-195,ccw-1-3-gpu-438,ccw-1-3-gpu-305,ccw-1-3-gpu-255,ccw-1-3-gpu-14,ccw-1-3-gpu-400
+# Number of Nodes in block2: 16
+# ClusterUUID and CliqueID: cc79d754-915f-408b-b1c3-b8c3aa6668ab 32766
+BlockName=block2 Nodes=ccw-1-3-gpu-464,ccw-1-3-gpu-7,ccw-1-3-gpu-454,ccw-1-3-gpu-344,ccw-1-3-gpu-91,ccw-1-3-gpu-217,ccw-1-3-gpu-324,ccw-1-3-gpu-43,ccw-1-3-gpu-188,ccw-1-3-gpu-97,ccw-1-3-gpu-434,ccw-1-3-gpu-172,ccw-1-3-gpu-153,ccw-1-3-gpu-277,ccw-1-3-gpu-147,ccw-1-3-gpu-354
+# Number of Nodes in block3: 8
+# ClusterUUID and CliqueID: 0e568355-d588-4a53-8166-8200c2c1ef55 32766
+BlockName=block3 Nodes=ccw-1-3-gpu-31,ccw-1-3-gpu-52,ccw-1-3-gpu-297,ccw-1-3-gpu-319,ccw-1-3-gpu-349,ccw-1-3-gpu-62,ccw-1-3-gpu-394,ccw-1-3-gpu-122
+# Number of Nodes in block4: 9
+# ClusterUUID and CliqueID: e3656d04-00db-4ad6-9a42-5df790994e41 32766
+BlockName=block4 Nodes=ccw-1-3-gpu-5,ccw-1-3-gpu-17,ccw-1-3-gpu-254,ccw-1-3-gpu-284,ccw-1-3-gpu-249,ccw-1-3-gpu-37,ccw-1-3-gpu-229,ccw-1-3-gpu-109,ccw-1-3-gpu-294
+BlockSizes=5
+```
+This either prints out the topology in slurm topology format or creates an output file with the topology.
+
+To create a visualization of the topology output, add `--viz` to the command for `--use_nvlink_domain` and `--use_fabric_manager`; it can optionally take `--visual_block_size` to set the grid size used for each block (default: 18).
+
+Example Block topology visual
+```
+azslurm topology -n -p gpu --viz
+```
+```
+block 1 : # of Nodes = 18
+ClusterUUID + CliqueID : 5e797fc6-0f46-421a-8724-0e102c0f723c
+|-----------------|-----------------|-----------------|
+| hpcbench-hpc-1  | hpcbench-hpc-5  | hpcbench-hpc-9  |
+|-----------------|-----------------|-----------------|
+| hpcbench-hpc-13 | hpcbench-hpc-17 | hpcbench-hpc-21 |
+|-----------------|-----------------|-----------------|
+| hpcbench-hpc-25 | hpcbench-hpc-29 | hpcbench-hpc-33 |
+|-----------------|-----------------|-----------------|
+| hpcbench-hpc-37 | hpcbench-hpc-41 | hpcbench-hpc-45 |
+|-----------------|-----------------|-----------------|
+| hpcbench-hpc-49 | hpcbench-hpc-53 | hpcbench-hpc-57 |
+|-----------------|-----------------|-----------------|
+| hpcbench-hpc-61 | hpcbench-hpc-65 | hpcbench-hpc-69 |
+|-----------------|-----------------|-----------------|
+
+block 2 : # of Nodes = 18
+ClusterUUID + CliqueID : 5e797fc6-0f46-421a-8724-0e102c0f721e
+|-----------------|-----------------|-----------------|
+| hpcbench-hpc-2  | hpcbench-hpc-6  | hpcbench-hpc-10 |
+|-----------------|-----------------|-----------------|
+| hpcbench-hpc-14 | hpcbench-hpc-18 | hpcbench-hpc-22 |
+|-----------------|-----------------|-----------------|
+| hpcbench-hpc-26 | hpcbench-hpc-30 | hpcbench-hpc-34 |
+|-----------------|-----------------|-----------------|
+| hpcbench-hpc-38 | hpcbench-hpc-42 | hpcbench-hpc-46 |
+|-----------------|-----------------|-----------------|
+| hpcbench-hpc-50 | hpcbench-hpc-54 | hpcbench-hpc-58 |
+|-----------------|-----------------|-----------------|
+| hpcbench-hpc-62 | hpcbench-hpc-66 | hpcbench-hpc-70 |
+|-----------------|-----------------|-----------------|
+
+block 3 : # of Nodes = 18
+ClusterUUID + CliqueID : 5e797fc6-0f46-421a-8724-0e102c0f724b
+|-----------------|-----------------|-----------------|
+| hpcbench-hpc-3  | hpcbench-hpc-7  | hpcbench-hpc-11 |
+|-----------------|-----------------|-----------------|
+| hpcbench-hpc-15 | hpcbench-hpc-19 | hpcbench-hpc-23 |
+|-----------------|-----------------|-----------------|
+| hpcbench-hpc-27 | hpcbench-hpc-31 | hpcbench-hpc-35 |
+|-----------------|-----------------|-----------------|
+| hpcbench-hpc-39 | hpcbench-hpc-43 | hpcbench-hpc-47 |
+|-----------------|-----------------|-----------------|
+| hpcbench-hpc-51 | hpcbench-hpc-55 | hpcbench-hpc-59 |
+|-----------------|-----------------|-----------------|
+| hpcbench-hpc-63 | hpcbench-hpc-67 | hpcbench-hpc-71 |
+|-----------------|-----------------|-----------------|
+
+block 4 : # of Nodes = 18
+ClusterUUID + CliqueID : 5e797fc6-0f46-421a-8724-0e102c0f722a
+|-----------------|-----------------|-----------------|
+| hpcbench-hpc-4  | hpcbench-hpc-8  | hpcbench-hpc-12 |
+|-----------------|-----------------|-----------------|
+| hpcbench-hpc-16 | hpcbench-hpc-20 | hpcbench-hpc-24 |
+|-----------------|-----------------|-----------------|
+| hpcbench-hpc-28 | hpcbench-hpc-32 | hpcbench-hpc-36 |
+|-----------------|-----------------|-----------------|
+| hpcbench-hpc-40 | hpcbench-hpc-44 | hpcbench-hpc-48 |
+|-----------------|-----------------|-----------------|
+| hpcbench-hpc-52 | hpcbench-hpc-56 | hpcbench-hpc-60 |
+|-----------------|-----------------|-----------------|
+| hpcbench-hpc-64 | hpcbench-hpc-68 | hpcbench-hpc-72 |
+|-----------------|-----------------|-----------------|
+```
+Example Block topology visual for a block with fewer nodes than `visual_block_size`
+```
+block 1 : # of Nodes = 6
+ClusterUUID + CliqueID : N/A N/A
+|---------------|---------------|---------------|
+| vis0603-gpu-2 | vis0603-gpu-5 | vis0603-gpu-1 |
+|---------------|---------------|---------------|
+| vis0603-gpu-3 | vis0603-gpu-4 | vis0603-gpu-6 |
+|---------------|---------------|---------------|
+|       X       |       X       |       X       |
+|---------------|---------------|---------------|
+|       X       |       X       |       X       |
+|---------------|---------------|---------------|
+|       X       |       X       |       X       |
+|---------------|---------------|---------------|
+|       X       |       X       |       X       |
+|---------------|---------------|---------------|
+```
+Example Block topology visual for a block with fewer nodes than the minimum block size
+```
+block 1 : # of Nodes = 6
+ClusterUUID + CliqueID : N/A N/A
+** This block is ineligible for scheduling because # of nodes < min block size 8**
+|---------------|---------------|---------------|
+| vis0603-gpu-4 | vis0603-gpu-2 | vis0603-gpu-5 |
+|---------------|---------------|---------------|
+| vis0603-gpu-1 | vis0603-gpu-6 | vis0603-gpu-3 |
+|---------------|---------------|---------------|
+|       X       |       X       |       X       |
+|---------------|---------------|---------------|
+|       X       |       X       |       X       |
+|---------------|---------------|---------------|
+|       X       |       X       |       X       |
+|---------------|---------------|---------------|
+|       X       |       X       |       X       |
+|---------------|---------------|---------------|
+```
+Example Tree Topology visual
+```
+azslurm topology -f -p hpc --viz
+```
+```
+Switch 5 (root)
+├── Switch 0 (3 nodes)
+│   ├── hpcbench-hpc-36
+│   ├── hpcbench-hpc-39
+│   └── hpcbench-hpc-42
+├── Switch 1 (6 nodes)
+│   ├── hpcbench-hpc-1
+│   ├── hpcbench-hpc-35
+│   ├── hpcbench-hpc-38
+│   ├── hpcbench-hpc-41
+│   ├── hpcbench-hpc-44
+│   └── hpcbench-hpc-49
+├── Switch 2 (3 nodes)
+│   ├── hpcbench-hpc-37
+│   ├── hpcbench-hpc-45
+│   └── hpcbench-hpc-46
+├── Switch 3 (2 nodes)
+│   ├── hpcbench-hpc-40
+│   └── hpcbench-hpc-43
+└── Switch 4 (2 nodes)
+    ├── hpcbench-hpc-47
+    └── hpcbench-hpc-48
+```
 
 ### GB200 IMEX Support
 Cyclecloud Slurm clusters now include prolog and epilog scripts to enable and clean up the IMEX service on a per-job basis. The prolog script will attempt to kill an existing IMEX service before configuring a new instance specific to the newly submitted job. The epilog script terminates the IMEX service. By default, these scripts will run for GB200 nodes and not run for non-GB200 nodes. A configurable parameter `slurm.imex.enabled` has been added to the slurm cluster configuration template to allow non-GB200 nodes to enable IMEX support for their jobs or allow GB200 nodes to disable IMEX support for their jobs.
diff --git a/azure-slurm/slurmcc/cli.py b/azure-slurm/slurmcc/cli.py
index b566fcc0..b3baa011 100644
--- a/azure-slurm/slurmcc/cli.py
+++ b/azure-slurm/slurmcc/cli.py
@@ -197,8 +197,10 @@ def topology_parser(self, parser: ArgumentParser) -> None:
         topology_group.add_argument('-b', '--block', action='store_true', default=False, help='Generate Block Topology output to use Block topology plugin (default: False)')
         topology_group.add_argument('-t', '--tree', action='store_true', default=False, help='Generate Tree Topology output to use Tree topology plugin (default: False)')
         parser.add_argument("-s", "--block_size", type=int, required=False, help="Minimum block size required for each block (use with --block or --use_nvlink_domain, default: 1)")
+        parser.add_argument("--viz", action='store_true', default=False, help="Generate ASCII visualization for the topology (default: False)")
+        parser.add_argument("--visual_block_size", type=int, default=18, help="Block size for visualization (default: 18)")
 
-    def topology(self, config: Dict, partition, output, use_vmss, use_fabric_manager, use_nvlink_domain, tree, block, block_size) -> None:
+    def topology(self, config: Dict, partition, output, use_vmss, use_fabric_manager, use_nvlink_domain, tree, block, block_size, viz, visual_block_size) -> None:
         """
         Generates Topology Plugin Configuration
         """
@@ -210,7 +212,9 @@ def topology(self, config: Dict, partition, output, use_vmss, use_fabric_manager
                 topo_type = topology.TopologyType.TREE
             config_dir = config.get("config_dir")
             topo = topology.Topology(partition, output, topology.TopologyInput.FABRIC, topo_type, config_dir)
-            topo.run()
+            content = topo.run()
+            if viz:
+                print(topo.visualize(content))
         elif use_nvlink_domain:
             if not partition:
                 raise ValueError("--partition is required when using --use_nvlink_domain")
@@ -220,10 +224,12 @@ def topology(self, config: Dict, partition, output, use_vmss, use_fabric_manager
                 topo_type = topology.TopologyType.BLOCK
             config_dir = config.get("config_dir")
             topo = topology.Topology(partition, output, topology.TopologyInput.NVLINK, topo_type, config_dir, block_size)
-            topo.run()
+            content = topo.run()
+            if viz:
+                print(topo.visualize(content, max_block_size=visual_block_size))
         elif use_vmss:
-            if block or block_size:
-                raise ValueError("--block and --block_size are not supported with --use_vmss")
+            if block or block_size or partition or viz:
+                raise ValueError("--block, --block_size, --partition, and --viz are not supported with --use_vmss")
             if output:
                 with open(output, 'w', encoding='utf-8') as file_writer:
                     return _generate_topology(self._get_node_manager(config), file_writer)
diff --git a/azure-slurm/slurmcc/topology.py b/azure-slurm/slurmcc/topology.py
index 6706d9b6..7afe1a49 100644
--- a/azure-slurm/slurmcc/topology.py
+++ b/azure-slurm/slurmcc/topology.py
@@ -590,4 +590,21 @@ def run(self):
         else:
             print(content, end='')
             log.info("Printed slurm topology")
-    
\ No newline at end of file
+        return content
+
+
+
+    def visualize(self, topology_str: str, max_block_size: int = 1) -> str:
+        """
+        Visualizes a block or tree topology string as ASCII art.
+
+        Args:
+            topology_str (str): The topology string (output from write_block_topology or write_tree_topology).
+            max_block_size (int, optional): Maximum block size for grid visualization. Defaults to 1.
+        Returns:
+            str: ASCII visualization of the topology.
+        """
+        if self.topo_type == TopologyType.BLOCK:
+            return slutil.visualize_block(topology_str, self.block_size, max_block_size)
+        else:
+            return slutil.visualize_tree(topology_str)
diff --git a/azure-slurm/slurmcc/util.py b/azure-slurm/slurmcc/util.py
index 51c7324d..217dedc5 100644
--- a/azure-slurm/slurmcc/util.py
+++ b/azure-slurm/slurmcc/util.py
@@ -9,6 +9,7 @@ import traceback
 from abc import ABC, abstractmethod
 from typing import Any, Callable, Dict, List, Optional, Union
+import re
 
 from . import AzureSlurmError, custom_chaos_mode
 
@@ -337,4 +338,170 @@ def is_autoscale_enabled() -> bool:
     if _IS_AUTOSCALE_ENABLED is not None:
         return _IS_AUTOSCALE_ENABLED
     logging.warning("Could not determine if autoscale is enabled. Assuming yes")
-    return True
\ No newline at end of file
+    return True
+
+def visualize_block(topology_str: str, min_block_size: int, max_block_size: int = 1) -> str:
+    """
+    Visualizes a block topology string as ASCII art.
+
+    Args:
+        topology_str (str): The topology string (output from write_block_topology).
+        min_block_size (int): Minimum block size; smaller blocks are flagged as ineligible.
+        max_block_size (int, optional): Maximum block size for grid visualization. Defaults to 1.
+    Returns:
+        str: ASCII visualization of the block topology.
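+
+    Example (illustrative only; `topo_str` stands in for a write_block_topology result):
+        vis = visualize_block(topo_str, min_block_size=2, max_block_size=4)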
+ """ + block_pattern = re.compile( + r"# Number of Nodes in block(?P\d+): (?P\d+)\n" + r"# ClusterUUID and CliqueID: (?P.*)\n" + r"(?:# Warning:.*\n){0,2}" + r"(?P(#)?BlockName=block\d+ Nodes=(?P[^\n]+))" + ) + blocks = [] + for match in block_pattern.finditer(topology_str): + block_idx = int(match.group("block_idx")) + size = int(match.group("size")) + uuid = match.group("uuid").strip() + nodes = match.group("nodes").split(",") + block_line = match.group("block_line") + commented = block_line.strip().startswith("#BlockName") + blocks.append({ + "block_idx": block_idx, + "size": size, + "uuid": uuid, + "nodes": nodes, + "commented": commented, + }) + + if not blocks: + return "# No valid blocks found in topology string.\n" + + visualizations = [] + if max_block_size <= 0: + raise ValueError("max_block_size must be greater than 0") + best_rows, best_cols = max_block_size, 1 + min_diff = max_block_size - 1 + for cols in range(1, max_block_size + 1): + if max_block_size % cols == 0: + rows = max_block_size // cols + diff = abs(rows - cols) + if diff < min_diff or (diff == min_diff and cols > best_cols): + best_rows, best_cols = rows, cols + min_diff = diff + if best_rows < best_cols: + best_rows, best_cols = best_cols, best_rows + for block in blocks: + block_idx = block["block_idx"] + size = block["size"] + uuid = block["uuid"] + nodes = block["nodes"] + commented = block["commented"] + header = f"block {block_idx} : # of Nodes = {size}" + uuid_line = f"ClusterUUID + CliqueID : {uuid}" + ineligible_line = "" + if commented: + ineligible_line = ( + f"** This block is ineligible for scheduling because # of nodes < min block size {min_block_size}**" + ) + grid = [] + for r in range(best_rows): + row = [] + for c in range(best_cols): + idx = r * best_cols + c + if idx < len(nodes): + row.append(nodes[idx]) + else: + row.append("X") + grid.append(row) + col_width = max(7, max((len(n) for n in nodes), default=2) + 2) + sep = "|" + "|".join("-" * col_width for _ in range(best_cols)) + "|" + grid_lines = [sep] + for i, row in enumerate(grid): + grid_lines.append( + "|" + "|".join(f"{cell:^{col_width}}" for cell in row) + "|" + ) + grid_lines.append(sep) + visualizations.append( + "\n".join( + filter( + None, + [ + header, + uuid_line, + ineligible_line, + "\n".join(grid_lines), + ], + ) + ) + ) + return "\n\n".join(visualizations) + "\n" + +def visualize_tree(topology_str: str) -> str: + """ + Visualizes a tree topology string as ASCII art. + + Args: + topology_str (str): The topology string (output from write_tree_topology). + Returns: + str: ASCII visualization of the tree topology. 
+ """ + switch_pattern = re.compile( + r"# Number of Nodes in sw(?P\d+): (?P\d+)\n" + r"SwitchName=sw(?P\d+) Nodes=(?P[^\n]+)" + ) + parent_switch_pattern = re.compile( + r"SwitchName=sw(?P\d+) Switches=(?P[^\n]+)" + ) + + switches = [] + for match in switch_pattern.finditer(topology_str): + sw_idx = int(match.group("sw_idx")) + size = int(match.group("size")) + nodes = match.group("nodes").split(",") + switches.append({ + "sw_idx": sw_idx, + "size": size, + "nodes": nodes, + }) + + parent_switch = None + parent_match = parent_switch_pattern.search(topology_str) + if parent_match: + parent_idx = int(parent_match.group("parent_idx")) + children = parent_match.group("children").split(",") + parent_switch = { + "parent_idx": parent_idx, + "children": children, + } + + if not switches: + return "# No valid switches found in topology string.\n" + + lines = [] + def draw_tree(parent, children, switches_dict, prefix=""): + for i, sw in enumerate(children): + is_last = (i == len(children) - 1) + branch = "└── " if is_last else "├── " + sw_idx = int(sw.replace("sw", "")) + sw_info = switches_dict.get(sw_idx) + if sw_info: + lines.append(f"{prefix}{branch}Switch {sw_idx} ({sw_info['size']} nodes)") + node_prefix = " " if is_last else "│ " + for j, node in enumerate(sw_info["nodes"]): + node_branch = "└── " if j == len(sw_info["nodes"]) - 1 else "├── " + lines.append(f"{prefix}{node_prefix}{node_branch}{node}") + else: + lines.append(f"{prefix}{branch}{sw}") + + switches_dict = {sw["sw_idx"]: sw for sw in switches} + + if parent_switch: + lines.append(f"Switch {parent_switch['parent_idx']} (root)") + draw_tree(parent_switch['parent_idx'], parent_switch['children'], switches_dict) + else: + for sw in switches: + lines.append(f"Switch {sw['sw_idx']} ({sw['size']} nodes)") + for j, node in enumerate(sw["nodes"]): + node_branch = "└── " if j == len(sw["nodes"]) - 1 else "├── " + lines.append(f" {node_branch}{node}") + + return "\n".join(lines) + "\n" \ No newline at end of file diff --git a/azure-slurm/test/slurmcc_test/topology_test.py b/azure-slurm/test/slurmcc_test/topology_test.py index d6e1faa6..58dd3f49 100644 --- a/azure-slurm/test/slurmcc_test/topology_test.py +++ b/azure-slurm/test/slurmcc_test/topology_test.py @@ -7,6 +7,7 @@ from slurmcc import util as slutil from pathlib import Path import pytest +import re TESTDIR="test/slurmcc_test/topology_test_output" @@ -363,6 +364,103 @@ def fake_run_get_rack_id_command_1(*args, **kwargs): assert content==actual assert len(racks)==4 +def test_visualize_block_topology(): + slutil.srun=run_parallel_cmd + slutil.run=run_command + test_obj = Topology("hpc",None,TopologyInput.NVLINK,TopologyType.BLOCK,TESTDIR) + def fake_run_get_rack_id_command_1(*args, **kwargs): + with open('test/slurmcc_test/topology_test_input/nodes_clusterUUIDs.txt', 'r', encoding='utf-8') as f: + stdout = f.read() + return ParallelOutputContainer(stdout, "", 0) + test_obj._run_get_rack_id_command = fake_run_get_rack_id_command_1 + content=test_obj.run() + vis = test_obj.visualize(content,18) + print(vis) + with open('test/slurmcc_test/topology_test_output/block_topo_vis.txt', 'w', encoding='utf-8') as f: + f.write(vis) + with open('test/slurmcc_test/topology_test_output/block_topo_vis.txt','r', encoding='utf-8') as file: + result= file.read() + with open('test/slurmcc_test/topology_test_input/block_topo_vis.txt','r', encoding='utf-8') as file: + actual= file.read() + assert result==actual + +def test_bfl_visualize_block_topology(): + test_obj = 
Topology("hpc",None,TopologyInput.NVLINK,TopologyType.BLOCK,TESTDIR) + content = ( + "# Number of Nodes in block1: 8\n" + "# ClusterUUID and CliqueID: 1db72bdb-8004-4407-a49a-a9a935a80529 32766\n" + "BlockName=block1 Nodes=ccw1-gpu-8,ccw1-gpu-22,ccw1-gpu-82,ccw1-gpu-179,ccw1-gpu-97,ccw1-gpu-146,ccw1-gpu-45,ccw1-gpu-112\n" + "# Number of Nodes in block2: 15\n" + "# ClusterUUID and CliqueID: 18f2ff51-040a-486c-90ce-138067a7379a 32766\n" + "BlockName=block2 Nodes=ccw1-gpu-247,ccw1-gpu-134,ccw1-gpu-17,ccw1-gpu-33,ccw1-gpu-110,ccw1-gpu-51,ccw1-gpu-240,ccw1-gpu-154,ccw1-gpu-186,ccw1-gpu-198,ccw1-gpu-95,ccw1-gpu-210,ccw1-gpu-169,ccw1-gpu-225,ccw1-gpu-227\n" + "# Number of Nodes in block3: 18\n" + "# ClusterUUID and CliqueID: f9eeccb9-69ff-4a55-95e8-c00f14b83c38 32766\n" + "BlockName=block3 Nodes=ccw1-gpu-1,ccw1-gpu-9,ccw1-gpu-242,ccw1-gpu-79,ccw1-gpu-209,ccw1-gpu-248,ccw1-gpu-128,ccw1-gpu-172,ccw1-gpu-218,ccw1-gpu-109,ccw1-gpu-76,ccw1-gpu-32,ccw1-gpu-61,ccw1-gpu-203,ccw1-gpu-138,ccw1-gpu-238,ccw1-gpu-163,ccw1-gpu-226\n" + "# Number of Nodes in block4: 17\n" + "# ClusterUUID and CliqueID: 50c636e0-94d8-4df8-97c5-515a399e7f60 32766\n" + "BlockName=block4 Nodes=ccw1-gpu-164,ccw1-gpu-73,ccw1-gpu-195,ccw1-gpu-207,ccw1-gpu-245,ccw1-gpu-129,ccw1-gpu-10,ccw1-gpu-144,ccw1-gpu-125,ccw1-gpu-221,ccw1-gpu-235,ccw1-gpu-3,ccw1-gpu-104,ccw1-gpu-215,ccw1-gpu-229,ccw1-gpu-53,ccw1-gpu-80\n" + "# Number of Nodes in block5: 13\n" + "# ClusterUUID and CliqueID: 475636da-3c86-4f45-bc03-7623e8ef7085 32766\n" + "BlockName=block5 Nodes=ccw1-gpu-31,ccw1-gpu-184,ccw1-gpu-40,ccw1-gpu-166,ccw1-gpu-156,ccw1-gpu-65,ccw1-gpu-5,ccw1-gpu-13,ccw1-gpu-87,ccw1-gpu-78,ccw1-gpu-75,ccw1-gpu-108,ccw1-gpu-136\n" + "# Number of Nodes in block6: 16\n" + "# ClusterUUID and CliqueID: 87e6be3b-cd1a-46cd-ae93-7ea1629a165f 32766\n" + "BlockName=block6 Nodes=ccw1-gpu-91,ccw1-gpu-74,ccw1-gpu-135,ccw1-gpu-34,ccw1-gpu-193,ccw1-gpu-116,ccw1-gpu-63,ccw1-gpu-160,ccw1-gpu-29,ccw1-gpu-202,ccw1-gpu-130,ccw1-gpu-126,ccw1-gpu-48,ccw1-gpu-165,ccw1-gpu-208,ccw1-gpu-211\n" + "# Number of Nodes in block7: 16\n" + "# ClusterUUID and CliqueID: cc79d754-915f-408b-b1c3-b8c3aa6668ab 32766\n" + "BlockName=block7 Nodes=ccw1-gpu-237,ccw1-gpu-182,ccw1-gpu-217,ccw1-gpu-90,ccw1-gpu-223,ccw1-gpu-201,ccw1-gpu-159,ccw1-gpu-243,ccw1-gpu-139,ccw1-gpu-192,ccw1-gpu-64,ccw1-gpu-120,ccw1-gpu-2,ccw1-gpu-231,ccw1-gpu-41,ccw1-gpu-47\n" + "# Number of Nodes in block8: 17\n" + "# ClusterUUID and CliqueID: 19ce407e-11b7-4236-97f7-0f9a5b62692b 32766\n" + "BlockName=block8 Nodes=ccw1-gpu-49,ccw1-gpu-77,ccw1-gpu-157,ccw1-gpu-137,ccw1-gpu-30,ccw1-gpu-102,ccw1-gpu-142,ccw1-gpu-181,ccw1-gpu-4,ccw1-gpu-71,ccw1-gpu-12,ccw1-gpu-88,ccw1-gpu-162,ccw1-gpu-20,ccw1-gpu-117,ccw1-gpu-98,ccw1-gpu-57\n" + "# Number of Nodes in block9: 17\n" + "# ClusterUUID and CliqueID: 809d6b94-4c14-40b4-8eee-b71539be6baf 32766\n" + "BlockName=block9 Nodes=ccw1-gpu-19,ccw1-gpu-212,ccw1-gpu-224,ccw1-gpu-241,ccw1-gpu-197,ccw1-gpu-46,ccw1-gpu-105,ccw1-gpu-23,ccw1-gpu-222,ccw1-gpu-188,ccw1-gpu-249,ccw1-gpu-133,ccw1-gpu-56,ccw1-gpu-89,ccw1-gpu-94,ccw1-gpu-234,ccw1-gpu-153\n" + "# Number of Nodes in block10: 14\n" + "# ClusterUUID and CliqueID: 5110b53c-8898-4b07-bb98-9c7159894d5a 32766\n" + "BlockName=block10 Nodes=ccw1-gpu-15,ccw1-gpu-18,ccw1-gpu-14,ccw1-gpu-7,ccw1-gpu-22,ccw1-gpu-11,ccw1-gpu-16,ccw1-gpu-21,ccw1-gpu-27,ccw1-gpu-28,ccw1-gpu-24,ccw1-gpu-26,ccw1-gpu-25,ccw1-gpu-39\n" + "# Number of Nodes in block11: 16\n" + "# ClusterUUID and CliqueID: ad2669de-e798-494b-b2fb-ab9445b17e75 32766\n" + "BlockName=block11 
Nodes=ccw1-gpu-250,ccw1-gpu-244,ccw1-gpu-50,ccw1-gpu-72,ccw1-gpu-143,ccw1-gpu-84,ccw1-gpu-114,ccw1-gpu-194,ccw1-gpu-220,ccw1-gpu-206,ccw1-gpu-158,ccw1-gpu-28,ccw1-gpu-185,ccw1-gpu-232,ccw1-gpu-228,ccw1-gpu-176\n" + "# Number of Nodes in block12: 16\n" + "# ClusterUUID and CliqueID: 8a2e3316-46f1-4772-882f-a53e2448892b 32766\n" + "BlockName=block12 Nodes=ccw1-gpu-36,ccw1-gpu-55,ccw1-gpu-27,ccw1-gpu-151,ccw1-gpu-122,ccw1-gpu-118,ccw1-gpu-18,ccw1-gpu-141,ccw1-gpu-103,ccw1-gpu-70,ccw1-gpu-131,ccw1-gpu-60,ccw1-gpu-175,ccw1-gpu-92,ccw1-gpu-161,ccw1-gpu-171\n" + "# Number of Nodes in block13: 16\n" + "# ClusterUUID and CliqueID: a3b58aab-21bc-4a89-8eaf-cf7ce06c1609 32766\n" + "BlockName=block13 Nodes=ccw1-gpu-85,ccw1-gpu-149,ccw1-gpu-239,ccw1-gpu-38,ccw1-gpu-204,ccw1-gpu-58,ccw1-gpu-213,ccw1-gpu-67,ccw1-gpu-100,ccw1-gpu-173,ccw1-gpu-219,ccw1-gpu-24,ccw1-gpu-177,ccw1-gpu-189,ccw1-gpu-115,ccw1-gpu-233\n" + "# Number of Nodes in block14: 18\n" + "# ClusterUUID and CliqueID: b78ed242-7b98-426f-b194-b76b8899f4ec 32766\n" + "BlockName=block14 Nodes=ccw1-gpu-123,ccw1-gpu-132,ccw1-gpu-69,ccw1-gpu-83,ccw1-gpu-15,ccw1-gpu-26,ccw1-gpu-147,ccw1-gpu-93,ccw1-gpu-52,ccw1-gpu-107,ccw1-gpu-113,ccw1-gpu-152,ccw1-gpu-62,ccw1-gpu-44,ccw1-gpu-7,ccw1-gpu-127,ccw1-gpu-35,ccw1-gpu-167\n" + "# Number of Nodes in block15: 16\n" + "# ClusterUUID and CliqueID: acd0098d-5416-4fe2-b3b9-edebcd0d9374 32766\n" + "BlockName=block15 Nodes=ccw1-gpu-246,ccw1-gpu-14,ccw1-gpu-37,ccw1-gpu-214,ccw1-gpu-236,ccw1-gpu-119,ccw1-gpu-200,ccw1-gpu-66,ccw1-gpu-191,ccw1-gpu-148,ccw1-gpu-124,ccw1-gpu-42,ccw1-gpu-99,ccw1-gpu-216,ccw1-gpu-168,ccw1-gpu-230\n" + "# Number of Nodes in block16: 16\n" + "# ClusterUUID and CliqueID: ef2f7639-552c-437e-bcee-56af9d562415 32766\n" + "BlockName=block16 Nodes=ccw1-gpu-25,ccw1-gpu-205,ccw1-gpu-170,ccw1-gpu-11,ccw1-gpu-39,ccw1-gpu-196,ccw1-gpu-86,ccw1-gpu-6,ccw1-gpu-111,ccw1-gpu-187,ccw1-gpu-155,ccw1-gpu-145,ccw1-gpu-183,ccw1-gpu-121,ccw1-gpu-54,ccw1-gpu-59\n" + "BlockSizes=1\n" + + ) + vis = test_obj.visualize(content,18) + with open('test/slurmcc_test/topology_test_output/bfl_block_topo_vis.txt', 'w', encoding='utf-8') as f: + f.write(vis) + with open('test/slurmcc_test/topology_test_output/bfl_block_topo_vis.txt','r', encoding='utf-8') as file: + result= file.read() + with open('test/slurmcc_test/topology_test_input/bfl_block_topo_vis.txt','r', encoding='utf-8') as file: + actual= file.read() + assert result==actual + +def test_visualize_tree_topology(): + slutil.srun=run_parallel_cmd + slutil.run=run_command + test_obj = Topology("hpc",None,TopologyInput.FABRIC,TopologyType.TREE,TESTDIR) + content=test_obj.run() + vis = test_obj.visualize(content,18) + with open('test/slurmcc_test/topology_test_output/tree_topo_vis.txt', 'w', encoding='utf-8') as f: + f.write(vis) + with open('test/slurmcc_test/topology_test_output/tree_topo_vis.txt','r', encoding='utf-8') as file: + result= file.read() + with open('test/slurmcc_test/topology_test_input/tree_topo_vis.txt','r', encoding='utf-8') as file: + actual= file.read() + assert result==actual + def test_group_hosts_per_rack_single_rack(monkeypatch): """ Test group_hosts_per_rack when all hosts belong to the same rack. 
@@ -616,3 +714,264 @@ def test_run_nvlink_multiple_racks(monkeypatch): test_obj.group_hosts_per_rack = lambda: racks result = test_obj.run_nvlink() assert result == racks + +@pytest.mark.parametrize( + "topo_type,topology_str,max_block_size,expected_lines,block_size", + [ + # BLOCK: one block, not enough nodes, commented + ( + TopologyType.BLOCK, + "# Number of Nodes in block1: 1\n" + "# ClusterUUID and CliqueID: rackA\n" + "# Warning: Block 1 has less than 2 nodes, commenting out\n" + "#BlockName=block1 Nodes=node1\n" + "BlockSizes=2\n", + 2, + [ + "block 1 : # of Nodes = 1", + "ClusterUUID + CliqueID : rackA", + "** This block is ineligible for scheduling because # of nodes < min block size 2**", + "|-------|", + "| node1 |", + "|-------|", + "| X |", + "|-------|", + ], + 2 + ), + ( + TopologyType.BLOCK, + "# Number of Nodes in block1: 6\n" + "# ClusterUUID and CliqueID: N/A N/A\n" + "# Warning: Block 1 has unknown ClusterUUID and CliqueID\n" + "# Warning: Block 1 has less than 8 nodes, commenting out\n" + "#BlockName=block1 Nodes=vis0603-gpu-4,vis0603-gpu-6,vis0603-gpu-3,vis0603-gpu-5,vis0603-gpu-1,vis0603-gpu-2\n" + "BlockSizes=8", + 18, + [ + "block 1 : # of Nodes = 6", + "ClusterUUID + CliqueID : N/A N/A", + "** This block is ineligible for scheduling because # of nodes < min block size 8**", + "|---------------|---------------|---------------|", + "| vis0603-gpu-4 | vis0603-gpu-6 | vis0603-gpu-3 |", + "|---------------|---------------|---------------|", + "| vis0603-gpu-5 | vis0603-gpu-1 | vis0603-gpu-2 |", + "|---------------|---------------|---------------|", + "| X | X | X |", + "|---------------|---------------|---------------|", + "| X | X | X |", + "|---------------|---------------|---------------|", + "| X | X | X |", + "|---------------|---------------|---------------|", + "| X | X | X |" + ], + 8 + ), + # BLOCK: no valid blocks + ( + TopologyType.BLOCK, + "", + 2, + [ + "# No valid blocks found in topology string.", + ], + 2 + ), + ( + TopologyType.BLOCK, + ( + "# Number of Nodes in block1: 4\n" + "# ClusterUUID and CliqueID: rackA\n" + "BlockName=block1 Nodes=node1,node2,node3,node4\n" + "BlockSizes=4\n" + ), + 4, + [ + "block 1 : # of Nodes = 4", + "ClusterUUID + CliqueID : rackA", + "|-------|-------|", + "| node1 | node2 |", + "|-------|-------|", + "| node3 | node4 |", + "|-------|-------|", + ], + 2 + ), + ( + TopologyType.BLOCK, + ( + "# Number of Nodes in block1: 2\n" + "# ClusterUUID and CliqueID: rackB\n" + "BlockName=block1 Nodes=nodeA,nodeB\n" + "BlockSizes=2\n" + ), + 2, + [ + "block 1 : # of Nodes = 2", + "ClusterUUID + CliqueID : rackB", + "|-------|", + "| nodeA |", + "|-------|", + "| nodeB |", + "|-------|", + ], + 2 + ), + ( + TopologyType.BLOCK, + ( + "# Number of Nodes in block1: 1\n" + "# ClusterUUID and CliqueID: rackC\n" + "BlockName=block1 Nodes=host1\n" + "BlockSizes=3\n" + ), + 3, + [ + "block 1 : # of Nodes = 1", + "ClusterUUID + CliqueID : rackC", + "|-------|", + "| host1 |", + "|-------|", + "| X |", + "|-------|", + "| X |", + "|-------|", + ], + 2 + ), + ( + TopologyType.BLOCK, + ( + "# Number of Nodes in block1: 0\n" + "# ClusterUUID and CliqueID: rackD\n" + "BlockName=block1 Nodes=\n" + "BlockSizes=2\n" + ), + 2, + [ + "# No valid blocks found in topology string." 
+ ], + 2 + ), + ( + TopologyType.BLOCK, + ( + "# Number of Nodes in block1: 3\n" + "# ClusterUUID and CliqueID: rackE\n" + "BlockName=block1 Nodes=node1,node2,node3\n" + "BlockSizes=2\n" + ), + 5, + [ + "block 1 : # of Nodes = 3", + "ClusterUUID + CliqueID : rackE", + "|-------|", + "| node1 |", + "|-------|", + "| node2 |", + "|-------|", + "| node3 |", + "|-------|", + "| X |", + "|-------|", + "| X |", + "|-------|" + ], + 2 + ), + # TREE: two switches, one parent + ( + TopologyType.TREE, + "# Number of Nodes in sw01: 2\n" + "SwitchName=sw01 Nodes=node1,node2\n" + "# Number of Nodes in sw02: 1\n" + "SwitchName=sw02 Nodes=node3\n" + "SwitchName=sw03 Switches=sw01,sw02\n", + 2, + [ + "Switch 3 (root)", + "├── Switch 1 (2 nodes)", + "│ ├── node1", + "│ └── node2", + "└── Switch 2 (1 nodes)", + " └── node3", + ], + 2 + ), + # TREE: no valid switches + ( + TopologyType.TREE, + "", + 2, + [ + "# No valid switches found in topology string.", + ], + 2 + ), + + ] +) +def test_visualize_various_cases(topo_type, topology_str, max_block_size, expected_lines,block_size): + test_obj = Topology("hpc", None, TopologyInput.FABRIC, topo_type, "testdir", block_size) + result = test_obj.visualize(topology_str, max_block_size) + print(result) + for line in expected_lines: + assert line in result + +def test_visualize_block_grid_shape(): + # Test grid shape for block size 6 (should be 3x2 or 2x3) + topology_str = ( + "# Number of Nodes in block1: 6\n" + "# ClusterUUID and CliqueID: rackA\n" + "BlockName=block1 Nodes=n1,n2,n3,n4,n5,n6\n" + "BlockSizes=6\n" + ) + test_obj = Topology("hpc", None, TopologyInput.NVLINK, TopologyType.BLOCK, "testdir", block_size=6) + vis = test_obj.visualize(topology_str, 6) + # Should have 3 rows and 2 columns or 2 rows and 3 columns + assert vis.count("|") > 0 + assert "n1" in vis and "n6" in vis + +def test_visualize_block_invalid_max_block_size(): + topology_str = ( + "# Number of Nodes in block1: 2\n" + "# ClusterUUID and CliqueID: rackA\n" + "BlockName=block1 Nodes=node1,node2\n" + "BlockSizes=2\n" + ) + test_obj = Topology("hpc", None, TopologyInput.NVLINK, TopologyType.BLOCK, "testdir", block_size=2) + with pytest.raises(ValueError): + test_obj.visualize(topology_str, 0) + +def test_visualize_tree_no_parent_switch(): + topology_str = ( + "# Number of Nodes in sw01: 2\n" + "SwitchName=sw01 Nodes=node1,node2\n" + "# Number of Nodes in sw02: 1\n" + "SwitchName=sw02 Nodes=node3\n" + ) + test_obj = Topology("hpc", None, TopologyInput.FABRIC, TopologyType.TREE, "testdir") + vis = test_obj.visualize(topology_str, 2) + assert "Switch 1 (2 nodes)" in vis + assert "Switch 2 (1 nodes)" in vis + assert "node1" in vis and "node2" in vis and "node3" in vis + +def test_visualize_tree_with_parent_switch(): + topology_str = ( + "# Number of Nodes in sw01: 2\n" + "SwitchName=sw01 Nodes=node1,node2\n" + "# Number of Nodes in sw02: 1\n" + "SwitchName=sw02 Nodes=node3\n" + "SwitchName=sw03 Switches=sw01,sw02\n" + ) + test_obj = Topology("hpc", None, TopologyInput.FABRIC, TopologyType.TREE, "testdir") + vis = test_obj.visualize(topology_str, 2) + assert "Switch 3 (root)" in vis + assert "Switch 1 (2 nodes)" in vis + assert "Switch 2 (1 nodes)" in vis + assert "node1" in vis and "node2" in vis and "node3" in vis + + + + + diff --git a/azure-slurm/test/slurmcc_test/topology_test_input/bfl_block_topo_vis.txt b/azure-slurm/test/slurmcc_test/topology_test_input/bfl_block_topo_vis.txt new file mode 100644 index 00000000..73e137f9 --- /dev/null +++ 
b/azure-slurm/test/slurmcc_test/topology_test_input/bfl_block_topo_vis.txt @@ -0,0 +1,255 @@ +block 1 : # of Nodes = 8 +ClusterUUID + CliqueID : 1db72bdb-8004-4407-a49a-a9a935a80529 32766 +|--------------|--------------|--------------| +| ccw1-gpu-8 | ccw1-gpu-22 | ccw1-gpu-82 | +|--------------|--------------|--------------| +| ccw1-gpu-179 | ccw1-gpu-97 | ccw1-gpu-146 | +|--------------|--------------|--------------| +| ccw1-gpu-45 | ccw1-gpu-112 | X | +|--------------|--------------|--------------| +| X | X | X | +|--------------|--------------|--------------| +| X | X | X | +|--------------|--------------|--------------| +| X | X | X | +|--------------|--------------|--------------| + +block 2 : # of Nodes = 15 +ClusterUUID + CliqueID : 18f2ff51-040a-486c-90ce-138067a7379a 32766 +|--------------|--------------|--------------| +| ccw1-gpu-247 | ccw1-gpu-134 | ccw1-gpu-17 | +|--------------|--------------|--------------| +| ccw1-gpu-33 | ccw1-gpu-110 | ccw1-gpu-51 | +|--------------|--------------|--------------| +| ccw1-gpu-240 | ccw1-gpu-154 | ccw1-gpu-186 | +|--------------|--------------|--------------| +| ccw1-gpu-198 | ccw1-gpu-95 | ccw1-gpu-210 | +|--------------|--------------|--------------| +| ccw1-gpu-169 | ccw1-gpu-225 | ccw1-gpu-227 | +|--------------|--------------|--------------| +| X | X | X | +|--------------|--------------|--------------| + +block 3 : # of Nodes = 18 +ClusterUUID + CliqueID : f9eeccb9-69ff-4a55-95e8-c00f14b83c38 32766 +|--------------|--------------|--------------| +| ccw1-gpu-1 | ccw1-gpu-9 | ccw1-gpu-242 | +|--------------|--------------|--------------| +| ccw1-gpu-79 | ccw1-gpu-209 | ccw1-gpu-248 | +|--------------|--------------|--------------| +| ccw1-gpu-128 | ccw1-gpu-172 | ccw1-gpu-218 | +|--------------|--------------|--------------| +| ccw1-gpu-109 | ccw1-gpu-76 | ccw1-gpu-32 | +|--------------|--------------|--------------| +| ccw1-gpu-61 | ccw1-gpu-203 | ccw1-gpu-138 | +|--------------|--------------|--------------| +| ccw1-gpu-238 | ccw1-gpu-163 | ccw1-gpu-226 | +|--------------|--------------|--------------| + +block 4 : # of Nodes = 17 +ClusterUUID + CliqueID : 50c636e0-94d8-4df8-97c5-515a399e7f60 32766 +|--------------|--------------|--------------| +| ccw1-gpu-164 | ccw1-gpu-73 | ccw1-gpu-195 | +|--------------|--------------|--------------| +| ccw1-gpu-207 | ccw1-gpu-245 | ccw1-gpu-129 | +|--------------|--------------|--------------| +| ccw1-gpu-10 | ccw1-gpu-144 | ccw1-gpu-125 | +|--------------|--------------|--------------| +| ccw1-gpu-221 | ccw1-gpu-235 | ccw1-gpu-3 | +|--------------|--------------|--------------| +| ccw1-gpu-104 | ccw1-gpu-215 | ccw1-gpu-229 | +|--------------|--------------|--------------| +| ccw1-gpu-53 | ccw1-gpu-80 | X | +|--------------|--------------|--------------| + +block 5 : # of Nodes = 13 +ClusterUUID + CliqueID : 475636da-3c86-4f45-bc03-7623e8ef7085 32766 +|--------------|--------------|--------------| +| ccw1-gpu-31 | ccw1-gpu-184 | ccw1-gpu-40 | +|--------------|--------------|--------------| +| ccw1-gpu-166 | ccw1-gpu-156 | ccw1-gpu-65 | +|--------------|--------------|--------------| +| ccw1-gpu-5 | ccw1-gpu-13 | ccw1-gpu-87 | +|--------------|--------------|--------------| +| ccw1-gpu-78 | ccw1-gpu-75 | ccw1-gpu-108 | +|--------------|--------------|--------------| +| ccw1-gpu-136 | X | X | +|--------------|--------------|--------------| +| X | X | X | +|--------------|--------------|--------------| + +block 6 : # of Nodes = 16 +ClusterUUID + CliqueID : 87e6be3b-cd1a-46cd-ae93-7ea1629a165f 
32766 +|--------------|--------------|--------------| +| ccw1-gpu-91 | ccw1-gpu-74 | ccw1-gpu-135 | +|--------------|--------------|--------------| +| ccw1-gpu-34 | ccw1-gpu-193 | ccw1-gpu-116 | +|--------------|--------------|--------------| +| ccw1-gpu-63 | ccw1-gpu-160 | ccw1-gpu-29 | +|--------------|--------------|--------------| +| ccw1-gpu-202 | ccw1-gpu-130 | ccw1-gpu-126 | +|--------------|--------------|--------------| +| ccw1-gpu-48 | ccw1-gpu-165 | ccw1-gpu-208 | +|--------------|--------------|--------------| +| ccw1-gpu-211 | X | X | +|--------------|--------------|--------------| + +block 7 : # of Nodes = 16 +ClusterUUID + CliqueID : cc79d754-915f-408b-b1c3-b8c3aa6668ab 32766 +|--------------|--------------|--------------| +| ccw1-gpu-237 | ccw1-gpu-182 | ccw1-gpu-217 | +|--------------|--------------|--------------| +| ccw1-gpu-90 | ccw1-gpu-223 | ccw1-gpu-201 | +|--------------|--------------|--------------| +| ccw1-gpu-159 | ccw1-gpu-243 | ccw1-gpu-139 | +|--------------|--------------|--------------| +| ccw1-gpu-192 | ccw1-gpu-64 | ccw1-gpu-120 | +|--------------|--------------|--------------| +| ccw1-gpu-2 | ccw1-gpu-231 | ccw1-gpu-41 | +|--------------|--------------|--------------| +| ccw1-gpu-47 | X | X | +|--------------|--------------|--------------| + +block 8 : # of Nodes = 17 +ClusterUUID + CliqueID : 19ce407e-11b7-4236-97f7-0f9a5b62692b 32766 +|--------------|--------------|--------------| +| ccw1-gpu-49 | ccw1-gpu-77 | ccw1-gpu-157 | +|--------------|--------------|--------------| +| ccw1-gpu-137 | ccw1-gpu-30 | ccw1-gpu-102 | +|--------------|--------------|--------------| +| ccw1-gpu-142 | ccw1-gpu-181 | ccw1-gpu-4 | +|--------------|--------------|--------------| +| ccw1-gpu-71 | ccw1-gpu-12 | ccw1-gpu-88 | +|--------------|--------------|--------------| +| ccw1-gpu-162 | ccw1-gpu-20 | ccw1-gpu-117 | +|--------------|--------------|--------------| +| ccw1-gpu-98 | ccw1-gpu-57 | X | +|--------------|--------------|--------------| + +block 9 : # of Nodes = 17 +ClusterUUID + CliqueID : 809d6b94-4c14-40b4-8eee-b71539be6baf 32766 +|--------------|--------------|--------------| +| ccw1-gpu-19 | ccw1-gpu-212 | ccw1-gpu-224 | +|--------------|--------------|--------------| +| ccw1-gpu-241 | ccw1-gpu-197 | ccw1-gpu-46 | +|--------------|--------------|--------------| +| ccw1-gpu-105 | ccw1-gpu-23 | ccw1-gpu-222 | +|--------------|--------------|--------------| +| ccw1-gpu-188 | ccw1-gpu-249 | ccw1-gpu-133 | +|--------------|--------------|--------------| +| ccw1-gpu-56 | ccw1-gpu-89 | ccw1-gpu-94 | +|--------------|--------------|--------------| +| ccw1-gpu-234 | ccw1-gpu-153 | X | +|--------------|--------------|--------------| + +block 10 : # of Nodes = 14 +ClusterUUID + CliqueID : 5110b53c-8898-4b07-bb98-9c7159894d5a 32766 +|-------------|-------------|-------------| +| ccw1-gpu-15 | ccw1-gpu-18 | ccw1-gpu-14 | +|-------------|-------------|-------------| +| ccw1-gpu-7 | ccw1-gpu-22 | ccw1-gpu-11 | +|-------------|-------------|-------------| +| ccw1-gpu-16 | ccw1-gpu-21 | ccw1-gpu-27 | +|-------------|-------------|-------------| +| ccw1-gpu-28 | ccw1-gpu-24 | ccw1-gpu-26 | +|-------------|-------------|-------------| +| ccw1-gpu-25 | ccw1-gpu-39 | X | +|-------------|-------------|-------------| +| X | X | X | +|-------------|-------------|-------------| + +block 11 : # of Nodes = 16 +ClusterUUID + CliqueID : ad2669de-e798-494b-b2fb-ab9445b17e75 32766 +|--------------|--------------|--------------| +| ccw1-gpu-250 | ccw1-gpu-244 | ccw1-gpu-50 | 
+|--------------|--------------|--------------| +| ccw1-gpu-72 | ccw1-gpu-143 | ccw1-gpu-84 | +|--------------|--------------|--------------| +| ccw1-gpu-114 | ccw1-gpu-194 | ccw1-gpu-220 | +|--------------|--------------|--------------| +| ccw1-gpu-206 | ccw1-gpu-158 | ccw1-gpu-28 | +|--------------|--------------|--------------| +| ccw1-gpu-185 | ccw1-gpu-232 | ccw1-gpu-228 | +|--------------|--------------|--------------| +| ccw1-gpu-176 | X | X | +|--------------|--------------|--------------| + +block 12 : # of Nodes = 16 +ClusterUUID + CliqueID : 8a2e3316-46f1-4772-882f-a53e2448892b 32766 +|--------------|--------------|--------------| +| ccw1-gpu-36 | ccw1-gpu-55 | ccw1-gpu-27 | +|--------------|--------------|--------------| +| ccw1-gpu-151 | ccw1-gpu-122 | ccw1-gpu-118 | +|--------------|--------------|--------------| +| ccw1-gpu-18 | ccw1-gpu-141 | ccw1-gpu-103 | +|--------------|--------------|--------------| +| ccw1-gpu-70 | ccw1-gpu-131 | ccw1-gpu-60 | +|--------------|--------------|--------------| +| ccw1-gpu-175 | ccw1-gpu-92 | ccw1-gpu-161 | +|--------------|--------------|--------------| +| ccw1-gpu-171 | X | X | +|--------------|--------------|--------------| + +block 13 : # of Nodes = 16 +ClusterUUID + CliqueID : a3b58aab-21bc-4a89-8eaf-cf7ce06c1609 32766 +|--------------|--------------|--------------| +| ccw1-gpu-85 | ccw1-gpu-149 | ccw1-gpu-239 | +|--------------|--------------|--------------| +| ccw1-gpu-38 | ccw1-gpu-204 | ccw1-gpu-58 | +|--------------|--------------|--------------| +| ccw1-gpu-213 | ccw1-gpu-67 | ccw1-gpu-100 | +|--------------|--------------|--------------| +| ccw1-gpu-173 | ccw1-gpu-219 | ccw1-gpu-24 | +|--------------|--------------|--------------| +| ccw1-gpu-177 | ccw1-gpu-189 | ccw1-gpu-115 | +|--------------|--------------|--------------| +| ccw1-gpu-233 | X | X | +|--------------|--------------|--------------| + +block 14 : # of Nodes = 18 +ClusterUUID + CliqueID : b78ed242-7b98-426f-b194-b76b8899f4ec 32766 +|--------------|--------------|--------------| +| ccw1-gpu-123 | ccw1-gpu-132 | ccw1-gpu-69 | +|--------------|--------------|--------------| +| ccw1-gpu-83 | ccw1-gpu-15 | ccw1-gpu-26 | +|--------------|--------------|--------------| +| ccw1-gpu-147 | ccw1-gpu-93 | ccw1-gpu-52 | +|--------------|--------------|--------------| +| ccw1-gpu-107 | ccw1-gpu-113 | ccw1-gpu-152 | +|--------------|--------------|--------------| +| ccw1-gpu-62 | ccw1-gpu-44 | ccw1-gpu-7 | +|--------------|--------------|--------------| +| ccw1-gpu-127 | ccw1-gpu-35 | ccw1-gpu-167 | +|--------------|--------------|--------------| + +block 15 : # of Nodes = 16 +ClusterUUID + CliqueID : acd0098d-5416-4fe2-b3b9-edebcd0d9374 32766 +|--------------|--------------|--------------| +| ccw1-gpu-246 | ccw1-gpu-14 | ccw1-gpu-37 | +|--------------|--------------|--------------| +| ccw1-gpu-214 | ccw1-gpu-236 | ccw1-gpu-119 | +|--------------|--------------|--------------| +| ccw1-gpu-200 | ccw1-gpu-66 | ccw1-gpu-191 | +|--------------|--------------|--------------| +| ccw1-gpu-148 | ccw1-gpu-124 | ccw1-gpu-42 | +|--------------|--------------|--------------| +| ccw1-gpu-99 | ccw1-gpu-216 | ccw1-gpu-168 | +|--------------|--------------|--------------| +| ccw1-gpu-230 | X | X | +|--------------|--------------|--------------| + +block 16 : # of Nodes = 16 +ClusterUUID + CliqueID : ef2f7639-552c-437e-bcee-56af9d562415 32766 +|--------------|--------------|--------------| +| ccw1-gpu-25 | ccw1-gpu-205 | ccw1-gpu-170 | +|--------------|--------------|--------------| +| 
ccw1-gpu-11 | ccw1-gpu-39 | ccw1-gpu-196 | +|--------------|--------------|--------------| +| ccw1-gpu-86 | ccw1-gpu-6 | ccw1-gpu-111 | +|--------------|--------------|--------------| +| ccw1-gpu-187 | ccw1-gpu-155 | ccw1-gpu-145 | +|--------------|--------------|--------------| +| ccw1-gpu-183 | ccw1-gpu-121 | ccw1-gpu-54 | +|--------------|--------------|--------------| +| ccw1-gpu-59 | X | X | +|--------------|--------------|--------------| diff --git a/azure-slurm/test/slurmcc_test/topology_test_input/block_topo_vis.txt b/azure-slurm/test/slurmcc_test/topology_test_input/block_topo_vis.txt new file mode 100644 index 00000000..756ee051 --- /dev/null +++ b/azure-slurm/test/slurmcc_test/topology_test_input/block_topo_vis.txt @@ -0,0 +1,63 @@ +block 1 : # of Nodes = 18 +ClusterUUID + CliqueID : 5e797fc6-0f46-421a-8724-0e102c0f723c +|-----------------|-----------------|-----------------| +| hpcbench-hpc-1 | hpcbench-hpc-5 | hpcbench-hpc-9 | +|-----------------|-----------------|-----------------| +| hpcbench-hpc-13 | hpcbench-hpc-17 | hpcbench-hpc-21 | +|-----------------|-----------------|-----------------| +| hpcbench-hpc-25 | hpcbench-hpc-29 | hpcbench-hpc-33 | +|-----------------|-----------------|-----------------| +| hpcbench-hpc-37 | hpcbench-hpc-41 | hpcbench-hpc-45 | +|-----------------|-----------------|-----------------| +| hpcbench-hpc-49 | hpcbench-hpc-53 | hpcbench-hpc-57 | +|-----------------|-----------------|-----------------| +| hpcbench-hpc-61 | hpcbench-hpc-65 | hpcbench-hpc-69 | +|-----------------|-----------------|-----------------| + +block 2 : # of Nodes = 18 +ClusterUUID + CliqueID : 5e797fc6-0f46-421a-8724-0e102c0f721e +|-----------------|-----------------|-----------------| +| hpcbench-hpc-2 | hpcbench-hpc-6 | hpcbench-hpc-10 | +|-----------------|-----------------|-----------------| +| hpcbench-hpc-14 | hpcbench-hpc-18 | hpcbench-hpc-22 | +|-----------------|-----------------|-----------------| +| hpcbench-hpc-26 | hpcbench-hpc-30 | hpcbench-hpc-34 | +|-----------------|-----------------|-----------------| +| hpcbench-hpc-38 | hpcbench-hpc-42 | hpcbench-hpc-46 | +|-----------------|-----------------|-----------------| +| hpcbench-hpc-50 | hpcbench-hpc-54 | hpcbench-hpc-58 | +|-----------------|-----------------|-----------------| +| hpcbench-hpc-62 | hpcbench-hpc-66 | hpcbench-hpc-70 | +|-----------------|-----------------|-----------------| + +block 3 : # of Nodes = 18 +ClusterUUID + CliqueID : 5e797fc6-0f46-421a-8724-0e102c0f724b +|-----------------|-----------------|-----------------| +| hpcbench-hpc-3 | hpcbench-hpc-7 | hpcbench-hpc-11 | +|-----------------|-----------------|-----------------| +| hpcbench-hpc-15 | hpcbench-hpc-19 | hpcbench-hpc-23 | +|-----------------|-----------------|-----------------| +| hpcbench-hpc-27 | hpcbench-hpc-31 | hpcbench-hpc-35 | +|-----------------|-----------------|-----------------| +| hpcbench-hpc-39 | hpcbench-hpc-43 | hpcbench-hpc-47 | +|-----------------|-----------------|-----------------| +| hpcbench-hpc-51 | hpcbench-hpc-55 | hpcbench-hpc-59 | +|-----------------|-----------------|-----------------| +| hpcbench-hpc-63 | hpcbench-hpc-67 | hpcbench-hpc-71 | +|-----------------|-----------------|-----------------| + +block 4 : # of Nodes = 18 +ClusterUUID + CliqueID : 5e797fc6-0f46-421a-8724-0e102c0f722a +|-----------------|-----------------|-----------------| +| hpcbench-hpc-4 | hpcbench-hpc-8 | hpcbench-hpc-12 | +|-----------------|-----------------|-----------------| +| hpcbench-hpc-16 | hpcbench-hpc-20 | 
hpcbench-hpc-24 | +|-----------------|-----------------|-----------------| +| hpcbench-hpc-28 | hpcbench-hpc-32 | hpcbench-hpc-36 | +|-----------------|-----------------|-----------------| +| hpcbench-hpc-40 | hpcbench-hpc-44 | hpcbench-hpc-48 | +|-----------------|-----------------|-----------------| +| hpcbench-hpc-52 | hpcbench-hpc-56 | hpcbench-hpc-60 | +|-----------------|-----------------|-----------------| +| hpcbench-hpc-64 | hpcbench-hpc-68 | hpcbench-hpc-72 | +|-----------------|-----------------|-----------------| diff --git a/azure-slurm/test/slurmcc_test/topology_test_input/tree_topo_vis.txt b/azure-slurm/test/slurmcc_test/topology_test_input/tree_topo_vis.txt new file mode 100644 index 00000000..b0ecf2e4 --- /dev/null +++ b/azure-slurm/test/slurmcc_test/topology_test_input/tree_topo_vis.txt @@ -0,0 +1,22 @@ +Switch 5 (root) +├── Switch 0 (3 nodes) +│ ├── hpcbench-hpc-36 +│ ├── hpcbench-hpc-39 +│ └── hpcbench-hpc-42 +├── Switch 1 (6 nodes) +│ ├── hpcbench-hpc-1 +│ ├── hpcbench-hpc-35 +│ ├── hpcbench-hpc-38 +│ ├── hpcbench-hpc-41 +│ ├── hpcbench-hpc-44 +│ └── hpcbench-hpc-49 +├── Switch 2 (3 nodes) +│ ├── hpcbench-hpc-37 +│ ├── hpcbench-hpc-45 +│ └── hpcbench-hpc-46 +├── Switch 3 (2 nodes) +│ ├── hpcbench-hpc-40 +│ └── hpcbench-hpc-43 +└── Switch 4 (2 nodes) + ├── hpcbench-hpc-47 + └── hpcbench-hpc-48