mlx-maestro

(WIP) A virtual orchestrator for managing multiple MLX AI models in a memory-constrained system

How it works

Maestro (mlx-maestro) is an experimental AI orchestration tool for running multiple MLX models in a provisioned memory space, so that only certain combinations of agents can run concurrently at any given time. Instead of cancelling requests when the system is busy, mlx-maestro holds the connection open until memory can be safely allocated for the requested model.

Services that have not handled a request within a specified timeout automatically have their process terminated, freeing memory for other services. Likewise, a space-partitioning mechanism (which I refer to as volume) ensures that a service only allocates itself once the space for it to do so actually exists.

After a configurable timeout (defaults to 300 seconds), any service that has idled, even with no other services competing for its memory space, is terminated to restore usable memory to the host.

This uses a generous, volume-only model (with high determinism). In a future release, I intend to add a failure mode for when a volume of 1.0 is no longer possible due to other tenants on the host.

From my own manual testing, this has allowed me to run high-level sequences of various models without having to manually manage or free memory when certain models are not required for a given integration (code, IO workflows, etc.).
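The hold-open admission behaviour described above can be sketched with asyncio primitives. This is a simplified illustration only; the class and method names are hypothetical, not the project's actual code:

```python
import asyncio


class MemoryRail:
    """Toy sketch of volume-gated admission: the volumes of all running
    services may never sum past 1.0. A request for a service that does
    not fit is held open (not rejected) until capacity frees up."""

    def __init__(self) -> None:
        self.capacity = 1.0                  # free share of the memory budget
        self._changed = asyncio.Condition()  # signalled whenever capacity grows

    async def acquire(self, volume: float) -> None:
        # Block the caller until this service's volume fits in free capacity.
        async with self._changed:
            await self._changed.wait_for(lambda: self.capacity >= volume)
            self.capacity -= volume

    async def release(self, volume: float) -> None:
        # Called after a service is terminated; wakes any waiting requests.
        async with self._changed:
            self.capacity += volume
            self._changed.notify_all()
```

A request for a volume-0.7 model on an empty rail is admitted immediately; a second 0.7 request would wait inside `acquire` until the first service is released.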

Installation

Project

I haven't set up a daemon-like operating mode for this script yet, so it needs to be installed manually (preferably with uv) as a standalone project.

git clone https://github.com/f1yn/mlx-maestro
cd mlx-maestro
uv sync

Installing models

Configured models (see Configuration) can be installed concurrently using the included prime.py script:

 ./.venv/bin/python ./src/prime.py

Sidecar support (via mlx-openai-server)

For sidecar models (like embeddings and TTS/STT), I would strongly recommend also installing mlx-openai-server to provide the API abstraction such a system needs. It must be installed in the execution path and paired directly with the versions of the models you intend to run.

uv tool install mlx-openai-server

Configuration

The service is configured via a single config.toml file in the working directory. Below is the schema and usage notes.


[config] Section

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| host | string | "0.0.0.0" | Interface to bind the HTTP server. Use 127.0.0.1 for local-only access. |
| port_offset | integer | 10 | Offset applied to base service ports (e.g., 10101 + 10 = 10111). Prevents port conflicts when running multiple instances. |
| port_bind_timeout | integer | 20 | Maximum seconds to wait for a port to become available before failing startup. |
| sidecar_port_start | integer | 10200 | First port number allocated for sidecar processes. Must be ≥1024 and avoid system ports (e.g., 5000, 8080). |
| sidecar_timeout | integer | 300 | Seconds of inactivity after the last request before a sidecar process is terminated. |
| service_default_timeout | integer | 300 | Default idle timeout (seconds) for primary services. Can be overridden per service. |
| request_gap | integer | 15 | Minimum time (seconds) between the last request and when a model becomes eligible for unloading. Prevents flapping during bursty traffic. |

[[service]] Entries

| Key | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | yes | Human-readable identifier (e.g., "Vision", "Primary-1"). Used in logs and metrics. |
| port | integer | yes | Base port for the service (e.g., 10105). Final port = port + port_offset. |
| model_id | string | yes | Hugging Face model ID or local path. |
| volume | float | yes | Weight used for request routing (e.g., 0.7 = 70% of traffic to this service). |
| timeout | integer | no | Overrides service_default_timeout for this service. |
| extra_args | array | no | Additional CLI arguments passed to the service command (e.g., ["--trust-remote-code"]). |
| command | array | no | Custom command template. Supports {model} and {port} substitution. If omitted, defaults to mlx-openai-server launch --model-path {model} --port {port}. |
| sidecars | table | no | Map of HTTP path prefixes to sidecar names (e.g., { "/v1/embeddings" = "Embedder" }). |

Critical nuance:

  • port is the base port — actual binding uses port + port_offset.
  • sidecars are only spawned when a request hits their mapped path.
  • volume is used for routing weight — higher values receive proportionally more requests.
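To illustrate, a fuller [[service]] entry using most of the keys above might look like the following (the name and model ID are placeholders, not recommendations):

```toml
[[service]]
name = "Vision"
port = 10105                   # actual bind = 10105 + port_offset
model_id = "my-org/vision-model"
volume = 0.7
timeout = 600                  # overrides service_default_timeout
extra_args = ["--trust-remote-code"]
sidecars = { "/v1/embeddings" = "Embedder" }
```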

[[sidecar]] Entries

| Key | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | yes | Unique sidecar name (e.g., "Embedder", "Audio"). |
| model_id | string | yes | Hugging Face model ID or local path. |
| volume | float | yes | Weight used to prioritize sidecar spawning (e.g., 0.4 = higher priority than 0.08). |
| command | array | yes | Command template. Must include {model} and {port}. |

Sidecar lifecycle:

  1. First request to a mapped path triggers command execution.
  2. Sidecar is kept alive for sidecar_timeout seconds after last request.
  3. Any request within sidecar_timeout resets the timer; otherwise, the sidecar is terminated.
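The lifecycle above amounts to lazy spawning plus an idle timer. A minimal sketch of that pattern follows (hypothetical names; the real implementation manages actual subprocesses, not a boolean flag):

```python
import asyncio
import time


class Sidecar:
    """Toy sketch of the sidecar lifecycle: spawn on first request,
    terminate after `timeout` seconds without traffic."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.running = False
        self.last_request = 0.0

    async def handle_request(self) -> None:
        if not self.running:
            self.running = True  # real code would exec the command template here
        self.last_request = time.monotonic()  # every request resets the timer

    async def reap_if_idle(self) -> bool:
        # Called periodically by a GC task; returns True if it terminated us.
        if self.running and time.monotonic() - self.last_request > self.timeout:
            self.running = False  # real code would stop the child process
            return True
        return False
```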

Example: Minimal config.toml

[config]
host = '127.0.0.1'
port_offset = 0
sidecar_port_start = 10200
sidecar_timeout = 60
service_default_timeout = 120
request_gap = 5

[[service]]
name = "Text"
port = 8000
model_id = "my-org/text-model"
volume = 1.0
sidecars = { "/v1/embeddings" = "Embedder" }

[[sidecar]]
name = "Embedder"
model_id = "my-org/embeddings"
volume = 0.5
command = ["mlx-openai-server", "launch", "--model-path", "{model}", "--port", "{port}"]

Important notes

Memory ceiling

You will need to increase the wired-memory ceiling to something better suited to volatile memory use and AI models in general. On my 64 GB system, that works out to ~58 GB of possible wired memory for a given agent (based on my own needs); you'll need to calculate this for the workflows you intend to run. Your volume settings in config.toml should be calibrated so that a 1.0 allocation represents a fully utilized backend.

sudo sysctl iogpu.wired_limit_mb=58400

Future versions of this software will detect a lower ceiling and raise HTTP errors (instead of holding the port or hitting you with an OOM). That said, you probably shouldn't be running extensive memory workflows on macOS while running concurrent models via MLX. Just a feeling.
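The 58400 MB figure above is specific to my machine. One rough way to derive your own value is to subtract a fixed headroom for macOS from total RAM; the helper below is purely illustrative, and the 6 GB default headroom is my assumption, not a project recommendation:

```python
def suggested_wired_limit_mb(total_ram_gb: float, os_headroom_gb: float = 6.0) -> int:
    """Rough wired-memory ceiling in MB, leaving `os_headroom_gb` of RAM
    for macOS itself. Tune the headroom to your own workload; pass the
    result to `sudo sysctl iogpu.wired_limit_mb=<value>`."""
    return int((total_ram_gb - os_headroom_gb) * 1000)
```

For example, a 64 GB machine with the default headroom yields 58000 MB, close to the 58400 I use.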

Ideal use case

To lower complexity, this implementation was deliberately set up to use local ports for each model. I also use a load balancer in front of each port to manage each service over SSL/TLS, and make that available selectively through a VPN. The load balancer also enforces authentication over that same secure connection.

This is, without a doubt, the best case scenario for local model usage. Never host your own models without external authentication!

Disclaimer

This project solves a real issue for me in my network stack, and has given me a practical way to refine my understanding of modern Python code. But as I am still becoming more familiar with asyncio (and with managing processes in Python), I expect this software to be UNSTABLE until more quality-of-life additions are added. As such, USE AT YOUR OWN RISK.

Running

 ./.venv/bin/python ./src/main.py

Example output

A recent test run using some higher-end models.

~/L/dev-ai (main|✚8) $ ./.venv/bin/python ./src/main.py
00:54:06.384 | Config | INFO | Detected 5 services and 2
00:54:06.384 | MemoryRail.RT | INFO | Registered Sidecar "Embedder" on internal port 10200
00:54:06.384 | MemoryRail.RT | INFO | Registered Sidecar "Ear" on internal port 10201
00:54:06.385 | Proxy | INFO | System ready. Capacity: 1.0
00:54:10.710 | MemoryRail.RT | INFO | Capacity OK (1.00 > 0.70). Starting "Architect"...
00:54:10.710 | MemoryRail.RT.Architect | INFO | Booting process...
00:54:11.259 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:54:11,259 - INFO - HTTP Request: GET https://huggingface.co/api/models/nightmedia/Qwen3-Coder-Next-mxfp4-mlx/revision/main "HTTP/1.1 200 OK"
Fetching 17 files: 100%|██████████| 17/17 [00:00<00:00, 50930.83it/s]
00:54:17.805 | MemoryRail.RT.Architect | INFO | process::stderr -> /Users/flynn/LocalProjects/dev-ai/.venv/lib/python3.11/site-packages/mlx_lm/server.py:1695: UserWarning: mlx_lm.server is not recommended for production as it only implements basic security checks.
00:54:17.805 | MemoryRail.RT.Architect | INFO | process::stderr -> warnings.warn(
00:54:17.805 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:54:17,805 - INFO - Starting httpd at 127.0.0.1 on port 10111...
00:54:18.252 | MemoryRail.RT.Architect | INFO | Port 10111 is open.
00:54:18.252 | MemoryRail.RT | INFO | Acquired lock for "Architect" after 7.54s wait.
00:54:18.253 | MemoryRail.RT | INFO | Acquired lock for "Architect" after 1.78s wait.
00:54:18.327 | MemoryRail.RT.Architect | INFO | process::stderr -> 127.0.0.1 - - [19/Feb/2026 00:54:18] "POST /v1/chat/completions HTTP/1.1" 200 -
00:54:18.373 | MemoryRail.RT.Architect | INFO | process::stderr -> 127.0.0.1 - - [19/Feb/2026 00:54:18] "POST /v1/chat/completions HTTP/1.1" 200 -
00:54:18.497 | MemoryRail.RT | INFO | Capacity OK (0.30 > 0.15). Starting "Autocomplete"...
00:54:18.497 | MemoryRail.RT.Autocomplete | INFO | Booting process...
00:54:19.252 | MemoryRail.RT.Autocomplete | INFO | process::stderr -> 2026-02-19 00:54:19,252 - INFO - HTTP Request: GET https://huggingface.co/api/models/LiquidAI/LFM2.5-1.2B-Instruct/revision/main "HTTP/1.1 200 OK"
Fetching 7 files: 100%|██████████| 7/7 [00:00<00:00, 146070.29it/s]
00:54:19.753 | MemoryRail.RT.Autocomplete | INFO | process::stderr -> /Users/flynn/LocalProjects/dev-ai/.venv/lib/python3.11/site-packages/mlx_lm/server.py:1695: UserWarning: mlx_lm.server is not recommended for production as it only implements basic security checks.
00:54:19.753 | MemoryRail.RT.Autocomplete | INFO | process::stderr -> warnings.warn(
00:54:19.753 | MemoryRail.RT.Autocomplete | INFO | process::stderr -> 2026-02-19 00:54:19,753 - INFO - Starting httpd at 127.0.0.1 on port 10112...
00:54:20.005 | MemoryRail.RT.Autocomplete | INFO | Port 10112 is open.
00:54:20.005 | MemoryRail.RT | INFO | Acquired lock for "Autocomplete" after 1.51s wait.
00:54:20.021 | MemoryRail.RT.Autocomplete | INFO | process::stderr -> 127.0.0.1 - - [19/Feb/2026 00:54:20] "POST /v1/chat/completions HTTP/1.1" 200 -
00:54:20.151 | MemoryRail.RT.Autocomplete | INFO | process::stderr -> 2026-02-19 00:54:20,151 - INFO - Prompt processing progress: 190/191
00:54:24.392 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:54:24,392 - INFO - Prompt processing progress: 380/380
00:54:24.392 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:54:24,392 - INFO - Prompt processing progress: 2048/16361
00:54:30.413 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:54:30,412 - INFO - Prompt processing progress: 380/380
00:54:30.413 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:54:30,413 - INFO - Prompt processing progress: 4096/16361
00:54:36.706 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:54:36,706 - INFO - Prompt processing progress: 380/380
00:54:36.706 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:54:36,706 - INFO - Prompt processing progress: 6144/16361
00:54:43.248 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:54:43,248 - INFO - Prompt processing progress: 380/380
00:54:43.248 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:54:43,248 - INFO - Prompt processing progress: 8192/16361
00:54:50.044 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:54:50,044 - INFO - Prompt processing progress: 380/380
00:54:50.044 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:54:50,044 - INFO - Prompt processing progress: 10240/16361
00:54:52.514 | MemoryRail.RT | INFO | Director is waiting... Blocked by: Architect (Busy)
00:54:52.514 | MemoryRail.RT | INFO | Current rail state: [Architect: BUSY, Autocomplete: COLD]
00:54:57.088 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:54:57,088 - INFO - Prompt processing progress: 380/380
00:54:57.088 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:54:57,088 - INFO - Prompt processing progress: 12288/16361
00:54:58.521 | MemoryRail.RT | INFO | Director is waiting... Blocked by: Architect (Busy)
00:54:58.521 | MemoryRail.RT | INFO | Current rail state: [Architect: BUSY, Autocomplete: COLD]
00:55:04.389 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:55:04,389 - INFO - Prompt processing progress: 380/380
00:55:04.389 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:55:04,389 - INFO - Prompt processing progress: 14336/16361
00:55:04.527 | MemoryRail.RT | INFO | Director is waiting... Blocked by: Architect (Busy)
00:55:04.527 | MemoryRail.RT | INFO | Current rail state: [Architect: BUSY, Autocomplete: COLD]
00:55:10.534 | MemoryRail.RT | INFO | Director is waiting... Blocked by: Architect (Busy)
00:55:10.534 | MemoryRail.RT | INFO | Current rail state: [Architect: BUSY, Autocomplete: COLD]
00:55:11.921 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:55:11,920 - INFO - Prompt processing progress: 380/380
00:55:11.923 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:55:11,920 - INFO - Prompt processing progress: 16360/16361
00:55:11.981 | MemoryRail.RT.Architect | INFO | process::stderr -> 127.0.0.1 - - [19/Feb/2026 00:55:11] "POST /v1/chat/completions HTTP/1.1" 200 -
00:55:12.787 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:55:12,787 - INFO - Prompt processing progress: 381/382
00:55:15.690 | MemoryRail.RT | INFO | Director is waiting... Blocked by: Architect (Busy)
00:55:15.690 | MemoryRail.RT | INFO | Current rail state: [Architect: BUSY, Autocomplete: COLD]
00:55:21.695 | MemoryRail.RT | INFO | Director is waiting... Blocked by: Architect (Busy)
00:55:21.695 | MemoryRail.RT | INFO | Current rail state: [Architect: BUSY, Autocomplete: COLD]
00:55:23.276 | MemoryRail.GC | ERROR | Collecting garbage (detected 1 services to be closed)
00:55:23.276 | MemoryRail.GC | INFO | Now unloading "Autocomplete."
00:55:23.276 | MemoryRail.RT.Autocomplete | INFO | Stopping PID 6520...
00:55:23.304 | MemoryRail.RT.Autocomplete | INFO | process::stderr -> /Users/flynn/.local/share/uv/python/cpython-3.11.14-macos-aarch64-none/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
00:55:23.304 | MemoryRail.RT.Autocomplete | INFO | process::stderr -> warnings.warn('resource_tracker: There appear to be %d '
00:55:24.309 | MemoryRail.RT.Autocomplete | INFO | [Autocomplete] Process terminated.
00:55:28.312 | MemoryRail.RT | INFO | Director is waiting... Blocked by: Architect (Busy)
00:55:28.312 | MemoryRail.RT | INFO | Current rail state: [Architect: BUSY]
00:55:33.716 | MemoryRail.RT | INFO | Director is waiting... Blocked by: Architect (Warm: 11s)
00:55:33.716 | MemoryRail.RT | INFO | Current rail state: [Architect: WARM(11s)]
00:55:39.723 | MemoryRail.RT | INFO | Director is waiting... Blocked by: Architect (Warm: 5s)
00:55:39.723 | MemoryRail.RT | INFO | Current rail state: [Architect: WARM(5s)]
00:55:45.731 | MemoryRail.RT | INFO | Evicting "Architect" (State: COLD) to make room for "Director"...
00:55:45.731 | MemoryRail.RT.Architect | INFO | Stopping PID 6510...
00:55:45.991 | MemoryRail.RT.Architect | INFO | process::stderr -> /Users/flynn/.local/share/uv/python/cpython-3.11.14-macos-aarch64-none/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
00:55:45.991 | MemoryRail.RT.Architect | INFO | process::stderr -> warnings.warn('resource_tracker: There appear to be %d '
00:55:46.996 | MemoryRail.RT.Architect | INFO | [Architect] Process terminated.
00:55:46.996 | MemoryRail.RT | INFO | Capacity OK (1.00 > 0.40). Starting "Director"...
00:55:46.997 | MemoryRail.RT.Director | INFO | Booting process...
00:55:47.883 | MemoryRail.RT.Director | INFO | process::stderr -> 2026-02-19 00:55:47,883 - INFO - HTTP Request: GET https://huggingface.co/api/models/mlx-community/gemma-3-27b-it-qat-4bit/revision/main "HTTP/1.1 200 OK"
Fetching 15 files: 100%|██████████| 15/15 [00:00<00:00, 114598.47it/s]
00:55:51.415 | MemoryRail.RT.Director | INFO | process::stderr -> /Users/flynn/LocalProjects/dev-ai/.venv/lib/python3.11/site-packages/mlx_lm/server.py:1695: UserWarning: mlx_lm.server is not recommended for production as it only implements basic security checks.
00:55:51.415 | MemoryRail.RT.Director | INFO | process::stderr -> warnings.warn(
00:55:51.415 | MemoryRail.RT.Director | INFO | process::stderr -> 2026-02-19 00:55:51,415 - INFO - Starting httpd at 127.0.0.1 on port 10114...
00:55:51.528 | MemoryRail.RT.Director | INFO | Port 10114 is open.
00:55:51.528 | MemoryRail.RT | INFO | Acquired lock for "Director" after 59.01s wait.
00:55:51.560 | MemoryRail.RT.Director | INFO | process::stderr -> 127.0.0.1 - - [19/Feb/2026 00:55:51] "POST /v1/chat/completions HTTP/1.1" 200 -
00:55:52.475 | MemoryRail.RT.Director | INFO | process::stderr -> 2026-02-19 00:55:52,474 - INFO - Prompt processing progress: 96/97
01:01:01.703 | MemoryRail.GC | ERROR | Collecting garbage (detected 1 services to be closed)
01:01:01.703 | MemoryRail.GC | INFO | Now unloading "Director."
01:01:01.704 | MemoryRail.RT.Director | INFO | Stopping PID 6591...
01:01:01.837 | MemoryRail.RT.Director | INFO | process::stderr -> /Users/flynn/.local/share/uv/python/cpython-3.11.14-macos-aarch64-none/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
01:01:01.837 | MemoryRail.RT.Director | INFO | process::stderr -> warnings.warn('resource_tracker: There appear to be %d '
01:01:02.841 | MemoryRail.RT.Director | INFO | [Director] Process terminated.
