mlx-maestro

(WIP) A virtual orchestrator for managing multiple MLX AI models in a memory-constrained system

How it works

Maestro (mlx-maestro) is an experimental AI orchestration tool for running multiple MLX models in a provisioned memory space, so that only certain combinations of agents can run concurrently at any given time. Instead of cancelling requests when the system is busy, mlx-maestro holds the connection open until memory can be safely allocated for the requested model.

Services that have not handled a request within a specified timeout automatically have their process terminated, freeing memory for other services. Likewise, a space-partitioning mechanism (which I refer to as volume) ensures that a service only allocates itself once the space for it to do so actually exists.

After a configurable timeout (defaults to 300 seconds), any service that has idled, even with no other services competing for its memory space, is terminated to restore usable memory to the host.

This uses a generous, volume-only model (with high determinism). In a future release, I intend to add a failure mode for when a volume of 1.0 is no longer possible due to other tenants on the host.

From my own manual testing, this has allowed me to run high-level sequences of various models without having to manually manage or free memory when certain models are not required for a given integration (code, IO workflows, etc.).
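The hold-open admission behaviour described above can be sketched with asyncio primitives. This is a simplified illustration only; the class and method names are hypothetical, not the project's actual code:

```python
import asyncio


class MemoryRail:
    """Toy sketch of volume-gated admission: the volumes of all running
    services may never sum past 1.0. A request for a service that does
    not fit is held open (not rejected) until capacity frees up."""

    def __init__(self) -> None:
        self.capacity = 1.0                  # free share of the memory budget
        self._changed = asyncio.Condition()  # signalled whenever capacity grows

    async def acquire(self, volume: float) -> None:
        # Block the caller until this service's volume fits in free capacity.
        async with self._changed:
            await self._changed.wait_for(lambda: self.capacity >= volume)
            self.capacity -= volume

    async def release(self, volume: float) -> None:
        # Called after a service is terminated; wakes any waiting requests.
        async with self._changed:
            self.capacity += volume
            self._changed.notify_all()
```

A request for a volume-0.7 model on an empty rail is admitted immediately; a second 0.7 request would wait inside `acquire` until the first service is released.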

Installation

Project

I haven't set up a daemon-like operating mode for this script yet, so it needs to be installed manually (preferably with uv) as a standalone project.

git clone https://github.com/f1yn/mlx-maestro
cd mlx-maestro
uv sync

Installing models

Configured models (see Configuration) can be installed concurrently using the included prime.py script:

 ./.venv/bin/python ./src/prime.py

Sidecar support (via mlx-openai-server)

For sidecar models (like embeddings and TTS/STT), I would strongly recommend also installing mlx-openai-server to provide the API abstraction such a system needs. It must be installed in the execution path and paired directly with the versions of the models you intend to run.

uv tool install mlx-openai-server

Configuration

The service is configured via a single config.toml file in the working directory. Below is the schema and usage notes.


[config] Section

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| host | string | "0.0.0.0" | Interface to bind the HTTP server. Use 127.0.0.1 for local-only access. |
| port_offset | integer | 10 | Offset applied to base service ports (e.g., 10101 + 10 = 10111). Prevents port conflicts when running multiple instances. |
| port_bind_timeout | integer | 20 | Maximum seconds to wait for a port to become available before failing startup. |
| sidecar_port_start | integer | 10200 | First port number allocated for sidecar processes. Must be ≥1024 and avoid system ports (e.g., 5000, 8080). |
| sidecar_timeout | integer | 300 | Seconds of inactivity after the last request before a sidecar process is terminated. |
| service_default_timeout | integer | 300 | Default idle timeout (seconds) for primary services. Can be overridden per service. |
| request_gap | integer | 15 | Minimum time (seconds) between the last request and when a model becomes eligible for unloading. Prevents flapping during bursty traffic. |

[[service]] Entries

| Key | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | yes | Human-readable identifier (e.g., "Vision", "Primary-1"). Used in logs and metrics. |
| port | integer | yes | Base port for the service (e.g., 10105). Final port = port + port_offset. |
| model_id | string | yes | Hugging Face model ID or local path. |
| volume | float | yes | Weight used for request routing (e.g., 0.7 = 70% of traffic to this service). |
| timeout | integer | no | Overrides service_default_timeout for this service. |
| extra_args | array | no | Additional CLI arguments passed to the service command (e.g., ["--trust-remote-code"]). |
| command | array | no | Custom command template. Supports {model} and {port} substitution. If omitted, defaults to mlx-openai-server launch --model-path {model} --port {port}. |
| sidecars | table | no | Map of HTTP path prefixes to sidecar names (e.g., { "/v1/embeddings" = "Embedder" }). |

Critical nuance:

  • port is the base port — actual binding uses port + port_offset.
  • sidecars are only spawned when a request hits their mapped path.
  • volume is used for routing weight — higher values receive proportionally more requests.
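To illustrate, a fuller [[service]] entry using most of the keys above might look like the following (the name and model ID are placeholders, not recommendations):

```toml
[[service]]
name = "Vision"
port = 10105                   # actual bind = 10105 + port_offset
model_id = "my-org/vision-model"
volume = 0.7
timeout = 600                  # overrides service_default_timeout
extra_args = ["--trust-remote-code"]
sidecars = { "/v1/embeddings" = "Embedder" }
```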

[[sidecar]] Entries

| Key | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | yes | Unique sidecar name (e.g., "Embedder", "Audio"). |
| model_id | string | yes | Hugging Face model ID or local path. |
| volume | float | yes | Weight used to prioritize sidecar spawning (e.g., 0.4 = higher priority than 0.08). |
| command | array | yes | Command template. Must include {model} and {port}. |

Sidecar lifecycle:

  1. First request to a mapped path triggers command execution.
  2. Sidecar is kept alive for sidecar_timeout seconds after last request.
  3. Any request within sidecar_timeout resets the timer; otherwise, the sidecar is terminated.
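The lifecycle above amounts to lazy spawning plus an idle timer. A minimal sketch of that pattern follows (hypothetical names; the real implementation manages actual subprocesses, not a boolean flag):

```python
import asyncio
import time


class Sidecar:
    """Toy sketch of the sidecar lifecycle: spawn on first request,
    terminate after `timeout` seconds without traffic."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.running = False
        self.last_request = 0.0

    async def handle_request(self) -> None:
        if not self.running:
            self.running = True  # real code would exec the command template here
        self.last_request = time.monotonic()  # every request resets the timer

    async def reap_if_idle(self) -> bool:
        # Called periodically by a GC task; returns True if it terminated us.
        if self.running and time.monotonic() - self.last_request > self.timeout:
            self.running = False  # real code would stop the child process
            return True
        return False
```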

Example: Minimal config.toml

[config]
host = '127.0.0.1'
port_offset = 0
sidecar_port_start = 10200
sidecar_timeout = 60
service_default_timeout = 120
request_gap = 5

[[service]]
name = "Text"
port = 8000
model_id = "my-org/text-model"
volume = 1.0
sidecars = { "/v1/embeddings" = "Embedder" }

[[sidecar]]
name = "Embedder"
model_id = "my-org/embeddings"
volume = 0.5
command = ["mlx-openai-server", "launch", "--model-path", "{model}", "--port", "{port}"]

Important notes

Memory ceiling

You will need to increase the wired-memory ceiling to something better suited to volatile memory use and AI models in general. On my 64 GB system, that works out to ~58 GB of possible wired memory for a given agent (based on my own needs); you'll need to calculate this for the workflows you intend to run. Your volume settings in config.toml should be calibrated so that a 1.0 allocation represents a fully utilized backend.

sudo sysctl iogpu.wired_limit_mb=58400

Future versions of this software will detect a lower ceiling and raise HTTP errors (instead of holding the port or hitting you with an OOM). That said, you probably shouldn't be running extensive memory workflows on macOS while running concurrent models via MLX. Just a feeling.
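The 58400 MB figure above is specific to my machine. One rough way to derive your own value is to subtract a fixed headroom for macOS from total RAM; the helper below is purely illustrative, and the 6 GB default headroom is my assumption, not a project recommendation:

```python
def suggested_wired_limit_mb(total_ram_gb: float, os_headroom_gb: float = 6.0) -> int:
    """Rough wired-memory ceiling in MB, leaving `os_headroom_gb` of RAM
    for macOS itself. Tune the headroom to your own workload; pass the
    result to `sudo sysctl iogpu.wired_limit_mb=<value>`."""
    return int((total_ram_gb - os_headroom_gb) * 1000)
```

For example, a 64 GB machine with the default headroom yields 58000 MB, close to the 58400 I use.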

Ideal use case

To lower complexity, this implementation was deliberately set up to use local ports for each model. I also use a load balancer in front of each port to manage each service over SSL/TLS, and make that available selectively through a VPN. The load balancer also enforces authentication over that same secure connection.

This is, without a doubt, the best case scenario for local model usage. Never host your own models without external authentication!

Disclaimer

This project solves a real issue for me in my network stack, and has given me a practical way to refine my understanding of modern Python code. But as I am still becoming more familiar with asyncio (and with managing processes in Python), I expect this software to be UNSTABLE until more quality-of-life additions are added. As such, USE AT YOUR OWN RISK.

Running

 ./.venv/bin/python ./src/main.py

Example output

A recent test run using some higher-end models.

~/L/dev-ai (main|✚8) $ ./.venv/bin/python ./src/main.py
00:54:06.384 | Config | INFO | Detected 5 services and 2
00:54:06.384 | MemoryRail.RT | INFO | Registered Sidecar "Embedder" on internal port 10200
00:54:06.384 | MemoryRail.RT | INFO | Registered Sidecar "Ear" on internal port 10201
00:54:06.385 | Proxy | INFO | System ready. Capacity: 1.0
00:54:10.710 | MemoryRail.RT | INFO | Capacity OK (1.00 > 0.70). Starting "Architect"...
00:54:10.710 | MemoryRail.RT.Architect | INFO | Booting process...
00:54:11.259 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:54:11,259 - INFO - HTTP Request: GET https://huggingface.co/api/models/nightmedia/Qwen3-Coder-Next-mxfp4-mlx/revision/main "HTTP/1.1 200 OK"
Fetching 17 files: 100%|██████████| 17/17 [00:00<00:00, 50930.83it/s]
00:54:17.805 | MemoryRail.RT.Architect | INFO | process::stderr -> /Users/flynn/LocalProjects/dev-ai/.venv/lib/python3.11/site-packages/mlx_lm/server.py:1695: UserWarning: mlx_lm.server is not recommended for production as it only implements basic security checks.
00:54:17.805 | MemoryRail.RT.Architect | INFO | process::stderr -> warnings.warn(
00:54:17.805 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:54:17,805 - INFO - Starting httpd at 127.0.0.1 on port 10111...
00:54:18.252 | MemoryRail.RT.Architect | INFO | Port 10111 is open.
00:54:18.252 | MemoryRail.RT | INFO | Acquired lock for "Architect" after 7.54s wait.
00:54:18.253 | MemoryRail.RT | INFO | Acquired lock for "Architect" after 1.78s wait.
00:54:18.327 | MemoryRail.RT.Architect | INFO | process::stderr -> 127.0.0.1 - - [19/Feb/2026 00:54:18] "POST /v1/chat/completions HTTP/1.1" 200 -
00:54:18.373 | MemoryRail.RT.Architect | INFO | process::stderr -> 127.0.0.1 - - [19/Feb/2026 00:54:18] "POST /v1/chat/completions HTTP/1.1" 200 -
00:54:18.497 | MemoryRail.RT | INFO | Capacity OK (0.30 > 0.15). Starting "Autocomplete"...
00:54:18.497 | MemoryRail.RT.Autocomplete | INFO | Booting process...
00:54:19.252 | MemoryRail.RT.Autocomplete | INFO | process::stderr -> 2026-02-19 00:54:19,252 - INFO - HTTP Request: GET https://huggingface.co/api/models/LiquidAI/LFM2.5-1.2B-Instruct/revision/main "HTTP/1.1 200 OK"
Fetching 7 files: 100%|██████████| 7/7 [00:00<00:00, 146070.29it/s]
00:54:19.753 | MemoryRail.RT.Autocomplete | INFO | process::stderr -> /Users/flynn/LocalProjects/dev-ai/.venv/lib/python3.11/site-packages/mlx_lm/server.py:1695: UserWarning: mlx_lm.server is not recommended for production as it only implements basic security checks.
00:54:19.753 | MemoryRail.RT.Autocomplete | INFO | process::stderr -> warnings.warn(
00:54:19.753 | MemoryRail.RT.Autocomplete | INFO | process::stderr -> 2026-02-19 00:54:19,753 - INFO - Starting httpd at 127.0.0.1 on port 10112...
00:54:20.005 | MemoryRail.RT.Autocomplete | INFO | Port 10112 is open.
00:54:20.005 | MemoryRail.RT | INFO | Acquired lock for "Autocomplete" after 1.51s wait.
00:54:20.021 | MemoryRail.RT.Autocomplete | INFO | process::stderr -> 127.0.0.1 - - [19/Feb/2026 00:54:20] "POST /v1/chat/completions HTTP/1.1" 200 -
00:54:20.151 | MemoryRail.RT.Autocomplete | INFO | process::stderr -> 2026-02-19 00:54:20,151 - INFO - Prompt processing progress: 190/191
00:54:24.392 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:54:24,392 - INFO - Prompt processing progress: 380/380
00:54:24.392 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:54:24,392 - INFO - Prompt processing progress: 2048/16361
00:54:30.413 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:54:30,412 - INFO - Prompt processing progress: 380/380
00:54:30.413 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:54:30,413 - INFO - Prompt processing progress: 4096/16361
00:54:36.706 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:54:36,706 - INFO - Prompt processing progress: 380/380
00:54:36.706 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:54:36,706 - INFO - Prompt processing progress: 6144/16361
00:54:43.248 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:54:43,248 - INFO - Prompt processing progress: 380/380
00:54:43.248 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:54:43,248 - INFO - Prompt processing progress: 8192/16361
00:54:50.044 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:54:50,044 - INFO - Prompt processing progress: 380/380
00:54:50.044 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:54:50,044 - INFO - Prompt processing progress: 10240/16361
00:54:52.514 | MemoryRail.RT | INFO | Director is waiting... Blocked by: Architect (Busy)
00:54:52.514 | MemoryRail.RT | INFO | Current rail state: [Architect: BUSY, Autocomplete: COLD]
00:54:57.088 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:54:57,088 - INFO - Prompt processing progress: 380/380
00:54:57.088 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:54:57,088 - INFO - Prompt processing progress: 12288/16361
00:54:58.521 | MemoryRail.RT | INFO | Director is waiting... Blocked by: Architect (Busy)
00:54:58.521 | MemoryRail.RT | INFO | Current rail state: [Architect: BUSY, Autocomplete: COLD]
00:55:04.389 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:55:04,389 - INFO - Prompt processing progress: 380/380
00:55:04.389 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:55:04,389 - INFO - Prompt processing progress: 14336/16361
00:55:04.527 | MemoryRail.RT | INFO | Director is waiting... Blocked by: Architect (Busy)
00:55:04.527 | MemoryRail.RT | INFO | Current rail state: [Architect: BUSY, Autocomplete: COLD]
00:55:10.534 | MemoryRail.RT | INFO | Director is waiting... Blocked by: Architect (Busy)
00:55:10.534 | MemoryRail.RT | INFO | Current rail state: [Architect: BUSY, Autocomplete: COLD]
00:55:11.921 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:55:11,920 - INFO - Prompt processing progress: 380/380
00:55:11.923 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:55:11,920 - INFO - Prompt processing progress: 16360/16361
00:55:11.981 | MemoryRail.RT.Architect | INFO | process::stderr -> 127.0.0.1 - - [19/Feb/2026 00:55:11] "POST /v1/chat/completions HTTP/1.1" 200 -
00:55:12.787 | MemoryRail.RT.Architect | INFO | process::stderr -> 2026-02-19 00:55:12,787 - INFO - Prompt processing progress: 381/382
00:55:15.690 | MemoryRail.RT | INFO | Director is waiting... Blocked by: Architect (Busy)
00:55:15.690 | MemoryRail.RT | INFO | Current rail state: [Architect: BUSY, Autocomplete: COLD]
00:55:21.695 | MemoryRail.RT | INFO | Director is waiting... Blocked by: Architect (Busy)
00:55:21.695 | MemoryRail.RT | INFO | Current rail state: [Architect: BUSY, Autocomplete: COLD]
00:55:23.276 | MemoryRail.GC | ERROR | Collecting garbage (detected 1 services to be closed)
00:55:23.276 | MemoryRail.GC | INFO | Now unloading "Autocomplete."
00:55:23.276 | MemoryRail.RT.Autocomplete | INFO | Stopping PID 6520...
00:55:23.304 | MemoryRail.RT.Autocomplete | INFO | process::stderr -> /Users/flynn/.local/share/uv/python/cpython-3.11.14-macos-aarch64-none/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
00:55:23.304 | MemoryRail.RT.Autocomplete | INFO | process::stderr -> warnings.warn('resource_tracker: There appear to be %d '
00:55:24.309 | MemoryRail.RT.Autocomplete | INFO | [Autocomplete] Process terminated.
00:55:28.312 | MemoryRail.RT | INFO | Director is waiting... Blocked by: Architect (Busy)
00:55:28.312 | MemoryRail.RT | INFO | Current rail state: [Architect: BUSY]
00:55:33.716 | MemoryRail.RT | INFO | Director is waiting... Blocked by: Architect (Warm: 11s)
00:55:33.716 | MemoryRail.RT | INFO | Current rail state: [Architect: WARM(11s)]
00:55:39.723 | MemoryRail.RT | INFO | Director is waiting... Blocked by: Architect (Warm: 5s)
00:55:39.723 | MemoryRail.RT | INFO | Current rail state: [Architect: WARM(5s)]
00:55:45.731 | MemoryRail.RT | INFO | Evicting "Architect" (State: COLD) to make room for "Director"...
00:55:45.731 | MemoryRail.RT.Architect | INFO | Stopping PID 6510...
00:55:45.991 | MemoryRail.RT.Architect | INFO | process::stderr -> /Users/flynn/.local/share/uv/python/cpython-3.11.14-macos-aarch64-none/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
00:55:45.991 | MemoryRail.RT.Architect | INFO | process::stderr -> warnings.warn('resource_tracker: There appear to be %d '
00:55:46.996 | MemoryRail.RT.Architect | INFO | [Architect] Process terminated.
00:55:46.996 | MemoryRail.RT | INFO | Capacity OK (1.00 > 0.40). Starting "Director"...
00:55:46.997 | MemoryRail.RT.Director | INFO | Booting process...
00:55:47.883 | MemoryRail.RT.Director | INFO | process::stderr -> 2026-02-19 00:55:47,883 - INFO - HTTP Request: GET https://huggingface.co/api/models/mlx-community/gemma-3-27b-it-qat-4bit/revision/main "HTTP/1.1 200 OK"
Fetching 15 files: 100%|██████████| 15/15 [00:00<00:00, 114598.47it/s]
00:55:51.415 | MemoryRail.RT.Director | INFO | process::stderr -> /Users/flynn/LocalProjects/dev-ai/.venv/lib/python3.11/site-packages/mlx_lm/server.py:1695: UserWarning: mlx_lm.server is not recommended for production as it only implements basic security checks.
00:55:51.415 | MemoryRail.RT.Director | INFO | process::stderr -> warnings.warn(
00:55:51.415 | MemoryRail.RT.Director | INFO | process::stderr -> 2026-02-19 00:55:51,415 - INFO - Starting httpd at 127.0.0.1 on port 10114...
00:55:51.528 | MemoryRail.RT.Director | INFO | Port 10114 is open.
00:55:51.528 | MemoryRail.RT | INFO | Acquired lock for "Director" after 59.01s wait.
00:55:51.560 | MemoryRail.RT.Director | INFO | process::stderr -> 127.0.0.1 - - [19/Feb/2026 00:55:51] "POST /v1/chat/completions HTTP/1.1" 200 -
00:55:52.475 | MemoryRail.RT.Director | INFO | process::stderr -> 2026-02-19 00:55:52,474 - INFO - Prompt processing progress: 96/97
01:01:01.703 | MemoryRail.GC | ERROR | Collecting garbage (detected 1 services to be closed)
01:01:01.703 | MemoryRail.GC | INFO | Now unloading "Director."
01:01:01.704 | MemoryRail.RT.Director | INFO | Stopping PID 6591...
01:01:01.837 | MemoryRail.RT.Director | INFO | process::stderr -> /Users/flynn/.local/share/uv/python/cpython-3.11.14-macos-aarch64-none/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
01:01:01.837 | MemoryRail.RT.Director | INFO | process::stderr -> warnings.warn('resource_tracker: There appear to be %d '
01:01:02.841 | MemoryRail.RT.Director | INFO | [Director] Process terminated.
