Connected Autonomous Vehicles (CAVs) promise to reduce congestion in future urban networks, potentially by optimizing their routing decisions. Unlike for human drivers, these decisions can be made with collective, data-driven policies, developed by machine learning algorithms. Reinforcement learning (RL) can facilitate the development of such collective routing strategies, yet standardized and realistic benchmarks are missing. To that end, we present URB: Urban Routing Benchmark for RL-equipped Connected Autonomous Vehicles. URB is a comprehensive benchmarking environment that unifies evaluation across 29 real-world traffic networks paired with realistic demand patterns. URB comes with a catalog of predefined tasks, five state-of-the-art multi-agent RL (MARL) algorithm implementations, three baseline methods, ten domain-specific performance metrics, and a modular configuration scheme. Experimental results initiate the first leaderboard for MARL in large-scale urban routing optimization and reveal that current approaches struggle to scale, emphasizing the urgent need for advancements in this domain.
Via the broad experimental scheme, URB aims to:
- understand which state-of-the-art algorithms outperform others in this class of tasks,
- open competition for future algorithmic improvements,
- better understand impact of collective CAV routing for future cities (on congestion, emissions and sustainability) to equip policymakers with solid arguments for CAV regulations.
Task: In a given network with a fixed demand pattern, experienced human agents have learned their route-choice strategies (minimized travel times). At some point, a given share of them mutate to CAVs and delegate routing decisions. Then, for a given number of training episodes, the agents develop routing strategies to minimize their delay using MARL.
Travel times relative to human baseline (
| Algorithm | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| IPPO | 3.15 | 3.28 | 3.34 | 3.24 | 0.6 | 0.12 | 1.33 | -0.31 | 0.05 | 0% |
| IQL | 3.15 | 3.34 | 3.49 | 3.24 | 0.66 | 0.14 | 1.44 | -0.42 | 0.08 | 0% |
| MAPPO | 3.15 | 3.32 | 3.43 | 3.25 | 0.66 | 0.14 | 1.45 | -0.27 | 0.08 | 0% |
| QMIX | 3.15 | 3.2 | 3.12 | 3.25 | 0.65 | 0.13 | 1.43 | -0.2 | 0.01 | 66% |
| Human | 3.15 | 3.15 | 3.15 | 3.15 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 100% |
| AON | 3.15 | 3.15 | 3.01 | 3.25 | 0.55 | 0.09 | 1.21 | -0.06 | -0.0 | 100% |
| Random | 3.15 | 3.38 | 3.58 | 3.25 | 0.6 | 0.09 | 1.36 | -0.33 | 0.1 | 0% |
| Algorithm | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| IPPO | 2.8 | 2.88 | 2.92 | 2.85 | 0.53 | 0.26 | 0.93 | -0.4 | 0.03 | 0% |
| IQL | 2.8 | 2.92 | 3.03 | 2.84 | 1.48 | 0.98 | 2.23 | -0.52 | 0.06 | 0% |
| MAPPO | 2.8 | 2.92 | 3.03 | 2.85 | 1.23 | 0.81 | 1.87 | -0.64 | 0.06 | 0% |
| QMIX | 2.8 | 2.96 | 3.14 | 2.85 | 0.88 | 0.54 | 1.41 | -0.8 | 0.07 | 0% |
| Human | 2.8 | 2.8 | 2.8 | 2.8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 100% |
| AON | 2.8 | 2.81 | 2.76 | 2.84 | 0.47 | 0.19 | 0.99 | -0.14 | -0.0 | 100% |
| Random | 2.8 | 2.93 | 3.04 | 2.85 | 0.51 | 0.22 | 0.95 | -0.62 | 0.06 | 0% |
| Algorithm | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| IPPO | 4.21 | 4.4 | 4.72 | 4.18 | 1.76 | 1.22 | 2.56 | -0.37 | 0.07 | 0% |
| IQL | 4.21 | 4.45 | 4.8 | 4.22 | 1.68 | 1.08 | 2.59 | -0.62 | 0.07 | 0% |
| MAPPO | 4.21 | 4.44 | 4.81 | 4.2 | 1.82 | 1.21 | 2.75 | -0.62 | 0.07 | 0% |
| QMIX | 4.21 | 4.36 | 4.55 | 4.23 | 1.2 | 0.67 | 1.98 | -0.73 | 0.02 | 0% |
| Human | 4.21 | 4.21 | 4.21 | 4.21 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 100% |
| AON | 4.21 | 4.29 | 4.37 | 4.23 | 0.87 | 0.55 | 0.24 | -0.45 | -0.01 | 0% |
| Random | 4.21 | 4.45 | 4.81 | 4.22 | 0.99 | 0.49 | 1.74 | -0.68 | 0.07 | 0% |
For parameterization details, visit the dedicated repository.
URB (as depicted in the above figure):
- Runs an experiment script using the
TrafficEnvironmentfromRouteRL, - With a RL algorithm (possibly
TorchRLimplementation) or a baseline method (frombaseline_models/), - Opens algorithm, environment and task configuration files from
config/, - Loads the network and demand from
networks - Executes a typical
RouteRLroutine of- first learning of human drivers,
- which then 'mutate` to CAVs,
- are trained to optimize routing policies with the implemented algorithm.
- When the training is finished, it uses raw results to compute a wide-set of KPIs.
Collaborative fleet minimizing group average travel time with 40% market share with non-adaptive human routing behaviour.
In a French small town of Saint-Arnoult, where human routing decision were stable and close to the well-known User Equilibrium, 40% of drivers decide to switch on the autonomous driving mode, delegating their routing decisions. Then, each machine will apply some kind of algorithm to select the route maximising the reward function: group average travel time.
- Fully autonomous fleet
- Altruistic fleet (minimize time for humans)
- Malicious fleet (maximize time for humans)
- Modal shifts - humans join or leave the fleet according to its performance
- Human adaptations - drivers react to actions of the fleet and change their behaviour
With this repository, URB comes with 6 traffic networks and associated demand data to experiment with. Two examples:
![]() |
![]() |
|---|---|
| Gretz Armainvilliers | Nangis |
More networks and demand data are available here. User can download the network folder of their choice, place the folder in
networks/, and use it as described below.
For a quickstart interaction with URB, we provide an executable code capsule on Code Ocean that runs a concise demonstrative experiment using the QMIX algorithm in the St. Arnoult network.
This environment includes all necessary dependencies (including SUMO) preinstalled, enabling reproducibility with a single click. We invite those interested to explore this capsule to examine the experimental workflow and output formats in a fully isolated and controlled setting.
- Visit the capsule link.
- Create a free CodeOcean account (if you donβt have one).
- Click Reproducible Run to execute the code in a controlled and reproducible environment.
Make sure you have SUMO installed in your system. This procedure should be carried out separately, by following the instructions provided here.
Clone the URB repository from github by
git clone https://github.com/COeXISTENCE-PROJECT/URB.git- Option 1 (Recommended): Create a virtual enviroment with
venv:
python3.13.1 -m venv .venvand then install dependencies by:
cd URB
pip install --force-reinstall --no-cache-dir -r requirements.txt- Option 2 (Alternative): Use conda environment with
conda:
conda create -n URB python=3.13.1and then install dependencies by:
cd URB
conda activate URB
pip install --force-reinstall --no-cache-dir -r requirements.txtTo use URB while using RL algorithm, you have to provide in the command line the following command:
python scripts/<script_name> --id <exp_id> --alg-conf <hyperparam_id> --env-conf <env_conf_id> --task-conf <task_id> --net <net_name> --env-seed <env_seed> --torch-seed <torch_seed>where
<scipt_name>is the script you wish to run, available scripts areippo_torchrl,iql_torchrl,mappo_torchrl,vdn_torchrlandqmix_torchrl,<exp_id>is your own experiment identifier, for instancerandom_ing,<hyperparam_id>is the hyperparameterization identifier, it must correspond to a.jsonfilename (without extension) inconfig/algo_config. Provided scripts automatically select the algorithm-specific subfolder in this directory.<env_conf_id>is the environment configuration identifier. It must correspond to a.jsonfilename (without extension) inconfig/env_config. It is used to parameterize environment-specific processes, such as path generation, disk operations, etc. It is optional and by default is set toconfig1.<task_id>is the task configuration identifier. It must correspond to a.jsonfilename (without extension) inconfig/task_config. It is used to parameterize the simulated scenario, such as portion of AVs, duration of human learning, AV behavior, etc.<net_name>is the name of the network you wish to use. Must be one of the folder names innetworks/i.e.ingolstadt_custom,nangis,nemours,provinsorsaint_arnoult,<env_seed>is reproducibility random seed for the traffic environment, default seed is set to be 42,<torch_seed>is reproducibility random seed for PyTorch, it is optional and by default is set to 42.
For example, the following command runs an experiment using:
- QMIX algorithm, hyperparameterized by
config/algo_config/qmix/config3.json, - The task specified in
config/task_config/config4.json, - The environment parameterization specified in
config/env_config/config1.json(by default), - Experiment identifier
sai_qmix_0, which will be used as the folder name inresults/to save the experiment data, - Saint Arnoult network and demand, from
networks/saint_arnoult, - Environment (also used for
randomandnumpy) and PyTorch seeds 42 and 0, respectively.
python scripts/qmix_torchrl.py --id sai_qmix_0 --alg-conf config3 --task-conf config4 --net saint_arnoult --env-seed 42 --torch-seed 0Similarly as for RL algorithms, you have to provide command, but there is one additional flag model for scripts/baselines.py, instead of torch-seed, then you have command of form:
python scripts/baselines.py --id <exp_id> --alg-conf <hyperparam_id> --env-conf <env_conf_id> --task-conf <task_id> --net <net_name> --env-seed <env_seed> --model <model_name>And <model_name> should be one of random, aon (included in baseline_models) or gawron (from RouteRL).
For example:
python scripts/baselines.py --id ing_aon --alg-conf config1 --task-conf config2 --net ingolstadt_custom --model aonEach experiment outputs set of raw records, which are then processed with the script in this folder for a set of performance indicators which we report and several additional metrics that track the quality of the solution and its impact to the system.
To use the analysis script, you have to provide in the command line the following command:
python analysis/metrics.py --id <exp_id> --verbose <verbose> --results-folder <results-folder> --skip-clearing <skip-clearing> --skip-collecting <skip-collecting>that will collect the results from the experiment with identifier <exp_id> and save them in the folder <exp_id>/metrics/. The --verbose flag is optional and if set to True will print additional information about the analysis process. Flag --results-folder is optional and if set to True will use the folder <results-folder> instead of the default one results/. The flags --skip-clearing and --skip-collecting are optional and if set to True will skip clearing and collecting the results from the experiment, respectively. Those operations have to be done only once, so if you are running the analysis script multiple times, you can skip them.
The core metric is the travel time
- CAV advantage as
$\hat{t}_{HDV}^{post}$ /$\hat{t}_{CAV}$ , - Effect of changing to CAV as
${\hat{t}_{HDV}^{pre}}/{\hat{t}_{CAV}}$ , and - Effect of remaining HDV as
${\hat{t}_{HDV}^{pre}}/{\hat{t}_{HDV}^{test}}$ , which reflect the relative performance of HDVs and the CAV fleet from the point of view of individual agents.
To better understand the causes of the changes in travel time, we track the Average speed and Average mileage (directly extracted from SUMO).
We measure the Cost of training, expressed as the average of:
We provide templates for extending the possible experiments that can be conducted using URB.
Users can define and use their own baseline methods by creating a new model by extending baseline_models/BaseLearningModel.
Users can add new experiment scripts for testing different algorithms, different implementations and different training pipelines. The recommended script structure is provided in scripts/base_script.py.
Users can extend possible experiment configurations by adding:
- Algorithm hyperparameterization in
config/algo_config, - Experiment setting in
config/env_config, and - New tasks in
config/task_config.
@misc{URB,
author = {A. O. Akman, A. Psarou, M. Hoffmann, Ε. Gorczyca, Ε. Kowalski, P. Gora, G. JamrΓ³z, R. Kucharski},
doi = {10.48550/arXiv.2505.17734},
month = {5},
title = {{Urban Routing Benchmark}},
url = {https://github.com/COeXISTENCE-PROJECT/URB},
version = {1.0.0},
year = {2025}
}





