Hao Bai1,2*, Alexey Taymanov1, Tong Zhang2,
Aviral Kumar3, Spencer Whitehead1
1Microsoft 2UIUC 3CMU
*Work partly done during internship at Microsoft
WebGym is a scaled, production-ready, and rollout-efficient RL training framework for web automation agents powered by vision-language models (VLMs).
We highly recommend reading the content below to understand the design of this system. These are the most valuable insights we have distilled from building it, and they can save you hours, if not days, in making the best use of this framework's architecture and components.
- Fully asynchronous rollout framework: Our web-agent trajectory rollout framework, Omniboxes, is fully asynchronous. Among the trajectories being collected in a batch, there is no synchronization between steps or episodes. This evens out the workload on the machine and avoids the thundering herd problem. In our benchmarks, WebGym is 4-5x faster than a naive synchronous implementation.
- Portable & scalable rollout server: We adopt a server-client architecture so that you can host the Omniboxes server anywhere. The rollout server (essentially an orchestrated set of browser instances, which is what actually consumes resources) can run on a separate CPU machine, on a CPU-GPU machine whose GPUs are busy with other jobs, or simply on your PC. We also provide multi-node support so you can wire everything together: just provide the public IP or an interconnected internal IP, and designate one node as the master. The server code can easily be adapted to your own applications as well, not just the web.
- Monitored rollout client: The client runs the rollout collection loop, which generates an action and sends it to the rollout server, implemented with a two-level FIFO queue. The server processes the action and responds with an updated state. Since this process involves many threads, we monitor them by representing each thread as a cell. The WebGym Task Monitor displays everything you care about for the individual rollout threads: which step each thread is on, what status it is in (is it generating an action? is it waiting for the server to respond with a state?), and the statistics of these statuses across all threads.
- Multi-node fully-synchronous RL framework: We fully support multi-node rollout collection and training. In each iteration, if you have k CPU-GPU machines, the framework collects trajectories on all k machines, frees the memory on those machines, and then trains on the collected trajectories across the k machines. We assume these machines share some storage. For rollout collection, we use vLLM as the inference server; for training, we use LLaMA-Factory. We manage the RL loop with a `run.sh` script. The RL loop is fully synchronous, with no token-level or checkpoint-level asynchronous designs, to keep off-policyness as minimal as possible. WebGym avoids depending on any RL framework so that you can easily extract the rollout code and plug it into your own RL framework (veRL, Slime, PipelineRL, etc.).
- Policy and interaction: We natively support the Qwen-VL model series as the agent policy. We support two interaction modes: Set-of-Marks (SoM) and coordinate-based interaction. In SoM mode, the rollout server responds with a screenshot annotated with interactive elements and an accessibility tree of the webpage, so the agent can refer to elements in the screenshot by element id. In coordinate-based mode, only the screenshot appears in the response, and the policy must output coordinates instead of labels. By default, we use coordinate-based interaction, following recent discussions led by Yutori AI.
- Context management: Following the Qwen-VL convention for web-agent training and inference, multi-turn conversations between the agent and the environment are managed through a sliding window of 4 rounds. Each round is a (state, action) pair, where the state is a screenshot and the action is the model's parsed output. The last 4 rounds are kept as full multimodal messages with images, while older history is compressed into concise text summaries (action taken, effect on the page, and webpage name) to stay within the context window budget. Following the LLaMA-Factory convention, for thinking-variant models (e.g., Qwen3-VL-Think), `<think>` blocks are stripped from the conversation history so the model never sees its own prior thoughts. On the output side, responses are parsed via regex into structured fields (Thought, Action, tool_call JSON, and Memory); the tool_call is then extracted into a standardized action format and converted to a browser command.
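As a concrete illustration of the fully asynchronous design, here is a minimal sketch of batch collection where every episode advances at its own pace, with no barrier between steps or episodes. The names and timings here are illustrative, not WebGym's actual code:

```python
import asyncio
import random

async def rollout_episode(episode_id: int, max_steps: int = 3) -> list:
    """Collect one episode step by step, with no cross-episode barriers."""
    trajectory = []
    for step in range(max_steps):
        # Stand-in for generating an action and waiting on the environment;
        # each episode advances independently of the others.
        await asyncio.sleep(random.uniform(0.01, 0.05))
        trajectory.append((episode_id, step))
    return trajectory

async def collect_batch(num_episodes: int) -> list:
    # No synchronization between steps or episodes: episodes finish whenever
    # they finish, which evens out the load on the machine.
    tasks = [rollout_episode(i) for i in range(num_episodes)]
    return await asyncio.gather(*tasks)

trajectories = asyncio.run(collect_batch(4))
```

Because `asyncio.gather` preserves task order, trajectories come back in submission order even though episodes interleave freely while running.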
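The monitored client loop can be sketched as follows. This is a deliberately simplified mock with a single FIFO queue and illustrative status names, not the framework's actual two-level implementation:

```python
import queue
import threading
import time
from collections import Counter

# Shared status table: one entry per rollout thread (one "cell" in the monitor).
status: dict = {}
lock = threading.Lock()
actions: queue.Queue = queue.Queue()  # client-side FIFO of pending actions

def rollout_thread(tid: int, steps: int = 2) -> None:
    """Cycle each thread through GENERATING -> WAITING_SERVER, reporting status."""
    for step in range(steps):
        with lock:
            status[tid] = "GENERATING"
        time.sleep(0.01)          # stand-in for policy inference
        actions.put((tid, step))  # enqueue the action for the server
        with lock:
            status[tid] = "WAITING_SERVER"
        time.sleep(0.01)          # stand-in for the server round trip
    with lock:
        status[tid] = "DONE"

def monitor_snapshot() -> Counter:
    """Aggregate per-thread statuses, like the Task Monitor's summary row."""
    with lock:
        return Counter(status.values())

threads = [threading.Thread(target=rollout_thread, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Calling `monitor_snapshot()` while threads are running yields the live distribution of statuses; after `join`, all threads report `DONE`.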
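The synchronous collect/free/train loop can be sketched as below. The function bodies are placeholders, not the actual vLLM and LLaMA-Factory invocations that `run.sh` performs:

```python
# Hypothetical sketch of the fully synchronous RL loop: each iteration
# collects on all machines, frees inference memory, then trains.

def collect_rollouts(iteration: int) -> list:
    # Would launch vLLM inference servers and roll out trajectories on k nodes.
    return [f"traj-{iteration}-{node}" for node in range(2)]

def free_inference_memory() -> None:
    # Would shut down vLLM so the GPUs are free for training.
    pass

def train_on(trajectories: list) -> str:
    # Would invoke LLaMA-Factory on the trajectories in shared storage.
    return f"ckpt-after-{len(trajectories)}-trajs"

checkpoints = []
for iteration in range(2):
    trajs = collect_rollouts(iteration)  # phase 1: rollout on all k machines
    free_inference_memory()              # phase 2: release GPU memory
    checkpoints.append(train_on(trajs))  # phase 3: synchronous training
```

The key property this sketch preserves is that no training step begins until the whole batch is collected, which keeps off-policyness minimal.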
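To make the two interaction modes concrete, here is a hypothetical sketch of the two action payloads and their conversion to a browser command. The field names and command strings are illustrative, not the framework's actual schema:

```python
# Set-of-Marks: the policy refers to an annotated element by its id.
som_action = {"action": "click", "element_id": 12}
# Coordinate-based (the default): the policy outputs raw screen coordinates.
coord_action = {"action": "click", "x": 640, "y": 360}

def to_browser_command(action: dict) -> str:
    """Convert a parsed action dict into an illustrative browser command."""
    if "element_id" in action:
        return f"click_element({action['element_id']})"
    return f"click_at({action['x']}, {action['y']})"
```

The same downstream conversion step can serve both modes; only the way the target is addressed differs.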
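The sliding-window context management can be sketched as follows. The message shapes here are illustrative; the actual format follows the Qwen-VL convention:

```python
WINDOW = 4  # rounds kept as full multimodal messages

def build_context(history: list) -> list:
    """Keep the last WINDOW rounds with images; summarize older rounds as text."""
    recent, older = history[-WINDOW:], history[:-WINDOW]
    # Older rounds collapse to concise text: action taken, effect, page name.
    summaries = [
        {"role": "user", "text": f"[summary] {h['action']} -> {h['effect']} on {h['page']}"}
        for h in older
    ]
    # Recent rounds keep the screenshot so the model can ground its next action.
    full = [
        {"role": "user", "image": h["screenshot"], "action": h["action"]}
        for h in recent
    ]
    return summaries + full

history = [
    {"screenshot": f"shot{i}.png", "action": f"click-{i}",
     "effect": "navigated", "page": f"page{i}"}
    for i in range(6)
]
context = build_context(history)
```

With 6 rounds of history, the first 2 become text summaries and the last 4 keep their images, bounding the multimodal token budget.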
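The regex-based response parsing can be sketched as below. The response template shown is an assumption based on the field names above, not the exact prompt format:

```python
import json
import re

RESPONSE = """Thought: the search box is at the top of the page.
Action: click the search box.
tool_call: {"name": "click", "arguments": {"x": 320, "y": 48}}
Memory: currently on the shop homepage."""

def parse_response(text: str) -> dict:
    """Extract the structured fields, then decode tool_call into an action dict."""
    fields = {}
    for key in ("Thought", "Action", "tool_call", "Memory"):
        match = re.search(rf"^{key}:\s*(.+)$", text, re.MULTILINE)
        fields[key] = match.group(1).strip() if match else None
    # The tool_call field is JSON: decode it into a standardized action format
    # that can then be converted to a browser command.
    fields["tool_call"] = json.loads(fields["tool_call"])
    return fields

parsed = parse_response(RESPONSE)
```

Keeping the fields line-anchored (`re.MULTILINE`) makes the parser robust to multi-sentence content within each field's single line.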
Please refer to our ReadTheDocs documentation for a quick start. It contains comprehensive information about this codebase. You can also explore the codebase interactively via DeepWiki.
If you cannot open this online link, feel free to host the documentation yourself:
```shell
# Install documentation dependencies
pip install -r docs/requirements-docs.txt
pip install sphinx-autobuild

# Build the documentation
cd docs
bash build.sh

# Host with live reload (optional)
bash host.sh  # Serves at http://localhost:8000
```

The host.sh script uses sphinx-autobuild to automatically rebuild and serve the documentation with live reload whenever you change the source files.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.
Please refer to SECURITY.md for information on reporting security vulnerabilities.
We hope you can kickstart exciting projects with WebGym! If you use the WebGym code in your project, please cite it as:
@article{bai2026webgym,
title={WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks},
author={Bai, Hao and Taymanov, Alexey and Zhang, Tong and Kumar, Aviral and Whitehead, Spencer},
journal={arXiv preprint arXiv:2601.02439},
year={2026}
}



