From a041bcd745eb2cb41f42970614a1a1689e4a285f Mon Sep 17 00:00:00 2001 From: Joseph Schlesinger Date: Thu, 1 Jan 2026 07:39:10 +0000 Subject: [PATCH] Add Modal documentation - Source: llms.txt (modal.com/llms.txt) - Files: 1 file (modal-full.md) - Size: 1.7 MB - Path: docs/llms-txt/modal/ - Description: Modal serverless platform documentation covering functions, GPU acceleration, deployment, scheduled jobs, volumes, and containerization Generated with Claude Code --- docs/llms-txt/modal/modal-full.md | 50791 ++++++++++++++++++++++++++++ index.yaml | 859 +- scripts/llms-sites.yaml | 3 + 3 files changed, 51253 insertions(+), 400 deletions(-) create mode 100644 docs/llms-txt/modal/modal-full.md diff --git a/docs/llms-txt/modal/modal-full.md b/docs/llms-txt/modal/modal-full.md new file mode 100644 index 0000000000..c2e98c6318 --- /dev/null +++ b/docs/llms-txt/modal/modal-full.md @@ -0,0 +1,50791 @@ +# Modal Documentation + +Source: https://modal.com/llms-full.txt + +--- + +# Modal llms-full.txt + +> Modal is a platform for running Python code in the cloud with minimal +> configuration, especially for serving AI models and high-performance batch +> processing. It supports fast prototyping, serverless APIs, scheduled jobs, +> GPU inference, distributed volumes, and sandboxes. + +Important notes: + +- Modal's primitives are embedded in Python and tailored for AI/GPU use cases, + but they can be used for general-purpose cloud compute. +- Modal is a serverless platform, meaning you are only billed for resources used + and can spin up containers on demand in seconds. + +You can sign up for free at [https://modal.com] and get $30/month of credits. + +## Guides + +### Custom container images + +#### Defining Images + +# Images + +This guide walks you through how to define a Modal Image, the environment your Modal code runs in. + +The typical flow for defining an Image in Modal is +[method chaining](https://jugad2.blogspot.com/2016/02/examples-of-method-chaining-in-python.html) +starting from a base Image, like this: + +```python +image = ( + modal.Image.debian_slim(python_version="3.13") + .apt_install("git") + .uv_pip_install("torch<3") + .env({"HALT_AND_CATCH_FIRE": "0"}) + .run_commands("git clone https://github.com/modal-labs/agi && echo 'ready to go!'") +) +``` + +If you have your own container image defintions, like a Dockerfile or a registry link, you can use those too! +See [this guide](https://modal.com/docs/guide/existing-images). + +This page is a high-level guide to using Modal Images. +For reference documentation on the `modal.Image` object, see +[this page](https://modal.com/docs/reference/modal.Image). + +## What are Images? + +Your code on Modal runs in _containers_. Containers are like light-weight +virtual machines -- container engines use +[operating system tricks](https://earthly.dev/blog/chroot/) to isolate programs +from each other ("containing" them), making them work as though they were +running on their own hardware with their own filesystem. This makes execution +environments more reproducible, for example by preventing accidental +cross-contamination of environments on the same machine. For added security, +Modal runs containers using the sandboxed +[gVisor container runtime](https://cloud.google.com/blog/products/identity-security/open-sourcing-gvisor-a-sandboxed-container-runtime). + +Containers are started up from a stored "snapshot" of their filesystem state +called an _image_. Producing the image for a container is called _building_ the +image. 
+ +By default, Modal Functions and Sandboxes run in a +[Debian Linux](https://en.wikipedia.org/wiki/Debian) container with a basic +Python installation of the same minor version `v3.x` as your local Python +interpreter. + +To make your Apps and Functions useful, you will probably need some third party system packages +or Python libraries. Modal provides a number of options to customize your container images at +different levels of abstraction and granularity, from high-level convenience +methods like `pip_install` through wrappers of core container image build +features like `RUN` and `ENV`. We'll cover each of these in this guide, +along with tips and tricks for building Images effectively when using each tool. + +## Add Python packages + +The simplest and most common Image modification is to add a third party +Python package, like [`pandas`](https://pandas.pydata.org/). + +You can add Python packages to the environment by passing all the packages you +need to the [`Image.uv_pip_install`](https://modal.com/docs/reference/modal.Image#uv_pip_install) method, +which installs packages with [`uv`](https://docs.astral.sh/uv/): + +```python +import modal + +datascience_image = ( + modal.Image.debian_slim() + .uv_pip_install("pandas==2.2.0", "numpy") +) + +@app.function(image=datascience_image) +def my_function(): + import pandas as pd + import numpy as np + + df = pd.DataFrame() + ... +``` + +You can include +[Python dependency version specifiers](https://peps.python.org/pep-0508/), +like `"torch<3"`, in the arguments. But we recommend pinning dependencies +tightly, like `"torch==2.8.0"`, to improve the reproducibility and robustness +of your builds. + +If you run into any issues with +[`Image.uv_pip_install`](https://modal.com/docs/reference/modal.Image#uv_pip_install), then +you can fallback to [`Image.pip_install`](https://modal.com/docs/reference/modal.Image#pip_install) which +uses standard [`pip`](https://pip.pypa.io/en/stable/user_guide/): + +```python +datascience_image = ( + modal.Image.debian_slim(python_version="3.13") + .pip_install("pandas==2.2.0", "numpy") +) +``` + +Note that because you can define a different environment for each and every +function if you so choose, you don't need to worry about virtual +environment management. Containers make for much better separation of concerns! + +If you want to run a specific version of Python remotely rather than just +matching the one you're running locally, provide the `python_version` as a +string when constructing the base image, like we did above. + +## Add local files with `add_local_dir` and `add_local_file` + +Sometimes your containers need a dependency that's not available on the Internet, +like configuration files or code on your laptop. + +To forward files from your local system use the +`image.add_local_dir` and `image.add_local_file` Image methods. + +```python +image = modal.Image.debian_slim().add_local_dir("/user/erikbern/.aws", remote_path="/root/.aws") +``` + +By default, these files are added to your container as it starts up rather than introducing +a new Image layer. This means that the redeployment after making changes is really quick, but +also means you can't run additional build steps after. You can specify a `copy=True` argument +to the `add_local_` methods to instead force the files to be included in the built Image. 
+ +### Add local Python code with `add_local_python_source` + +You can add Python code that's importable locally to your container +by providing the module name to +[`Image.add_local_python_source`](https://modal.com/docs/reference/modal.Image#add_local_python_source). + +```python +image_with_module = modal.Image.debian_slim().add_local_python_source("local_module") + +@app.function(image=image_with_module) +def f(): + import local_module + + local_module.do_stuff() +``` + +The difference from `add_local_dir` is that `add_local_python_source` takes module names as arguments +instead of a file system path and looks up the local package's or module's location via Python's importing +mechanism. The files are then added to directories that make them importable in containers in the +same way as they are locally. + +This is intended for pure Python auxiliary modules that are part of your project and that your code imports. +Third party packages should be installed via +[`Image.uv_pip_install`](https://modal.com/docs/reference/modal.Image#uv_pip_install) or similar. + +### What if I have different Python packages locally and remotely? + +You might want to use packages inside your Modal code that you don't have on +your local computer. In the example above, we build a container that uses +`pandas`. But if we don't have `pandas` locally, on the computer building the +Modal App, we can't put `import pandas` at the top of the script, since it would +cause an `ImportError`. + +The easiest solution to this is to put `import pandas` in the function body +instead, as you can see above. This means that `pandas` is only imported when +running inside the remote Modal container, which has `pandas` installed. + +Be careful about what you return from Modal Functions that have different +packages installed than the ones you have locally! Modal Functions return Python +objects, like `pandas.DataFrame`s, and if your local machine doesn't have +`pandas` installed, it won't be able to handle a `pandas` object (the error +message you see will mention +[serialization](https://hazelcast.com/glossary/serialization/)/[deserialization](https://hazelcast.com/glossary/deserialization/)). + +If you have a lot of Functions and a lot of Python packages, you might want to +keep the imports in the global scope so that every function can use the same +imports. In that case, you can use the +[`Image.imports`](https://modal.com/docs/reference/modal.Image#imports) context manager: + +```python +pandas_image = modal.Image.debian_slim().pip_install("pandas", "numpy") + +with pandas_image.imports(): + import pandas as pd + import numpy as np + +@app.function(image=pandas_image) +def my_function(): + df = pd.DataFrame() + ... +``` + +Because these imports happen before a new container processes its first input, +you can combine this decorator with [memory snapshots](https://modal.com/docs/guide/memory-snapshot) +to improve [cold start performance](https://modal.com/docs/guide/cold-start#share-initialization-work-across-cold-starts-with-memory-snapshots) +for Functions that frequently scale from zero. 
+ +## Install system packages with `.apt_install` + +You can install Linux packages with the [`apt` package manager](https://www.debian.org/doc/manuals/apt-guide/index.en.html) +using [`Image.apt_install`](https://modal.com/docs/reference/modal.Image#apt_install): + +```python +image = modal.Image.debian_slim().apt_install("git", "curl") +``` + +## Set environment variables with `.env` + +You can change the environment variables that your code sees +(in, e.g., [`os.environ`](https://docs.python.org/3/library/os.html#os.environ)) +by passing a dictionary to [`Image.env`](https://modal.com/docs/reference/modal.Image#env): + +```python +image = modal.Image.debian_slim().env({"PORT": "6443"}) +``` + +Environment variable names and values must be strings. + +## Run shell commands with `.run_commands` + +You can supply shell commands that should be executed when building the +Image to [`Image.run_commands`](https://modal.com/docs/reference/modal.Image#run_commands): + +```python +image_with_repo = ( + modal.Image.debian_slim().apt_install("git").run_commands( + "git clone https://github.com/modal-labs/gpu-glossary" + ) +) +``` + +## Run a Python function during your build with `.run_function` + +You can run Python code as a build step using the +[`Image.run_function`](https://modal.com/docs/reference/modal.Image#run_function) method. + +For example, you can use this to download model parameters from Hugging Face into +your Image: + +```python +import os + +def download_models() -> None: + import diffusers + + model_name = "segmind/small-sd" + pipe = diffusers.StableDiffusionPipeline.from_pretrained( + model_name, use_auth_token=os.environ["HF_TOKEN"] + ) + +hf_cache = modal.Volume.from_name("hf-cache") + +image = ( + modal.Image.debian_slim() + .pip_install("diffusers[torch]", "transformers", "ftfy", "accelerate") + .run_function( + download_models, + secrets=[modal.Secret.from_name("huggingface-secret")], + volumes={"/root/.cache/huggingface": hf_cache}, + ) +) +``` + +For details on storing model weights on Modal, see +[this guide](https://modal.com/docs/guide/model-weights). + +Essentially, this is equivalent to running a Modal Function and snapshotting the +resulting filesystem as a new Image. Any kwargs accepted by [`@app.function`](https://modal.com/docs/reference/modal.App#function) +([`Volume`s](https://modal.com/docs/guide/volumes), [`Secret`s](https://modal.com/docs/guide/secrets), specifications of +resources like [GPUs](https://modal.com/docs/guide/gpu)) can be supplied here. + +Whenever you change other features of your Image, like the base Image or the +version of a Python package, the Image will automatically be rebuilt the next +time it is used. This is a bit more complicated when changing the contents of +functions. See the +[reference documentation](https://modal.com/docs/reference/modal.Image#run_function) for details. + +## Attach GPUs during setup + +If a step in the setup of your Image should be run on an instance with +a GPU (e.g., so that a package can query the GPU to set compilation flags), pass the +desired GPU type when defining that step: + +```python +image = ( + modal.Image.debian_slim() + .pip_install("bitsandbytes", gpu="H100") +) +``` + +## Use `mamba` instead of `pip` with `micromamba_install` + +`pip` installs Python packages, but some Python workloads require the +coordinated installation of system packages as well. The `mamba` package manager +can install both. 
Modal provides a pre-built +[Micromamba](https://mamba.readthedocs.io/en/latest/user_guide/micromamba.html) +base image that makes it easy to work with `micromamba`: + +```python +app = modal.App("bayes-pgm") + +numpyro_pymc_image = ( + modal.Image.micromamba() + .micromamba_install("pymc==5.10.4", "numpyro==0.13.2", channels=["conda-forge"]) +) + +@app.function(image=numpyro_pymc_image) +def sample(): + import pymc as pm + import numpyro as np + + print(f"Running on PyMC v{pm.__version__} with JAX/numpyro v{np.__version__} backend") + ... +``` + +## Image caching and rebuilds + +Modal uses the definition of an Image to determine whether it needs to be +rebuilt. If the definition hasn't changed since the last time you ran or +deployed your App, the previous version will be pulled from the cache. + +Images are cached per layer (i.e., per `Image` method call), and breaking +the cache on a single layer will cause cascading rebuilds for all subsequent +layers. You can shorten iteration cycles by defining frequently-changing +layers last so that the cached version of all other layers can be used. + +In some cases, you may want to force an Image to rebuild, even if the +definition hasn't changed. You can do this by adding the `force_build=True` +argument to any of the Image building methods. + +```python +image = ( + modal.Image.debian_slim() + .apt_install("git") + .pip_install("slack-sdk", force_build=True) + .run_commands("echo hi") +) +``` + +As in other cases where a layer's definition changes, both the `pip_install` and +`run_commands` layers will rebuild, but the `apt_install` will not. Remember to +remove `force_build=True` after you've rebuilt the Image, or it will +rebuild every time you run your code. + +Alternatively, you can set the `MODAL_FORCE_BUILD` environment variable (e.g. +`MODAL_FORCE_BUILD=1 modal run ...`) to rebuild all images attached to your App. +But note that when you rebuild a base layer, the cache will be invalidated for _all_ +Images that depend on it, and they will rebuild the next time you run or deploy +any App that uses that base. If you're debugging an issue with your Image, a better +option might be using `MODAL_IGNORE_CACHE=1`. This will rebuild the Image from the +top without breaking the Image cache or affecting subsequent builds. + +## Image builder updates + +Because changes to base images will cause cascading rebuilds, Modal is +conservative about updating the base definitions that we provide. But many +things are baked into these definitions, like the specific versions of the Image +OS, the included Python, and the Modal client dependencies. + +We provide a separate mechanism for keeping base images up-to-date without +causing unpredictable rebuilds: the "Image Builder Version". This is a workspace +level-configuration that will be used for every Image built in your workspace. +We release a new Image Builder Version every few months but allow you to update +your workspace's configuration when convenient. After updating, your next +deployment will take longer, because your Images will rebuild. You may also +encounter problems, especially if your Image definition does not pin the version +of the third-party libraries that it installs (as your new Image will get the +latest version of these libraries, which may contain breaking changes). + +You can set the Image Builder Version for your workspace by going to your +[workspace settings](https://modal.com/settings/image-config). This page also documents the +important updates in each version. 
+ +#### Using existing container images + +# Using existing images + +This guide walks you through how to use an existing container image as a Modal Image. + +```python notest +sklearn_image = modal.Image.from_registry("huanjason/scikit-learn") +custom_image = modal.Image.from_dockerfile("./src/Dockerfile") +``` + +## Load an image from a public registry with `.from_registry` + +To load an image from a public registry, just pass the image name, including any tags, to [`Image.from_registry`](https://modal.com/docs/reference/modal.Image#from_registry): + +```python +sklearn_image = modal.Image.from_registry("huanjason/scikit-learn") + +@app.function(image=sklearn_image) +def fit_knn(): + from sklearn.neighbors import KNeighborsClassifier + ... +``` + +The `from_registry` method can load images from all public registries, such as +[Nvidia's `nvcr.io`](https://catalog.ngc.nvidia.com/containers), +[AWS ECR](https://aws.amazon.com/ecr/), and +[GitHub's `ghcr.io`](https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-container-registry). + +You can further modify the image [just like any other Modal Image](https://modal.com/docs/guide/images): + +```python continuation +data_science_image = sklearn_image.uv_pip_install("polars", "datasette") +``` + +You can use external images so long as + +- The image is built for the + [`linux/amd64` platform](https://unix.stackexchange.com/questions/53415/why-are-64-bit-distros-often-called-amd64) +- The image has a [compatible `ENTRYPOINT`](#entrypoint) + +Additionally, to be used with a Modal Function, the image needs to have `python` and `pip` +installed and available on the `$PATH`. +If an existing image does not have either `python` or `pip` set up compatibly, you +can still use it. Just provide a version number as the `add_python` argument to +install a reproducible +[standalone build](https://github.com/indygreg/python-build-standalone) +of Python: + +```python +ubuntu_image = modal.Image.from_registry("ubuntu:22.04", add_python="3.11") +valhalla_image = modal.Image.from_registry("gisops/valhalla:latest", add_python="3.12") +``` + +There are some additional restrictions for older versions of the Modal image builder. +Image builder version is set at a workspace level via the settings page [here](https://modal.com/settings/image-config). +See the migration guides on that page for details on any additional restrictions on images. + +## Load images from private registries + +You can also use images defined in private container registries on Modal. +The exact method depends on the registry you are using. + +### Docker Hub (Private) + +To pull container images from private Docker Hub repositories, +[create an access token](https://docs.docker.com/security/for-developers/access-tokens/) +with "Read-Only" permissions and use this token value and your Docker Hub +username to create a Modal [Secret](https://modal.com/docs/guide/secrets). + +``` +REGISTRY_USERNAME=my-dockerhub-username +REGISTRY_PASSWORD=dckr_pat_REDACTED_FOR_SECURITY +``` + +Use this Secret with the +[`modal.Image.from_registry`](https://modal.com/docs/reference/modal.Image#from_registry) method. 
+ +### Elastic Container Registry (ECR) + +You can pull images from your AWS ECR account by specifying the full image URI +as follows: + +```python +import modal + +aws_secret = modal.Secret.from_name("my-aws-secret") +image = ( + modal.Image.from_aws_ecr( + "000000000000.dkr.ecr.us-east-1.amazonaws.com/my-private-registry:latest", + secret=aws_secret, + ) + .pip_install("torch", "huggingface") +) + +app = modal.App(image=image) +``` + +As shown above, you also need to use a [Modal Secret](https://modal.com/docs/guide/secrets) +containing the environment variables `AWS_ACCESS_KEY_ID`, +`AWS_SECRET_ACCESS_KEY`, and `AWS_REGION`. The AWS IAM user account associated +with those keys must have access to the private registry you want to access. + +Alternatively, you can use [OIDC token authentication](https://modal.com/docs/guide/oidc-integration#pull-images-from-aws-elastic-container-registry-ecr). + +The user needs to have the following read-only policies: + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Action": ["ecr:GetAuthorizationToken"], + "Effect": "Allow", + "Resource": "*" + }, + { + "Effect": "Allow", + "Action": [ + "ecr:BatchCheckLayerAvailability", + "ecr:GetDownloadUrlForLayer", + "ecr:GetRepositoryPolicy", + "ecr:DescribeRepositories", + "ecr:ListImages", + "ecr:DescribeImages", + "ecr:BatchGetImage", + "ecr:GetLifecyclePolicy", + "ecr:GetLifecyclePolicyPreview", + "ecr:ListTagsForResource", + "ecr:DescribeImageScanFindings" + ], + "Resource": "" + } + ] +} +``` + +You can use the IAM configuration above as a template for creating an IAM user. +You can then +[generate an access key](https://aws.amazon.com/premiumsupport/knowledge-center/create-access-key/) +and create a Modal Secret using the AWS integration option. Modal will use your +access keys to generate an ephemeral ECR token. That token is only used to pull +image layers at the time a new image is built. We don't store this token but +will cache the image once it has been pulled. + +Images on ECR must be private and follow +[image configuration requirements](https://modal.com/docs/reference/modal.Image#from_aws_ecr). + +### Google Artifact Registry and Google Container Registry + +For further detail on how to pull images from Google's image registries, see +[`modal.Image.from_gcp_artifact_registry`](https://modal.com/docs/reference/modal.Image#from_gcp_artifact_registry). + +## Bring your own image definition with `.from_dockerfile` + +You can define an Image from an existing Dockerfile by passing its path to +[`Image.from_dockerfile`](https://modal.com/docs/reference/modal.Image#from_dockerfile): + +```python +dockerfile_image = modal.Image.from_dockerfile("Dockerfile") + +@app.function(image=dockerfile_image) +def fit(): + import sklearn + ... +``` + +Note that you can still extend this Image using image builder methods! +See [the guide](https://modal.com/docs/guide/images) for details. + +### Dockerfile command compatibility + +Since Modal doesn't use Docker to build containers, we have our own +implementation of the +[Dockerfile specification](https://docs.docker.com/engine/reference/builder/). +Most Dockerfiles should work out of the box, but there are some differences to +be aware of. + +First, a few minor Dockerfile commands and flags have not been implemented yet. +These include `ONBUILD`, `STOPSIGNAL`, and `VOLUME`. +Please reach out to us if your use case requires any of these. + +Next, there are some command-specific things that may be useful when porting a +Dockerfile to Modal. 
+ +#### `ENTRYPOINT` + +While the +[`ENTRYPOINT`](https://docs.docker.com/engine/reference/builder/#entrypoint) +command is supported, there is an additional constraint to the entrypoint script +provided: when used with a Modal Function, it must also `exec` the arguments passed to it at some point. +This is so the Modal Function runtime's Python entrypoint can run after your own. Most entrypoint +scripts in Docker containers are wrappers over other scripts, so this is likely +already the case. + +If you wish to write your own entrypoint script, you can use the following as a +template: + +```bash +#!/usr/bin/env bash + +# Your custom startup commands here. + +exec "$@" # Runs the command passed to the entrypoint script. +``` + +If the above file is saved as `/usr/bin/my_entrypoint.sh` in your container, +then you can register it as an entrypoint with +`ENTRYPOINT ["/usr/bin/my_entrypoint.sh"]` in your Dockerfile, or with +[`entrypoint`](https://modal.com/docs/reference/modal.Image#entrypoint) as an +Image build step. + +```python +import modal + +image = ( + modal.Image.debian_slim() + .pip_install("foo") + .entrypoint(["/usr/bin/my_entrypoint.sh"]) +) +``` + +#### `ENV` + +We currently don't support default values in +[interpolations](https://docs.docker.com/compose/compose-file/12-interpolation/), +such as `${VAR:-default}` + +#### Fast pull from registry + +# Fast pull from registry + +The performance of pulling public and private images from registries into Modal +can be significantly improved by adopting the [eStargz](https://github.com/containerd/stargz-snapshotter/blob/main/docs/estargz.md) compression format. + +By applying eStargz compression during your image build and push, Modal will be much +more efficient at pulling down your image from the registry. + +## How to use estargz + +If you have [Buildkit](https://docs.docker.com/build/buildkit/) version greater than `0.10.0`, adopting `estargz` is as simple as +adding some flags to your `docker buildx build` command: + +- `type=registry` flag will instruct BuildKit to push the image after building. + - If you do not push the image from immediately after build and instead attempt to push it later with docker push, the image will be converted to a standard gzip image. +- `compression=estargz` specifies that we are using the [eStargz](https://github.com/containerd/stargz-snapshotter/blob/main/docs/estargz.md) compression format. +- `oci-mediatypes=true` specifies that we are using the OCI media types, which is required for eStargz. +- `force-compression=true` will recompress the entire image and convert the base image to eStargz if it is not already. + +```bash +docker buildx build --tag "//:" \ +--output type=registry,compression=estargz,force-compression=true,oci-mediatypes=true \ +. +``` + +Then reference the container image as normal in your Modal code. + +```python notest +app = modal.App( + "example-estargz-pull", + image=modal.Image.from_registry( + "public.ecr.aws/modal/estargz-example-images:text-generation-v1-esgz" + ) +) +``` + +At build time you should see the eStargz-enabled puller activate: + +``` +Building image im-TinABCTIf12345ydEwTXYZ + +=> Step 0: FROM public.ecr.aws/modal/estargz-example-images:text-generation-v1-esgz +Using estargz to speed up image pull (index loaded in 1.86s)... +Progress: 10% complete... (1.11s elapsed) +Progress: 20% complete... (3.10s elapsed) +Progress: 30% complete... (4.18s elapsed) +Progress: 40% complete... (4.76s elapsed) +Progress: 50% complete... 
(5.51s elapsed) +Progress: 62% complete... (6.17s elapsed) +Progress: 74% complete... (6.99s elapsed) +Progress: 81% complete... (7.23s elapsed) +Progress: 99% complete... (8.90s elapsed) +Progress: 100% complete... (8.90s elapsed) +Copying image... +Copied image in 5.81s +``` + +## Supported registries + +Currently, Modal supports fast estargz pulling images with the following registries: + +- AWS Elastic Container Registry (ECR) +- Docker Hub (docker.io) +- Google Artifact Registry (gcr.io, pkg.dev) + +We are working on adding support for GitHub Container Registry (ghcr.io). + +### GPUs and other resources + +#### GPU acceleration + +# GPU acceleration + +Modal makes it easy to run your code on [GPUs](https://modal.com/gpu-glossary/readme). + +## Quickstart + +Here's a simple example of a Function running on an A100 in Modal: + +```python +import modal + +image = modal.Image.debian_slim().pip_install("torch") +app = modal.App(image=image) + +@app.function(gpu="A100") +def run(): + import torch + + assert torch.cuda.is_available() +``` + +## Specifying GPU type + +You can pick a specific GPU type for your Function via the `gpu` argument. +Modal supports the following values for this parameter: + +- `T4` +- `L4` +- `A10` +- `A100` +- `A100-40GB` +- `A100-80GB` +- `L40S` +- `H100`/`H100!` +- `H200` +- `B200` + +For instance, to use a B200, you can use `@app.function(gpu="B200")`. + +Refer to our [pricing page](https://modal.com/pricing) for the latest pricing on each GPU type. + +## Specifying GPU count + +You can specify more than 1 GPU per container by appending `:n` to the GPU +argument. For instance, to run a Function with eight H100s: + +```python + +@app.function(gpu="H100:8") +def run_llama_405b_fp8(): + ... +``` + +Currently B200, H200, H100, A100, L4, T4 and L40S instances support up to 8 GPUs (up to 1,536 GB GPU RAM), +and A10 instances support up to 4 GPUs (up to 96 GB GPU RAM). Note that requesting +more than 2 GPUs per container will usually result in larger wait times. These +GPUs are always attached to the same physical machine. + +## Picking a GPU + +For running, rather than training, neural networks, we recommend starting off +with the [L40S](https://resources.nvidia.com/en-us-l40s/l40s-datasheet-28413), +which offers an excellent trade-off of cost and performance and 48 GB of GPU +RAM for storing model weights and activations. + +For more on how to pick a GPU for use with neural networks like LLaMA or Stable +Diffusion, and for tips on how to make that GPU go brrr, check out +[Tim Dettemers' blog post](https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/) +or the +[Full Stack Deep Learning page on Cloud GPUs](https://fullstackdeeplearning.com/cloud-gpus/). + +## B200 GPUs + +Modal's most powerful GPUs are the [B200s](https://www.nvidia.com/en-us/data-center/dgx-b200/), +NVIDIA's flagship data center chip for the Blackwell [architecture](https://modal.com/gpu-glossary/device-hardware/streaming-multiprocessor-architecture). + +To request a B200, set the `gpu` argument to `"B200"` + +```python +@app.function(gpu="B200:8") +def run_deepseek(): + ... +``` + +Check out [this example](https://modal.com/docs/examples/llm_inference) to see how you can use B200s to max out vLLM serving performance for LLaMA 3.1-8B. + +Before you jump for the most powerful (and so most expensive) GPU, make sure you +understand where the bottlenecks are in your computations. For example, running +language models with small batch sizes (e.g. 
one prompt at a time) results in a +[bottleneck on memory, not arithmetic](https://kipp.ly/transformer-inference-arithmetic/). +Since arithmetic throughput has risen faster than memory throughput in recent +hardware generations, speedups for memory-bound GPU jobs are not as extreme and +may not be worth the extra cost. + +## H200 and H100 GPUs + +[H200s](https://www.nvidia.com/en-us/data-center/h200/) and [H100s](https://www.nvidia.com/en-us/data-center/h100/) are the previous +generation of top-of-the-line data center chips from NVIDIA, based on the Hopper [architecture](https://modal.com/gpu-glossary/device-hardware/streaming-multiprocessor-architecture). +These GPUs have better software support than do Blackwell GPUs (e.g. popular libraries include pre-compiled kernels for Hopper, but not Blackwell), +and they often get the job done at a competitive cost, so they are a common choice of accelerator, on and off Modal. + +All H100 GPUs on the Modal platform are of the SXM variant, as can be verified by examining the +[power draw](https://modal.com/docs/guide/gpu-metrics) in the dashboard or with `nvidia-smi`. + +### Automatic upgrades to H200s + +Modal may automatically upgrade a `gpu="H100"` request to run on an H200. +This automatic upgrade does _not_ change the cost of the GPU. + +Kernels [compatible](https://modal.com/gpu-glossary/device-software/compute-capability) with H200s are also compatible with H100s, +so your code will still run, just faster, so long as it doesn't make strict assumptions about memory capacity. +An H200’s [HBM3e memory](https://modal.com/gpu-glossary/device-hardware/gpu-ram) +has a capacity of 141 GB and a bandwidth of 4.8TB/s, 1.75x larger and 1.4x faster than an NVIDIA H100 with HBM3. + +In cases where an automatic upgrade to H200 would not be helpful (for instance, benchmarking) you can pass +`gpu=H100!` to avoid it. + +## A100 GPUs + +[A100s](https://www.nvidia.com/en-us/data-center/a100/) are based on NVIDIA's Ampere [architecture](https://modal.com/gpu-glossary/device-hardware/streaming-multiprocessor-architecture). +Modal offers two versions of the A100: one with 40 GB of RAM and another with 80 GB of RAM. + +To request an A100 with 40 GB of [GPU memory](https://modal.com/gpu-glossary/device-hardware/gpu-ram), use `gpu="A100"`: + +```python +@app.function(gpu="A100") +def qwen_7b(): + ... +``` + +Modal may automatically upgrade a `gpu="A100"` request to run on an 80 GB A100. +This automatic upgrade does _not_ change the cost of the GPU. + +You can specifically request a 40GB A100 with the string `A100-40GB`. +To specifically request an 80 GB A100, use the string `A100-80GB`: + +```python +@app.function(gpu="A100-80GB") +def llama_70b_fp8(): + ... +``` + +## GPU fallbacks + +Modal allows specifying a list of possible GPU types, suitable for Functions that are +compatible with multiple options. Modal respects the ordering of this list and +will try to allocate the most preferred GPU type before falling back to less +preferred ones. + +```python +@app.function(gpu=["H100", "A100-40GB:2"]) +def run_on_80gb(): + ... +``` + +See [this example](https://modal.com/docs/examples/gpu_fallbacks) for more detail. + +## Multi GPU training + +Modal currently supports multi-GPU training on a single node, with multi-node training in closed beta ([contact us](https://modal.com/slack) for access). +Depending on which framework you are using, you may need to use different techniques to train on multiple GPUs. 
+ +If the framework re-executes the entrypoint of the Python process (like [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/index.html)) you need to either set the strategy to `ddp_spawn` or `ddp_notebook` if you wish to invoke the training directly. Another option is to run the training script as a subprocess instead. + +```python +@app.function(gpu="A100:2") +def run(): + import subprocess + import sys + subprocess.run( + ["python", "train.py"], + stdout=sys.stdout, stderr=sys.stderr, + check=True, + ) +``` + +## Examples and more resources + +For more information about GPUs in general, check out our [GPU Glossary](https://modal.com/gpu-glossary/readme). + +Or take a look some examples of Modal apps using GPUs: + +- [Fine-tune a character LoRA for your pet](https://modal.com/docs/examples/diffusers_lora_finetune) +- [Fast LLM inference on big GPUs](https://modal.com/docs/examples/llm_inference) +- [Stable Diffusion with a CLI, API, and web UI](https://modal.com/docs/examples/text_to_image) +- [Rendering Blender videos](https://modal.com/docs/examples/blender_video) + +#### Using CUDA on Modal + +# Using CUDA on Modal + +Modal makes it easy to accelerate your workloads with datacenter-grade NVIDIA GPUs. + +To take advantage of the hardware, you need to use matching software: the CUDA stack. +This guide explains the components of that stack and how to install them on Modal. +For more on which GPUs are available on Modal and how to choose a GPU for your use case, +see [this guide](https://modal.com/docs/guide/gpu). For a deep dive on both the +[GPU hardware](https://modal.com/gpu-glossary/device-hardware) and [software](https://modal.com/gpu-glossary/device-software) +and for even more detail on [the CUDA stack](https://modal.com/gpu-glossary/host-software/), +see our [GPU Glossary](https://modal.com/gpu-glossary/readme). + +Here's the tl;dr: + +- The [NVIDIA Accelerated Graphics Driver for Linux-x86_64](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#driver-installation), version 575.57.08, + and [CUDA Driver API](https://docs.nvidia.com/cuda/archive/12.9.0/cuda-driver-api/index.html), version 12.9, are already installed. + You can call `nvidia-smi` or run compiled CUDA programs from any Modal Function with access to a GPU. +- That means you can install many popular libraries like `torch` that bundle their other CUDA dependencies [with a simple `pip_install`](#install-gpu-accelerated-torch-and-transformers-with-pip_install). +- For bleeding-edge libraries like `flash-attn`, you may need to install CUDA dependencies manually. + To make your life easier, [use an existing image](#for-more-complex-setups-use-an-officially-supported-cuda-image). + +## What is CUDA? + +When someone refers to "installing CUDA" or "using CUDA", +they are referring not to a library, but to a +[stack](https://modal.com/gpu-glossary/host-software/cuda-software-platform) with multiple layers. +Your application code (and its dependencies) can interact +with the stack at different levels. + +![The CUDA stack](../../assets/docs/cuda-stack-diagram.png) + +This leads to a lot of confusion. To help clear that up, the following sections explain each component in detail. + +### Level 0: Kernel-mode driver components + +At the lowest level are the [_kernel-mode driver components_](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#nvidia-open-gpu-kernel-modules). +The Linux kernel is essentially a single program operating the entire machine and all of its hardware. 
+To add hardware to the machine, this program is extended by loading new modules into it. +These components communicate directly with hardware -- in this case the GPU. + +Because they are kernel modules, these driver components are tightly integrated with the host operating system +that runs your containerized Modal Functions and are not something you can inspect or change yourself. + +### Level 1: User-mode driver API + +All action in Linux that doesn't occur in the kernel occurs in [user space](https://en.wikipedia.org/wiki/User_space). +To talk to the kernel drivers from our user space programs, we need _user-mode driver components_. + +Most prominently, that includes: + +- the [CUDA Driver API](https://modal.com/gpu-glossary/host-software/cuda-driver-api), + a [shared object](https://en.wikipedia.org/wiki/Shared_library) called `libcuda.so`. + This object exposes functions like [`cuMemAlloc`](https://docs.nvidia.com/cuda/archive/12.8.0/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gb82d2a09844a58dd9e744dc31e8aa467), + for allocating GPU memory. +- the [NVIDIA management library](https://developer.nvidia.com/management-library-nvml), `libnvidia-ml.so`, and its command line interface [`nvidia-smi`](https://developer.nvidia.com/system-management-interface). + You can use these tools to check the status of the system's GPU(s). + +These components are installed on all Modal machines with access to GPUs. +Because they are user-level components, you can use them directly: + +```python runner:ModalRunner +import modal + +app = modal.App() + +@app.function(gpu="any") +def check_nvidia_smi(): + import subprocess + output = subprocess.check_output(["nvidia-smi"], text=True) + assert "Driver Version:" in output + assert "CUDA Version:" in output + print(output) + return output +``` + +### Level 2: CUDA Toolkit + +Wrapping the CUDA Driver API is the [CUDA Runtime API](https://modal.com/gpu-glossary/host-software/cuda-runtime-api), the `libcudart.so` shared library. +This API includes functions like [`cudaLaunchKernel`](https://docs.nvidia.com/cuda/archive/12.8.0/cuda-runtime-api/group__CUDART__HIGHLEVEL.html#group__CUDART__HIGHLEVEL_1g7656391f2e52f569214adbfc19689eb3) +and is more commonly used in CUDA programs (see [this HackerNews comment](https://news.ycombinator.com/item?id=20616385) for color commentary on why). +This shared library is _not_ installed by default on Modal. + +The CUDA Runtime API is generally installed as part of the larger [NVIDIA CUDA Toolkit](https://docs.nvidia.com/cuda/index.html), +which includes the [NVIDIA CUDA compiler driver](https://modal.com/gpu-glossary/host-software/nvcc) (`nvcc`) and its toolchain +and a number of [useful goodies](https://modal.com/gpu-glossary/host-software/cuda-binary-utilities) for writing and debugging CUDA programs (`cuobjdump`, `cudnn`, profilers, etc.). + +Contemporary GPU-accelerated machine learning workloads like LLM inference frequently make use of many components of the CUDA Toolkit, +such as the run-time compilation library [`nvrtc`](https://docs.nvidia.com/cuda/archive/12.8.0/nvrtc/index.html). + +So why aren't these components installed along with the drivers? +A compiled CUDA program can run without the CUDA Runtime API installed on the system, +by [statically linking](https://en.wikipedia.org/wiki/Static_library) the CUDA Runtime API into the program binary, +though this is fairly uncommon for CUDA-accelerated Python programs. 
+Additionally, older versions of these components are needed for some applications +and some application deployments even use several versions at once. +Both patterns are compatible with the host machine driver provided on Modal. + +## Install GPU-accelerated `torch` and `transformers` with `pip_install` + +The components of the CUDA Toolkit can be installed via `pip`, +via PyPI packages like [`nvidia-cuda-runtime-cu12`](https://pypi.org/project/nvidia-cuda-runtime-cu12/) +and [`nvidia-cuda-nvrtc-cu12`](https://pypi.org/project/nvidia-cuda-nvrtc-cu12/). +These components are listed as dependencies of some popular GPU-accelerated Python libraries, like `torch`. + +Because Modal already includes the lower parts of the CUDA stack, you can install these libraries +with [the `pip_install` method of `modal.Image`](https://modal.com/docs/guide/images#add-python-packages-with-pip_install), just like any other Python library: + +```python +image = modal.Image.debian_slim().pip_install("torch") + +@app.function(gpu="any", image=image) +def run_torch(): + import torch + has_cuda = torch.cuda.is_available() + print(f"It is {has_cuda} that torch can access CUDA") + return has_cuda +``` + +Many libraries for running open-weights models, like `transformers` and `vllm`, +use `torch` under the hood and so can be installed in the same way: + +```python +image = modal.Image.debian_slim().pip_install("transformers[torch]") +image = image.apt_install("ffmpeg") # for audio processing + +@app.function(gpu="any", image=image) +def run_transformers(): + from transformers import pipeline + transcriber = pipeline(model="openai/whisper-tiny.en", device="cuda") + result = transcriber("https://modal-cdn.com/mlk.flac") + print(result["text"]) # I have a dream that one day this nation will rise up live out the true meaning of its creed +``` + +## For more complex setups, use an officially-supported CUDA image + +The disadvantage of installing the CUDA stack via `pip` is that +many other libraries that depend on its components being installed as normal system packages cannot find them. + +For these cases, we recommend you use an image that already has the full CUDA stack installed as system packages +and all environment variables set correctly, like the [`nvidia/cuda:*-devel-*` images on Docker Hub](https://hub.docker.com/r/nvidia/cuda). + +[TensorRT-LLM](https://nvidia.github.io/TensorRT-LLM/overview.html) is an inference engine that accelerates and optimizes performance for the large language models. It requires the full CUDA toolkit for installation. 
+ +```python +cuda_version = "12.8.1" # should be no greater than host CUDA version +flavor = "devel" # includes full CUDA toolkit +operating_sys = "ubuntu24.04" +tag = f"{cuda_version}-{flavor}-{operating_sys}" +HF_CACHE_PATH = "/cache" + +image = ( + modal.Image.from_registry(f"nvidia/cuda:{tag}", add_python="3.12") + .entrypoint([]) # remove verbose logging by base image on entry + .apt_install("libopenmpi-dev") # required for tensorrt + .pip_install("tensorrt-llm==0.19.0", "pynvml", extra_index_url="https://pypi.nvidia.com") + .pip_install("hf-transfer", "huggingface_hub[hf_xet]") + .env({"HF_HUB_CACHE": HF_CACHE_PATH, "HF_HUB_ENABLE_HF_TRANSFER": "1", "PMIX_MCA_gds": "hash"}) +) + +app = modal.App("tensorrt-llm", image=image) +hf_cache_volume = modal.Volume.from_name("hf_cache_tensorrt", create_if_missing=True) + +@app.function(gpu="A10G", volumes={HF_CACHE_PATH: hf_cache_volume}) +def run_tiny_model(): + from tensorrt_llm import LLM, SamplingParams + + sampling_params = SamplingParams(temperature=0.8, top_p=0.95) + + llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0") + + output = llm.generate("The capital of France is", sampling_params) + print(f"Generated text: {output.outputs[0].text}") + return output.outputs[0].text +``` + +Make sure to choose a version of CUDA that is no greater than the version provided by the host machine. +Older minor (`12.*`) versions are guaranteed to be compatible with the host machine's driver, +but older major (`11.*`, `10.*`, etc.) versions may not be. + +## What next? + +For more on accessing and choosing GPUs on Modal, check out [this guide](https://modal.com/docs/guide/gpu). +To dive deep on GPU internals, check out our [GPU Glossary](https://modal.com/gpu-glossary/readme). + +To see these installation patterns in action, check out these examples: + +- [Fast LLM inference on big GPUs](https://modal.com/docs/examples/llm_inference) +- [Finetune a character LoRA for your pet](https://modal.com/docs/examples/diffusers_lora_finetune) +- [Optimized Flux inference](https://modal.com/docs/examples/flux) + +#### Reserving CPU and memory + +# Reserving CPU and memory + +Each Modal container has a default reservation of 0.125 CPU cores and 128 MiB of memory. +Containers can exceed this minimum if the worker has available CPU or memory. +You can also guarantee access to more resources by requesting a higher reservation. + +## CPU cores + +If you have code that must run on a larger number of cores, you can +request that using the `cpu` argument. This allows you to specify a +floating-point number of CPU cores: + +```python +import modal + +app = modal.App() + +@app.function(cpu=8.0) +def my_function(): + # code here will have access to at least 8.0 cores + ... +``` + +Note that this value corresponds to physical cores, not vCPUs. + +Modal also will set several environment variables that control multi-threading +behavior in linear algebra libraries (e.g., `OPENBLAS_NUM_THREADS`, +`OMP_NUM_THREADS`, `MKL_NUM_THREADS`) based on your CPU reservation. + +## Memory + +If you have code that needs more guaranteed memory, you can request it using the +`memory` argument. This expects an integer number of megabytes: + +```python +import modal + +app = modal.App() + +@app.function(memory=32768) +def my_function(): + # code here will have access to at least 32 GiB of RAM + ... +``` + +## How much can I request? + +For both CPU and memory, a maximum is enforced at Function creation time to +ensure your containers can be scheduled for execution. 
Requests exceeding the +maximum will be rejected with an +[`InvalidError`](https://modal.com/docs/reference/modal.exception#modalexceptioninvaliderror). + +## Billing + +For CPU and memory, you'll be charged based on whichever is higher: your reservation or actual usage. + +Disk requests are billed by increasing the memory request at a 20:1 ratio. For example, requesting 500 GiB of disk will increase the memory request to 25 GiB, if it is not already set higher. + +## Resource limits + +### CPU limits + +Modal containers have a default soft CPU limit that is set at 16 physical cores above the CPU request. +Given that the default CPU request is 0.125 cores, the default soft CPU limit is 16.125 cores. +Above this limit, the host will begin to throttle the CPU usage of the container. + +You can alternatively set the CPU limit explicitly: + +```python +cpu_request = 1.0 +cpu_limit = 4.0 +@app.function(cpu=(cpu_request, cpu_limit)) +def f(): + ... +``` + +### Memory limits + +Modal containers can have a hard memory limit which will 'Out of Memory' (OOM) kill +containers which attempt to exceed the limit. This functionality is useful when a process +has a serious memory leak. You can set the limit and have the container killed to avoid paying +for the leaked GBs of memory. + +Specify this limit using the [`memory` parameter](https://modal.com/docs/reference/modal.App#function): + +```python +mem_request = 1024 +mem_limit = 2048 +@app.function( + memory=(mem_request, mem_limit), +) +def f(): + ... +``` + +### Disk limits + +Running Modal containers have access to many GBs of SSD disk, but the amount +of writes is limited by: + +1. The size of the underlying worker's SSD disk capacity +2. A per-container disk quota that is set in the 100s of GBs. + +Hitting either limit will cause the container's disk writes to be rejected, which +typically manifests as an `OSError`. + +Increased disk sizes can be requested with the [`ephemeral_disk` parameter](https://modal.com/docs/reference/modal.App#function). The maximum +disk size is 3.0 TiB (3,145,728 MiB). Larger disks are intended to be used for [dataset processing](https://modal.com/docs/guide/dataset-ingestion). + +### Scaling out + +#### Scaling out + +# Scaling out + +Modal makes it trivially easy to scale compute across thousands of containers. +You won't have to worry about your App crashing if it goes viral or need to wait +a long time for your batch jobs to complete. + +For the the most part, scaling out will happen automatically, and you won't need +to think about it. But it can be helpful to understand how Modal's autoscaler +works and how you can control its behavior when you need finer control. + +## How does autoscaling work on Modal? + +Every Modal Function corresponds to an autoscaling pool of containers. The size +of the pool is managed by Modal's autoscaler. The autoscaler will spin up new +containers when there is no capacity available for new inputs, and it will spin +down containers when resources are idling. By default, Modal Functions will +scale to zero when there are no inputs to process. + +Autoscaling decisions are made quickly and frequently so that your batch jobs +can ramp up fast and your deployed Apps can respond to any sudden changes in +traffic. + +## Configuring autoscaling behavior + +Modal exposes a few settings that allow you to configure the autoscaler's +behavior. 
These settings can be passed to the `@app.function` or `@app.cls` +decorators: + +- `max_containers`: The upper limit on containers for the specific Function. +- `min_containers`: The minimum number of containers that should be kept warm, + even when the Function is inactive. +- `buffer_containers`: The size of the buffer to maintain while the Function is + active, so that additional inputs will not need to queue for a new container. +- `scaledown_window`: The maximum duration (in seconds) that individual + containers can remain idle when scaling down. + +In general, these settings allow you to trade off cost and latency. Maintaining +a larger warm pool or idle buffer will increase costs but reduce the chance that +inputs will need to wait for a new container to start. + +Similarly, a longer scaledown window will let containers idle for longer, which +might help avoid unnecessary churn for Apps that receive regular but infrequent +inputs. Note that containers may not wait for the entire scaledown window before +shutting down if the App is substantially overprovisioned. + +## Dynamic autoscaler updates + +It's also possible to update the autoscaler settings dynamically (i.e., without redeploying +the App) using the [`Function.update_autoscaler()`](https://modal.com/docs/reference/modal.Function#update_autoscaler) +method: + +```python notest +f = modal.Function.from_name("my-app", "f") +f.update_autoscaler(max_containers=100) +``` + +The autoscaler settings will revert to the configuration in the function +decorator the next time you deploy the App. Or they can be overridden by +further dynamic updates: + +```python notest +f.update_autoscaler(min_containers=2, max_containers=10) +f.update_autoscaler(min_containers=4) # max_containers=10 will still be in effect +``` + +A common pattern is to run this method in a [scheduled function](https://modal.com/docs/guide/cron) +that adjusts the size of the warm pool (or container buffer) based on the time of day: + +```python +@app.function() +def inference_server(): + ... + +@app.function(schedule=modal.Cron("0 6 * * *", timezone="America/New_York")) +def increase_warm_pool(): + inference_server.update_autoscaler(min_containers=4) + +@app.function(schedule=modal.Cron("0 22 * * *", timezone="America/New_York")) +def decrease_warm_pool(): + inference_server.update_autoscaler(min_containers=0) +``` + +When you have a [`modal.Cls`](https://modal.com/docs/reference/modal.Cls), `update_autoscaler` +is a method on an _instance_ and will control the autoscaling behavior of +containers serving the Function with that specific set of parameters: + +```python notest +MyClass = modal.Cls.from_name("my-app", "MyClass") +obj = MyClass(model_version="3.5") +obj.update_autoscaler(buffer_containers=2) # type: ignore +``` + +Note that it's necessary to disable type checking on this line, because the +object will appear as an instance of the class that you defined rather than the +Modal wrapper type. + +## Parallel execution of inputs + +If your code is running the same function repeatedly with different independent +inputs (e.g., a grid search), the easiest way to increase performance is to run +those function calls in parallel using Modal's +[`Function.map()`](https://modal.com/docs/reference/modal.Function#map) method. + +Here is an example if we had a function `evaluate_model` that takes a single +argument: + +```python +import modal + +app = modal.App() + +@app.function() +def evaluate_model(x): + ... 
+ +@app.local_entrypoint() +def main(): + inputs = list(range(100)) + for result in evaluate_model.map(inputs): # runs many inputs in parallel + ... +``` + +In this example, `evaluate_model` will be called with each of the 100 inputs +(the numbers 0 - 99 in this case) roughly in parallel and the results are +returned as an iterable with the results ordered in the same way as the inputs. + +### Exceptions + +By default, if any of the function calls raises an exception, the exception will +be propagated. To treat exceptions as successful results and aggregate them in +the results list, pass in +[`return_exceptions=True`](https://modal.com/docs/reference/modal.Function#map). + +```python +@app.function() +def my_func(a): + if a == 2: + raise Exception("ohno") + return a ** 2 + +@app.local_entrypoint() +def main(): + print(list(my_func.map(range(3), return_exceptions=True, wrap_returned_exceptions=False))) + # [0, 1, Exception('ohno'))] +``` + +Note: prior to version 1.0.5, the returned exceptions inadvertently leaked an internal +wrapper type (`modal.exceptions.UserCodeException`). To avoid breaking any user code that +was checking exception types, we're taking a gradual approach to fixing this bug. Adding +`wrap_returned_exceptions=False` will opt-in to the future default behavior and return the +underlying exception type without a wrapper. + +### Starmap + +If your function takes multiple variable arguments, you can either use +[`Function.map()`](https://modal.com/docs/reference/modal.Function#map) with one input iterator +per argument, or [`Function.starmap()`](https://modal.com/docs/reference/modal.Function#starmap) +with a single input iterator containing sequences (like tuples) that can be +spread over the arguments. This works similarly to Python's built in `map` and +`itertools.starmap`. + +```python +@app.function() +def my_func(a, b): + return a + b + +@app.local_entrypoint() +def main(): + assert list(my_func.starmap([(1, 2), (3, 4)])) == [3, 7] +``` + +### Gotchas + +Note that `.map()` is a method on the modal function object itself, so you don't +explicitly _call_ the function. + +Incorrect usage: + +```python notest +results = evaluate_model(inputs).map() +``` + +Modal's map is also not the same as using Python's builtin `map()`. While the +following will technically work, it will execute all inputs in sequence rather +than in parallel. + +Incorrect usage: + +```python notest +results = map(evaluate_model, inputs) +``` + +## Asynchronous usage + +All Modal APIs are available in both blocking and asynchronous variants. If you +are comfortable with asynchronous programming, you can use it to create +arbitrary parallel execution patterns, with the added benefit that any Modal +functions will be executed remotely. See the [async guide](https://modal.com/docs/guide/async) or +the examples for more information about asynchronous usage. + +## GPU acceleration + +Sometimes you can speed up your applications by utilizing GPU acceleration. See +the [gpu section](https://modal.com/docs/guide/gpu) for more information. + +## Scaling Limits + +Modal enforces the following limits for every function: + +- 2,000 pending inputs (inputs that haven't been assigned to a container yet) +- 25,000 total inputs (which include both running and pending inputs) + +For inputs created with `.spawn()` for async jobs, Modal allows up to 1 million pending inputs instead of 2,000. 
+ +If you try to create more inputs and exceed these limits, you'll receive a `Resource Exhausted` error, and you should retry your request later. If you need higher limits, please reach out! + +Additionally, each `.map()` invocation can process at most 1000 inputs concurrently. + +#### Input concurrency + +# Input concurrency + +As traffic to your application increases, Modal will automatically scale up the +number of containers running your Function: + +
+ +By default, each container will be assigned one input at a time. Autoscaling +across containers allows your Function to process inputs in parallel. This is +ideal when the operations performed by your Function are CPU-bound. + +For some workloads, though, it is inefficient for containers to process inputs +one-by-one. Modal supports these workloads with its _input concurrency_ feature, +which allows individual containers to process multiple inputs at the same time: + +
+ +When used effectively, input concurrency can reduce latency and lower costs. + +## Use cases + +Input concurrency can be especially effective for workloads that are primarily +I/O-bound, e.g.: + +- Querying a database +- Making external API requests +- Making remote calls to other Modal Functions + +For such workloads, individual containers may be able to concurrently process +large numbers of inputs with minimal additional latency. This means that your +Modal application will be more efficient overall, as it won't need to scale +containers up and down as traffic ebbs and flows. + +Another use case is to leverage _continuous batching_ on GPU-accelerated +containers. Frameworks such as [vLLM](https://modal.com/docs/examples/llm_inference) can +achieve the benefits of batching across multiple inputs even when those +inputs do not arrive simultaneously (because new batches are formed for each +forward pass of the model). + +Note that for CPU-bound workloads, input concurrency will likely not be as +effective (or will even be counterproductive), and you may want to use +Modal's [_dynamic batching_ feature](https://modal.com/docs/guide/dynamic-batching) instead. + +## Enabling input concurrency + +To enable input concurrency, add the `@modal.concurrent` decorator: + +```python +@app.function() +@modal.concurrent(max_inputs=100) +def my_function(input: str): + ... + +``` + +When using the class pattern, the decorator should be applied at the level of +the _class_, not on individual methods: + +```python +@app.cls() +@modal.concurrent(max_inputs=100) +class MyCls: + + @modal.method() + def my_method(self, input: str): + ... +``` + +Because all methods on a class will be served by the same containers, a class +with input concurrency enabled will concurrently run distinct methods in +addition to multiple inputs for the same method. + +**Note:** The `@modal.concurrent` decorator was added in v0.73.148 of the Modal +Python SDK. Input concurrency could previously be enabled by setting the +`allow_concurrent_inputs` parameter on the `@app.function` decorator. + +## Setting a concurrency target + +When using the `@modal.concurrent` decorator, you must always configure the +maximum number of inputs that each container will concurrently process. If +demand exceeds this limit, Modal will automatically scale up more containers. + +Additional inputs may need to queue up while these additional containers cold +start. To help avoid degraded latency during scaleup, the `@modal.concurrent` +decorator has a separate `target_inputs` parameter. When set, Modal's autoscaler +will aim for this target as it provisions resources. If demand increases faster +than new containers can spin up, the active containers will be allowed to burst +above the target up to the `max_inputs` limit: + +```python +@app.function() +@modal.concurrent(max_inputs=120, target_inputs=100) # Allow a 20% burst +def my_function(input: str): + ... +``` + +It may take some experimentation to find the right settings for these parameters +in your particular application. Our suggestion is to set the `target_inputs` +based on your desired latency and the `max_inputs` based on resource constraints +(i.e., to avoid GPU OOM). You may also consider the relative latency cost of +scaling up a new container versus overloading the existing containers. + +## Concurrency mechanisms + +Modal uses different concurrency mechanisms to execute your Function depending +on whether it is defined as synchronous or asynchronous. 
Each mechanism imposes +certain requirements on the Function implementation. Input concurrency is an +advanced feature, and it's important to make sure that your implementation +complies with these requirements to avoid unexpected behavior. + +For synchronous Functions, Modal will execute concurrent inputs on separate +threads. _This means that the Function implementation must be thread-safe._ + +```python +# Each container can execute up to 10 inputs in separate threads +@app.function() +@modal.concurrent(max_inputs=10) +def sleep_sync(): + # Function must be thread-safe + time.sleep(1) +``` + +For asynchronous Functions, Modal will execute concurrent inputs using +separate `asyncio` tasks on a single thread. This does not require thread +safety, but it does mean that the Function needs to participate in +collaborative multitasking (i.e., it should not block the event loop). + +```python +# Each container can execute up to 10 inputs with separate async tasks +@app.function() +@modal.concurrent(max_inputs=10) +async def sleep_async(): + # Function must not block the event loop + await asyncio.sleep(1) +``` + +## Gotchas + +Input concurrency is a powerful feature, but there are a few caveats that can +be useful to be aware of before adopting it. + +### Input cancellations + +Synchronous and asynchronous Functions handle input cancellations differently. +Modal will raise a `modal.exception.InputCancellation` exception in synchronous +Functions and an `asyncio.CancelledError` in asynchronous Functions. + +When using input concurrency with a synchronous Function, a single input +cancellation will terminate the entire container. If your workflow depends on +graceful input cancellations, we recommend using an asynchronous +implementation. + +### Concurrent logging + +The separate threads or tasks that are executing the concurrent inputs will +write any logs to the same stream. This makes it difficult to associate logs +with a specific input, and filtering for a specific function call in Modal's web +dashboard will show logs for all inputs running at the same time. + +To work around this, we recommend including a unique identifier in the messages +you log (either your own identifier or the `modal.current_input_id()`) so that +you can use the search functionality to surface logs for a specific input: + +```python +@app.function() +@modal.concurrent(max_inputs=10) +async def better_concurrent_logging(x: int): + logger.info(f"{modal.current_input_id()}: Starting work with {x}") +``` + +#### Batch processing + +# Batch Processing + +Modal is optimized for large-scale batch processing, allowing functions to scale to thousands of parallel containers with zero additional configuration. Function calls can be submitted asynchronously for background execution, eliminating the need to wait for jobs to finish or tune resource allocation. + +This guide covers Modal's batch processing capabilities, from basic invocation to integration with existing pipelines. + +## Background Execution with `.spawn_map` + +The fastest way to submit multiple jobs for asynchronous processing is by invoking a function with `.spawn_map`. When combined with the [`--detach`](https://modal.com/docs/reference/cli/run) flag, your App continues running until all jobs are completed. + +Here's an example of submitting 100,000 videos for parallel embedding. 
You can disconnect after submission, and the processing will continue to completion in the background: + +```python +# Kick off asynchronous jobs with `modal run --detach batch_processing.py` +import modal + +app = modal.App("batch-processing-example") +volume = modal.Volume.from_name("video-embeddings", create_if_missing=True) + +@app.function(volumes={"/data": volume}) +def embed_video(video_id: int): + # Business logic: + # - Load the video from the volume + # - Embed the video + # - Save the embedding to the volume + ... + +@app.local_entrypoint() +def main(): + embed_video.spawn_map(range(100_000)) +``` + +This pattern works best for jobs that store results externally—for example, in a [Modal Volume](https://modal.com/docs/guide/volumes), [Cloud Bucket Mount](https://modal.com/docs/guide/cloud-bucket-mounts), or your own database\*. + +_\* For database connections, consider using [Modal Proxy](https://modal.com/docs/guide/proxy-ips) to maintain a static IP across thousands of containers._ + +## Parallel Processing with `.map` + +Using `.map` allows you to offload expensive computations to powerful machines while gathering results. This is particularly useful for pipeline steps with bursty resource demands. Modal handles all infrastructure provisioning and de-provisioning automatically. + +Here's how to implement parallel video similarity queries as a single Modal function call: + +```python +# Run jobs and collect results with `modal run gather.py` +import modal + +app = modal.App("gather-results-example") + +@app.function(gpu="L40S") +def compute_video_similarity(query: str, video_id: int) -> tuple[int, int]: + # Embed video with GPU acceleration & compute similarity with query + return video_id, score + +@app.local_entrypoint() +def main(): + import itertools + + queries = itertools.repeat("Modal for batch processing") + video_ids = range(100_000) + + for video_id, score in compute_video_similarity.map(queries, video_ids): + # Process results (e.g., extract top 5 most similar videos) + pass +``` + +This example runs `compute_video_similarity` on an autoscaling pool of L40S GPUs, returning scores to a local process for further processing. + +## Integration with Existing Systems + +The recommended way to use Modal Functions within your existing data pipeline is through [deployed function invocation](https://modal.com/docs/guide/trigger-deployed-functions). After deployment, you can call Modal functions from external systems: + +```python +def external_function(inputs): + compute_similarity = modal.Function.from_name( + "gather-results-example", + "compute_video_similarity" + ) + for result in compute_similarity.map(inputs): + # Process results + pass +``` + +You can invoke Modal Functions from any Python context, gaining access to built-in observability, resource management, and GPU acceleration. + +#### Job queues + +# Job processing + +Modal can be used as a scalable job queue to handle asynchronous tasks submitted +from a web app or any other Python application. This allows you to offload up to 1 million +long-running or resource-intensive tasks to Modal, while your main application +remains responsive. + +## Creating jobs with .spawn() + +The basic pattern for using Modal as a job queue involves three key steps: + +1. Defining and deploying the job processing function using `modal deploy`. +2. Submitting a job using + [`modal.Function.spawn()`](https://modal.com/docs/reference/modal.Function#spawn) +3. 
Polling for the job's result using + [`modal.FunctionCall.get()`](https://modal.com/docs/reference/modal.FunctionCall#get) + +Here's a simple example that you can run with `modal run my_job_queue.py`: + +```python +# my_job_queue.py +import modal + +app = modal.App("my-job-queue") + +@app.function() +def process_job(data): + # Perform the job processing here + return {"result": data} + +def submit_job(data): + # Since the `process_job` function is deployed, need to first look it up + process_job = modal.Function.from_name("my-job-queue", "process_job") + call = process_job.spawn(data) + return call.object_id + +def get_job_result(call_id): + function_call = modal.FunctionCall.from_id(call_id) + try: + result = function_call.get(timeout=5) + except modal.exception.OutputExpiredError: + result = {"result": "expired"} + except TimeoutError: + result = {"result": "pending"} + return result + +@app.local_entrypoint() +def main(): + data = "my-data" + + # Submit the job to Modal + call_id = submit_job(data) + print(get_job_result(call_id)) +``` + +In this example: + +- `process_job` is the Modal function that performs the actual job processing. + To deploy the `process_job` function on Modal, run + `modal deploy my_job_queue.py`. +- `submit_job` submits a new job by first looking up the deployed `process_job` + function, then calling `.spawn()` with the job data. It returns the unique ID + of the spawned function call. +- `get_job_result` attempts to retrieve the result of a previously submitted job + using [`FunctionCall.from_id()`](https://modal.com/docs/reference/modal.FunctionCall#from_id) and + [`FunctionCall.get()`](https://modal.com/docs/reference/modal.FunctionCall#get). + [`FunctionCall.get()`](https://modal.com/docs/reference/modal.FunctionCall#get) waits indefinitely + by default. It takes an optional timeout argument that specifies the maximum + number of seconds to wait, which can be set to 0 to poll for an output + immediately. Here, if the job hasn't completed yet, we return a pending + response. +- The results of a `.spawn()` are accessible via `FunctionCall.get()` for up to + 7 days after completion. After this period, we return an expired response. + +[Document OCR Web App](https://modal.com/docs/examples/doc_ocr_webapp) is an example that uses +this pattern. + +## Integration with web frameworks + +You can easily integrate the job queue pattern with web frameworks like FastAPI. +Here's an example, assuming that you have already deployed `process_job` on +Modal with `modal deploy` as above. This example won't work if you haven't +deployed your app yet. 
+ +```python +# my_job_queue_endpoint.py +import modal + +image = modal.Image.debian_slim().pip_install("fastapi[standard]") +app = modal.App("fastapi-modal", image=image) + +@app.function() +@modal.asgi_app() +@modal.concurrent(max_inputs=20) +def fastapi_app(): + from fastapi import FastAPI + + web_app = FastAPI() + + @web_app.post("/submit") + async def submit_job_endpoint(data): + process_job = modal.Function.from_name("my-job-queue", "process_job") + + call = await process_job.spawn.aio(data) + return {"call_id": call.object_id} + + @web_app.get("/result/{call_id}") + async def get_job_result_endpoint(call_id: str): + function_call = modal.FunctionCall.from_id(call_id) + try: + result = await function_call.get.aio(timeout=0) + except modal.exception.OutputExpiredError: + return fastapi.responses.JSONResponse(content="", status_code=404) + except TimeoutError: + return fastapi.responses.JSONResponse(content="", status_code=202) + + return result + + return web_app +``` + +In this example: + +- The `/submit` endpoint accepts job data, submits a new job using + `await process_job.spawn.aio()`, and returns the job's ID to the client. +- The `/result/{call_id}` endpoint allows the client to poll for the job's + result using the job ID. If the job hasn't completed yet, it returns a 202 + status code to indicate that the job is still being processed. If the job + has expired, it returns a 404 status code to indicate that the job is not found. + +You can try this app by serving it with `modal serve`: + +```shell +modal serve my_job_queue_endpoint.py +``` + +Then interact with its endpoints with `curl`: + +```shell +# Make a POST request to your app endpoint with. +$ curl -X POST $YOUR_APP_ENDPOINT/submit?data=data +{"call_id":"fc-XXX"} + +# Use the call_id value from above. +$ curl -X GET $YOUR_APP_ENDPOINT/result/fc-XXX +``` + +## Scaling and reliability + +Modal automatically scales the job queue based on the workload, spinning up new +instances as needed to process jobs concurrently. It also provides built-in +reliability features like automatic retries and timeout handling. + +You can customize the behavior of the job queue by configuring the +`@app.function()` decorator with options like +[`retries`](https://modal.com/docs/guide/retries#function-retries), +[`timeout`](https://modal.com/docs/guide/timeouts#timeouts), and +[`max_containers`](https://modal.com/docs/guide/scale#configuring-autoscaling-behavior). + +#### Dynamic batching (beta) + +# Dynamic batching (beta) + +Modal's `@batched` feature allows you to accumulate requests +and process them in dynamically-sized batches, rather than one-by-one. + +Batching increases throughput at a potential cost to latency. +Batched requests can share resources and reuse work, reducing the time and cost per request. +Batching is particularly useful for GPU-accelerated machine learning workloads, +as GPUs are designed to maximize throughput and are frequently bottlenecked on shareable resources, +like weights stored in memory. + +Static batching can lead to unbounded latency, as the function waits for a fixed number of requests to arrive. +Modal's dynamic batching waits for the lesser of a fixed time _or_ a fixed number of requests before executing, +maximizing the throughput benefit of batching while minimizing the latency penalty. + +## Enable dynamic batching with `@batched` + +To enable dynamic batching, apply the +[`@modal.batched` decorator](https://modal.com/docs/reference/modal.batched) to the target +Python function. 
Then, wrap it in `@app.function()` and run it on Modal, +and the inputs will be accumulated and processed in batches. + +Here's what that looks like: + +```python +import modal + +app = modal.App() + +@app.function() +@modal.batched(max_batch_size=2, wait_ms=1000) +async def batch_add(xs: list[int], ys: list[int]) -> list[int]: + return [x + y for x, y in zip(xs, ys)] +``` + +When you invoke a function decorated with `@batched`, you invoke it asynchronously on individual inputs. +Outputs are returned where they were invoked. + +For instance, the code below invokes the decorated `batch_add` function above three times, but `batch_add` +only executes twice: + +```python continuation +@app.local_entrypoint() +async def main(): + inputs = [(1, 300), (2, 200), (3, 100)] + async for result in batch_add.starmap.aio(inputs): + print(f"Sum: {result}") + # Sum: 301 + # Sum: 202 + # Sum: 103 +``` + +The first time it is executed with `xs` batched to `[1, 2]` +and `ys` batched to `[300, 200]`. After about a one second delay, it is executed with `xs` +batched to `[3]` and `ys` batched to `[100]`. +The result is an iterator that yields `301`, `202`, and `101`. + +## Use `@batched` with functions that take and return lists + +For a Python function to be compatible with `@modal.batched`, it must adhere to +the following rules: + +- ** The inputs to the function must be lists. ** + In the example above, we pass `xs` and `ys`, which are both lists of `int`s. +- ** The function must return a list**. In the example above, the function returns + a list of sums. +- ** The lengths of all the input lists and the output list must be the same. ** + In the example above, if `L == len(xs) == len(ys)`, then `L == len(batch_add(xs, ys))`. + +## Modal `Cls` methods are compatible with dynamic batching + +Methods on Modal [`Cls`](https://modal.com/docs/guide/lifecycle-functions)es also support dynamic batching. + +```python +import modal + +app = modal.App() + +@app.cls() +class BatchedClass(): + @modal.batched(max_batch_size=2, wait_ms=1000) + async def batch_add(self, xs: list[int], ys: list[int]) -> list[int]: + return [x + y for x, y in zip(xs, ys)] +``` + +One additional rule applies to classes with Batched Methods: + +- If a class has a Batched Method, it **cannot have other Batched Methods or [Methods](https://modal.com/docs/reference/modal.method#modalmethod)**. + +## Configure the wait time and batch size of dynamic batches + +The `@batched` decorator takes in two required configuration parameters: + +- `max_batch_size` limits the number of inputs combined into a single batch. +- `wait_ms` limits the amount of time the Function waits for more inputs after + the first input is received. + +The first invocation of the Batched Function initiates a new batch, and subsequent +calls add requests to this ongoing batch. If `max_batch_size` is reached, +the batch immediately executes. If the `max_batch_size` is not met but `wait_ms` +has passed since the first request was added to the batch, the unfilled batch is +executed. + +### Selecting a batch configuration + +To optimize the batching configurations for your application, consider the following heuristics: + +- Set `max_batch_size` to the largest value your function can handle, so you + can amortize and parallelize as much work as possible. + +- Set `wait_ms` to the difference between your targeted latency and the execution time. Most applications + have a targeted latency, and this allows the latency of any request to stay + within that limit. 
+ +## Serve web endpoints with dynamic batching + +Here's a simple example of serving a Function that batches requests dynamically +with a [`@modal.fastapi_endpoint`](https://modal.com/docs/guide/webhooks). Run +[`modal serve`](https://modal.com/docs/reference/cli/serve), submit requests to the endpoint, +and the Function will batch your requests on the fly. + +```python +import modal + +app = modal.App(image=modal.Image.debian_slim().pip_install("fastapi")) + +@app.function() +@modal.batched(max_batch_size=2, wait_ms=1000) +async def batch_add(xs: list[int], ys: list[int]) -> list[int]: + return [x + y for x, y in zip(xs, ys)] + +@app.function() +@modal.fastapi_endpoint(method="POST", docs=True) +async def add(body: dict[str, int]) -> dict[str, int]: + result = await batch_add.remote.aio(body["x"], body["y"]) + return {"result": result} +``` + +Now, you can submit requests to the web endpoint and process them in batches. For instance, the three requests +in the following example, which might be requests from concurrent clients in a real deployment, +will be batched into two executions: + +```python notest +import asyncio +import aiohttp + +async def send_post_request(session, url, data): + async with session.post(url, json=data) as response: + return await response.json() + +async def main(): + # Enter the URL of your web endpoint here + url = "https://workspace--app-name-endpoint-name.modal.run" + + async with aiohttp.ClientSession() as session: + # Submit three requests asynchronously + tasks = [ + send_post_request(session, url, {"x": 1, "y": 300}), + send_post_request(session, url, {"x": 2, "y": 200}), + send_post_request(session, url, {"x": 3, "y": 100}), + ] + results = await asyncio.gather(*tasks) + for result in results: + print(f"Sum: {result['result']}") + +asyncio.run(main()) +``` + +#### Multi-node clusters (beta) + +# Multi-node clusters (beta) + +> 🚄 Multi-node clusters with RDMA are in **private beta.** Please contact us via the [Modal Slack](https://modal.com/slack) or support@modal.com to get access. + +Modal supports running a training job across several coordinated containers. Each container can saturate the available GPU devices on its host (aka node) and communicate with peer containers which do the same. By scaling a training job from a single GPU to 16 GPUs you can achieve nearly 16x improvements in training time. + +### Cluster compute capability + +Modal H100 clusters provide: + +- A 50 Gbps [IPv6 private network](https://modal.com/docs/guide/private-networking) for orchestration, dataset downloading, etc. +- A 3,200 Gbps RDMA scale-out network ([RoCE](https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet)). +- Up to 64 H100 SXM devices. +- At least 1 TB of RAM and 4 TB of local NVMe SSD per node. +- Deep burn-in testing. +- Interoperability with all Modal platform functionality ([Volumes](https://modal.com/docs/guide/volumes), [Dicts](https://modal.com/docs/guide/dicts), [Tunnels](https://modal.com/docs/guide/tunnels), etc.). + +The guide will walk you through how the Modal client library enables multi-node training and integrates with `torchrun`. + +### `@clustered` + +Unlike standard Modal Function containers, containers in a multi-node training job must be able to: + +1. Perform fast, direct network communication between each other. +2. Be scheduled together, all or nothing, at the same time. + +The `@clustered` decorator enables this behavior. 
+ +```python +import modal.experimental + +@app.function( + gpu="H100:8", + timeout=60 * 60 * 24, + retries=modal.Retries(initial_delay=0.0, max_retries=10), +) +@modal.experimental.clustered(size=4) +def train_model(): + cluster_info = modal.experimental.get_cluster_info() + + container_rank = cluster_info.rank + world_size = len(cluster_info.container_ips) + main_addr = cluster_info.container_ips[0] + is_main = "(main)" if container_rank == 0 else "" + + print(f"{container_rank=} {is_main} {world_size=} {main_addr=}") + ... +``` + +Applying this decorator under `@app.function` modifies the Function so that remote calls to it are serviced by a multi-node container group. The above configuration creates a group of four containers each having 8 H100 GPU devices, for a total of 32 devices. + +## Scheduling + +A `modal.experimental.clustered` Function runs on multiple nodes in our cloud, but executes like a normal function call. For example, all nodes are scheduled together ([gang scheduling](https://en.wikipedia.org/wiki/Gang_scheduling)) so that your code runs on all of the requested hardware or not at all. + +Traditionally this kind of cluster and scheduling management would be handled by SLURM, Kubernetes, or manually. But with Modal it's all provided serverlessly with just a Python decorator! + +### Rank & input broadcast + +![diagram](https://modal-cdn.com/cdnbot/multinodepmgnla70_4b57a155.webp) + +You may notice above that a single `.remote` Function call created three input executions but returned only one output. This is how input-output is structured for multi-node training jobs on Modal. The Function call’s arguments are replicated to each container, but only the rank zero container’s is returned to the caller. + +A container’s rank is a key concept in multi-node jobs. Rank zero is the 'leader' rank and typically coordinates the job. Rank zero is also known as the "main" container. Rank zero's output will always be the output of a multi-node training run. + +## Networking + +Function containers cannot normally make direct network connections to other Function containers, but this is a requirement for multi-node training communication. So, along with gang scheduling, the `@clustered` decorator enables Modal’s workspace-private inter-container networking called [i6pn](https://www.notion.so/Multi-node-docs-1281e7f16949806f966adedfe8b2cb74?pvs=21). + +The [cluster networking guide](https://modal.com/docs/guide/private-networking) goes into more detail on i6pn, but the upshot is that each container in the cluster is made aware of the network address of all the other containers in the cluster, enabling them to communicate with each other quickly via [TCP](https://pytorch.org/docs/stable/elastic/rendezvous.html). + +### RDMA (Infiniband) + +H100 clusters are equipped with Infiniband providing up to 3,200 Gbps scale-out bandwidth for inter-node communication. +RDMA scale-out networking is enabled with the `rdma` parameter of `modal.experimental.clustered`. + +```python notest +@modal.experimental.clustered(size=2, rdma=True) +def train(): + ... +``` + +To run a simple Infiniband RDMA performance test see the [this sample code](https://github.com/modal-labs/multinode-training-guide/tree/main/benchmark). + +## Cluster Info + +`modal.experimental.get_cluster_info()` exposes the following information about the cluster: + +- `rank: int` is the current container's order within the cluster, starting from `0`, the leader. 
+- `container_ips: list[str]` contains the IPv6 addresses of each container in the cluster, sorted by rank. +- `container_ipv4_ips: list[str]` contains the IPv4 addresses of each container in the cluster, sorted by rank. + +## Fault Tolerance + +For a clustered Function, failures in inputs and containers are handled differently. + +If an input fails on any container, this failure **is not propagated** to other containers in the cluster. Containers are responsible for detecting and responding to input failures on other containers. + +Only rank 0's output matters: if an input fails on the leader container (rank 0), the input is marked as failed, even if the input succeeds on another container. Similarly, if an input succeeds on the leader container but fails on another container, the input will still be marked as successful. + +If a container in the cluster is preempted, Modal will terminate all remaining containers in the cluster, and retry the input. + +### Input Synchronization + +_**Important:**_ synchronization is not relevant for single training runs, and applies mostly to inference use-cases. + +Modal does not synchronize input execution across containers. Containers are responsible for ensuring that they do not process inputs faster than other containers in their cluster. + +In particular, it is important that the leader container (rank 0) only starts processing the next input after all other containers have finished processing the current input. + +## Examples + +To get hands-on with multi-node training you can jump into the [`multinode-training-guide` repository](https://github.com/modal-labs/multinode-training-guide) or [`modal-examples` repository](https://github.com/modal-labs/modal-examples/tree/main/14_clusters) and `modal run` something! + +- [Simple ‘hello world’ 4 X 1 H100 torch cluster example](https://github.com/modal-labs/modal-examples/blob/main/14_clusters/simple_torch_cluster.py) +- [Infiniband RDMA performance test](https://github.com/modal-labs/multinode-training-guide/tree/main/benchmark) +- [Use 2 x 8 H100s to train a ResNet50 model on the ImageNet dataset](https://github.com/modal-labs/multinode-training-guide/tree/main/resnet50) +- [Speedrun GPT-2 training with modded-nanogpt](https://github.com/modal-labs/multinode-training-guide/tree/main/nanoGPT) + + +### Torchrun Example + +```python +import modal +import modal.experimental + +image = ( + modal.Image.debian_slim(python_version="3.12") + .pip_install("torch~=2.5.1", "numpy~=2.2.1") + .add_local_dir( + "training", remote_path="/root/training" + ) +) +app = modal.App("example-simple-torch-cluster", image=image) + +n_nodes = 4 + +@app.function(gpu=f"H100:8", timeout=60 * 60 * 24) +@modal.experimental.clustered(size=n_nodes, rdma=True) +def launch_torchrun(): + # import the 'torchrun' interface directly. + from torch.distributed.run import parse_args, run + + cluster_info = modal.experimental.get_cluster_info() + + run( + parse_args( + [ + f"--nnodes={n_nodes}", + f"--node-rank={cluster_info.rank}", + f"--master-addr={cluster_info.container_ips[0]}", + "--nproc-per-node=8", + "--master-port=1234", + "training/train.py", + ] + ) + ) +``` + +### Deployment + +#### Apps, Functions, and entrypoints + +# Apps, Functions, and entrypoints + +An [`App`](https://modal.com/docs/reference/modal.App) represents an application running on Modal. It groups one or more Functions for atomic deployment and acts as a shared namespace. All Functions and Clses are associated with an +App. 
+ +A [`Function`](https://modal.com/docs/reference/modal.Function) acts as an independent unit once it is deployed, and [scales up and down](https://modal.com/docs/guide/scale) independently from other Functions. If there are no live inputs to the Function then by default, no containers will run and your account will not be charged for compute resources, even if the App it belongs to is deployed. + +An App can be ephemeral or deployed. You can view a list of all currently running Apps on the [`apps`](https://modal.com/apps) page. + +The code for a Modal App defining two separate Functions might look something like this: + +```python + +import modal + +app = modal.App(name="my-modal-app") + +@app.function() +def f(): + print("Hello world!") + +@app.function() +def g(): + print("Goodbye world!") + +``` + +## Ephemeral Apps + +An ephemeral App is created when you use the +[`modal run`](https://modal.com/docs/reference/cli/run) CLI command, or the +[`app.run`](https://modal.com/docs/reference/modal.App#run) method. This creates a temporary +App that only exists for the duration of your script. + +Ephemeral Apps are stopped automatically when the calling program exits, or when +the server detects that the client is no longer connected. +You can use +[`--detach`](https://modal.com/docs/reference/cli/run) in order to keep an ephemeral App running even +after the client exits. + +By using `app.run` you can run your Modal apps from within your Python scripts: + +```python +def main(): + ... + with app.run(): + some_modal_function.remote() +``` + +By default, running your app in this way won't propagate Modal logs and progress bar messages. To enable output, use the [`modal.enable_output`](https://modal.com/docs/reference/modal.enable_output) context manager: + +```python +def main(): + ... + with modal.enable_output(): + with app.run(): + some_modal_function.remote() +``` + +## Deployed Apps + +A deployed App is created using the [`modal deploy`](https://modal.com/docs/reference/cli/deploy) +CLI command. The App is persisted indefinitely until you stop it via the +[web UI](https://modal.com/apps) or the [`modal app stop`](https://modal.com/docs/reference/cli/app#modal-app-stop) command. Functions in a deployed App that have an attached +[schedule](https://modal.com/docs/guide/cron) will be run on a schedule. Otherwise, you can +invoke them manually using +[web endpoints or Python](https://modal.com/docs/guide/trigger-deployed-functions). + +Deployed Apps are named via the [`App`](https://modal.com/docs/reference/modal.App#modalapp) +constructor. Re-deploying an existing `App` (based on the name) will update it +in place. + +## Entrypoints for ephemeral Apps + +The code that runs first when you `modal run` an App is called the "entrypoint". + +You can register a local entrypoint using the +[`@app.local_entrypoint()`](https://modal.com/docs/reference/modal.App#local_entrypoint) +decorator. You can also use a regular Modal function as an entrypoint, in which +case only the code in global scope is executed locally. + +### Argument parsing + +If your entrypoint function takes arguments with primitive types, `modal run` +automatically parses them as CLI options. 
For example, the following function +can be called with `modal run script.py --foo 1 --bar "hello"`: + +```python +# script.py + +@app.local_entrypoint() +def main(foo: int, bar: str): + some_modal_function.remote(foo, bar) +``` + +If you wish to use your own argument parsing library, such as `argparse`, you can instead accept a variable-length argument list for your entrypoint or your function. In this case, Modal skips CLI parsing and forwards CLI arguments as a tuple of strings. For example, the following function can be invoked with `modal run my_file.py --foo=42 --bar="baz"`: + +```python +import argparse + +@app.function() +def train(*arglist): + parser = argparse.ArgumentParser() + parser.add_argument("--foo", type=int) + parser.add_argument("--bar", type=str) + args = parser.parse_args(args = arglist) +``` + +### Manually specifying an entrypoint + +If there is only one `local_entrypoint` registered, +[`modal run script.py`](https://modal.com/docs/reference/cli/run) will automatically use it. If +you have no entrypoint specified, and just one decorated Modal function, that +will be used as a remote entrypoint instead. Otherwise, you can direct +`modal run` to use a specific entrypoint. + +For example, if you have a function decorated with +[`@app.function()`](https://modal.com/docs/reference/modal.App#function) in your file: + +```python +# script.py + +@app.function() +def f(): + print("Hello world!") + +@app.function() +def g(): + print("Goodbye world!") + +@app.local_entrypoint() +def main(): + f.remote() +``` + +Running [`modal run script.py`](https://modal.com/docs/reference/cli/run) will execute the `main` +function locally, which would call the `f` function remotely. However you can +instead run `modal run script.py::app.f` or `modal run script.py::app.g` to +execute `f` or `g` directly. + +## Apps were once Stubs + +The `modal.App` class in the client was previously called `modal.Stub`. The +old name was kept as an alias for some time, but from Modal 1.0.0 onwards, +using `modal.Stub` will result in an error. + +#### Managing deployments + +# Managing deployments + +Once you've finished using `modal run` or `modal serve` to iterate on your Modal +code, it's time to deploy. A Modal deployment creates and then persists an +application and its objects, providing the following benefits: + +- Repeated application function executions will be grouped under the deployment, + aiding observability and usage tracking. Programmatically triggering lots of + ephemeral App runs can clutter your web and CLI interfaces. +- Function calls are much faster because deployed functions are persistent and + reused, not created on-demand by calls. Learn how to trigger deployed + functions in + [Invoking deployed functions](https://modal.com/docs/guide/trigger-deployed-functions). +- [Scheduled functions](https://modal.com/docs/guide/cron) will continue scheduling separate from + any local iteration you do, and will notify you on failure. +- [Web endpoints](https://modal.com/docs/guide/webhooks) keep running when you close your laptop, + and their URL address matches the deployment name. + +## Creating deployments + +Deployments are created using the +[`modal deploy` command](https://modal.com/docs/reference/cli/app#modal-app-list). + +``` + % modal deploy -m whisper_pod_transcriber.main +✓ Initialized. View app page at https://modal.com/apps/ap-PYc2Tb7JrkskFUI8U5w0KG. +✓ Created objects. +├── 🔨 Created populate_podcast_metadata. 
+├── 🔨 Mounted /home/ubuntu/whisper_pod_transcriber at /root/whisper_pod_transcriber +├── 🔨 Created fastapi_app => https://modal-labs-whisper-pod-transcriber-fastapi-app.modal.run +├── 🔨 Mounted /home/ubuntu/whisper_pod_transcriber/whisper_frontend/dist at /assets +├── 🔨 Created search_podcast. +├── 🔨 Created refresh_index. +├── 🔨 Created transcribe_segment. +├── 🔨 Created transcribe_episode.. +└── 🔨 Created fetch_episodes. +✓ App deployed! 🎉 + +View Deployment: https://modal.com/apps/modal-labs/whisper-pod-transcriber +``` + +Running this command on an existing deployment will redeploy the App, +incrementing its version. For detail on how live deployed apps transition +between versions, see the [Updating deployments](#updating-deployments) section. + +Deployments can also be created programmatically using Modal's +[Python API](https://modal.com/docs/reference/modal.App#deploy). + +## Viewing deployments + +Deployments can be viewed either on the [apps](https://modal.com/apps) web page or by using the +[`modal app list` command](https://modal.com/docs/reference/cli/app#modal-app-list). + +## Updating deployments + +A deployment can deploy a new App or redeploy a new version of an existing +deployed App. It's useful to understand how Modal handles the transition between +versions when an App is redeployed. In general, Modal aims to support +zero-downtime deployments by gradually transitioning traffic to the new version. + +If the deployment involves building new versions of the Images used by the App, +the build process will need to complete succcessfully. The existing version of +the App will continue to handle requests during this time. Errors during the +build will abort the deployment with no change to the status of the App. + +After the build completes, Modal will start to bring up new containers running +the latest version of the App. The existing containers will continue handling +requests (using the previous version of the App) until the new containers have +completed their cold start. + +Once the new containers are ready, old containers will stop accepting new +requests. However, the old containers will continue running any requests they +had previously accepted. The old containers will not terminate until they have +finished processing all ongoing requests. + +Any warm pool containers will also be cycled during a deployment, as the +previous version's warm pool are now outdated. + +## Deployment rollbacks + +To quickly reset an App back to a previous version, you can perform a deployment +_rollback_. Rollbacks can be triggered from either the App dashboard or the CLI. +Rollback deployments look like new deployments: they increment the version number +and are attributed to the user who triggered the rollback. But the App's functions +and metadata will be reset to their previous state independently of your current +App codebase. + +Note that deployment rollbacks are supported only on the Team and Enterprise plans. + +## Stopping deployments + +Deployed apps can be stopped in the web UI by clicking the red "Stop app" button on +the App's "Overview" page, or alternatively from the command line using the +[`modal app stop` command](https://modal.com/docs/reference/cli/app#modal-app-stop). + +Stopping an App is a destructive action. Apps cannot be restarted from this state; +a new App will need to be deployed from the same source files. Objects associated +with stopped deployments will eventually be garbage collected. 
+ +#### Invoking deployed functions + +# Invoking deployed functions + +Modal lets you take a function created by a +[deployment](https://modal.com/docs/guide/managing-deployments) and call it from other contexts. + +There are two ways of invoking deployed functions. If the invoking client is +running Python, then the same +[Modal client library](https://pypi.org/project/modal/) used to write Modal code +can be used. HTTPS is used if the invoking client is not running Python and +therefore cannot import the Modal client library. + +## Invoking with Python + +Some use cases for Python invocation include: + +- An existing Python web server (eg. Django, Flask) wants to invoke Modal + functions. +- You have split your product or system into multiple Modal applications that + deploy independently and call each other. + +### Function lookup and invocation basics + +Let's say you have a script `my_shared_app.py` and this script defines a Modal +app with a function that computes the square of a number: + +```python +import modal + +app = modal.App("my-shared-app") + +@app.function() +def square(x: int): + return x ** 2 +``` + +You can deploy this app to create a persistent deployment: + +``` +% modal deploy shared_app.py +✓ Initialized. +✓ Created objects. +├── 🔨 Created square. +├── 🔨 Mounted /Users/erikbern/modal/shared_app.py. +✓ App deployed! 🎉 + +View Deployment: https://modal.com/apps/erikbern/my-shared-app +``` + +Let's try to run this function from a different context. For instance, let's +fire up the Python interactive interpreter: + +```bash +% python +Python 3.9.5 (default, May 4 2021, 03:29:30) +[Clang 12.0.0 (clang-1200.0.32.27)] on darwin +Type "help", "copyright", "credits" or "license" for more information. +>>> import modal +>>> f = modal.Function.from_name("my-shared-app", "square") +>>> f.remote(42) +1764 +>>> +``` + +This works exactly the same as a regular modal `Function` object. For example, +you can `.map()` over functions invoked this way too: + +```bash +>>> f = modal.Function.from_name("my-shared-app", "square") +>>> f.map([1, 2, 3, 4, 5]) +[1, 4, 9, 16, 25] +``` + +#### Authentication + +The Modal Python SDK will read the token from `~/.modal.toml` which typically is +created using `modal token new`. + +Another method of providing the credentials is to set the environment variables +`MODAL_TOKEN_ID` and `MODAL_TOKEN_SECRET`. If you want to call a Modal function +from a context such as a web server, you can expose these environment variables +to the process. + +#### Lookup of lifecycle functions + +[Lifecycle functions](https://modal.com/docs/guide/lifecycle-functions) are defined on classes, +which you can look up in a different way. Consider this code: + +```python +import modal + +app = modal.App("my-shared-app") + +@app.cls() +class MyLifecycleClass: + @modal.enter() + def enter(self): + self.var = "hello world" + + @modal.method() + def foo(self): + return self.var +``` + +Let's say you deploy this app. You can then call the function by doing this: + +```bash +>>> cls = modal.Cls.from_name("my-shared-app", "MyLifecycleClass") +>>> obj = cls() # You can pass any constructor arguments here +>>> obj.foo.remote() +'hello world' +``` + +### Asynchronous invocation + +In certain contexts, a Modal client will need to trigger Modal functions without +waiting on the result. This is done by spawning functions and receiving a +[`FunctionCall`](https://modal.com/docs/reference/modal.FunctionCall) as a +handle to the triggered execution. 
+ +The following is an example of a Flask web server (running outside Modal) which +accepts model training jobs to be executed within Modal. Instead of the HTTP +POST request waiting on a training job to complete, which would be infeasible, +the relevant Modal function is spawned and the +[`FunctionCall`](https://modal.com/docs/reference/modal.FunctionCall) +object is stored for later polling of execution status. + +```python +from uuid import uuid4 +from flask import Flask, jsonify, request + +app = Flask(__name__) +pending_jobs = {} + +... + +@app.route("/jobs", methods = ["POST"]) +def create_job(): + predict_fn = modal.Function.from_name("example", "train_model") + job_id = str(uuid4()) + function_call = predict_fn.spawn( + job_id=job_id, + params=request.json, + ) + pending_jobs[job_id] = function_call + return { + "job_id": job_id, + "status": "pending", + } +``` + +### Importing a Modal function between Modal apps + +You can also import one function defined in an app from another app: + +```python +import modal + +app = modal.App("another-app") + +square = modal.Function.from_name("my-shared-app", "square") + +@app.function() +def cube(x): + return x * square.remote(x) + +@app.local_entrypoint() +def main(): + assert cube.remote(42) == 74088 +``` + +### Comparison with HTTPS + +Compared with HTTPS invocation, Python invocation has the following benefits: + +- Avoids the need to create web endpoint functions. +- Avoids handling serialization of request and response data between Modal and + your client. +- Uses the Modal client library's built-in authentication. + - Web endpoints are public to the entire internet, whereas function `lookup` + only exposes your code to you (and your org). +- You can work with shared Modal functions as if they are normal Python + functions, which might be more convenient. + +## Invoking with HTTPS + +Any application that can make HTTPS requests can interact with deployed Modal +applications via [web endpoint functions](https://modal.com/docs/guide/webhooks). Note that +all deployed web endpoint functions have [a stable HTTPS +URL](https://modal.com/docs/guide/webhook-urls). + +Some use cases for HTTPS invocation include: + +- Calling Modal functions from a web browser client running JavaScript +- Calling Modal functions from backend services in languages we don't yet have + official SDKs for (Java, Ruby, etc.) +- Calling Modal functions using UNIX tools (`curl`, `wget`) + +However, if the client of your Modal deployment is running Python, JavaScript, +or Go, it's better to use the [Modal Python +SDK](https://pypi.org/project/modal/) or [libmodal SDKs for JavaScript and +Go](https://modal.com/docs/guide/sdk-javascript-go) to invoke your Modal code. + +For more detail on setting up functions for invocation over HTTP see the +[web endpoints guide](https://modal.com/docs/guide/webhooks). + +#### Continuous deployment + +# Continuous deployment + +It's a common pattern to auto-deploy your Modal App as part of a CI/CD pipeline. +To get you started, below is a guide to doing continuous deployment of a Modal +App in GitHub. + +## GitHub Actions + +Here's a sample GitHub Actions workflow that deploys your App on every push to +the `main` branch. + +This requires you to create a [Modal token](https://modal.com/settings/tokens) and add it as a +[secret for your Github Actions workflow](https://docs.github.com/en/actions/how-tos/write-workflows/choose-what-workflows-do/use-secrets). 
+ +After setting up secrets, create a new workflow file in your repository at +`.github/workflows/ci-cd.yml` with the following contents: + +```yaml +name: CI/CD + +on: + push: + branches: + - main + +jobs: + deploy: + name: Deploy + runs-on: ubuntu-latest + env: + MODAL_TOKEN_ID: ${{ secrets.MODAL_TOKEN_ID }} + MODAL_TOKEN_SECRET: ${{ secrets.MODAL_TOKEN_SECRET }} + + steps: + - name: Checkout Repository + uses: actions/checkout@v4 + + - name: Install Python + uses: actions/setup-python@v5 + with: + python-version: "3.10" + + - name: Install Modal + run: | + python -m pip install --upgrade pip + pip install modal + + - name: Deploy job + run: | + modal deploy -m my_package.my_file +``` + +Be sure to replace `my_package.my_file` with your actual entrypoint. + +If you use multiple Modal [Environments](https://modal.com/docs/guide/environments), you can +additionally specify the target environment in the YAML using +`MODAL_ENVIRONMENT=xyz`. + +#### Running untrusted code in Functions + +# Running untrusted code in Functions + +Modal provides two primitives for running untrusted code: Restricted Functions and [Sandboxes](https://modal.com/docs/guide/sandboxes). While both can be used for running untrusted code, they serve different purposes: Sandboxes provide a container-like interface while Restricted Functions provide an interface similar to a traditional Function. + +Restricted Functions are useful for executing: + +- Code generated by language models (LLMs) +- User-submitted code in interactive environments +- Third-party plugins or extensions + +## Using `restrict_modal_access` + +To restrict a Function's access to Modal resources, set `restrict_modal_access=True` on the Function definition: + +```python +import modal + +app = modal.App() + +@app.function(restrict_modal_access=True) +def run_untrusted_code(code_input: str): + # This function cannot access Modal resources + return eval(code_input) +``` + +When `restrict_modal_access` is enabled: + +- The Function cannot access Modal resources (Queues, Dicts, etc.) +- The Function cannot call other Functions +- The Function cannot access Modal's internal APIs + +## Comparison with Sandboxes + +While both `restrict_modal_access` and [Sandboxes](https://modal.com/docs/guide/sandboxes) can be used for running untrusted code, they serve different purposes: + +| Feature | Restricted Function | Sandbox | +| --------- | ------------------------------ | ---------------------------------------------- | +| State | Stateless | Stateful | +| Interface | Function-like | Container-like | +| Setup | Simple decorator | Requires explicit creation/termination | +| Use case | Quick, isolated code execution | Interactive development, long-running sessions | + +## Best Practices + +When running untrusted code, consider these additional security measures: + +1. Use `single_use_containers=True` to ensure each container only handles one request. Containers that get reused could cause information leakage between users. + +```python +@app.function(restrict_modal_access=True, single_use_containers=True) +def isolated_function(input_data): + # Each input gets a fresh container + return process(input_data) +``` + +2. Set appropriate timeouts to prevent long-running operations: + +```python +@app.function( + restrict_modal_access=True, + timeout=30, # 30 second timeout + single_use_containers=True +) +def time_limited_function(input_data): + return process(input_data) +``` + +3. 
Consider using `block_network=True` to prevent the container from making outbound network requests: + +```python +@app.function( + restrict_modal_access=True, + block_network=True, + single_use_containers=True +) +def network_isolated_function(input_data): + return process(input_data) +``` + +4. Minimize the App source that's included in the container + +A restricted Modal Function will have read access to its source files in the +container, so you'll want to avoid including anything that would be harmful +if exfiltrated by the untrusted process. + +If deploying an App from within a [larger package](https://modal.com/docs/guide/project-structure), +the entire package source may be automatically included by default. A best +practice would be to make the untrusted Function part of a standalone App that +includes the minimum necessary files to run: + +```python +restricted_app = modal.App("restricted-app", include_source=False) + +image = ( + modal.Image.debian_slim() + .add_local_file("restricted_executor.py", "/root/restricted_executor.py") +) + +@restricted_app.function( + restrict_modal_access=True, + block_network=True, + single_use_containers=True +) +def isolated_function(input_data): + return process(input_data) +``` + +## Example: Running LLM-generated Code + +Below is a complete example of running code generated by a language model: + +```python +import modal + +app = modal.App("restricted-access-example") + +@app.function(restrict_modal_access=True, single_use_containers=True, timeout=30, block_network=True) +def run_llm_code(generated_code: str): + try: + # Create a restricted environment + execution_scope = {} + + # Execute the generated code + exec(generated_code, execution_scope) + + # Return the result if it exists + return execution_scope.get("result", None) + except Exception as e: + return f"Error executing code: {str(e)}" + +@app.local_entrypoint() +def main(): + # Example LLM-generated code + code = """ +def calculate_fibonacci(n): + if n <= 1: + return n + return calculate_fibonacci(n-1) + calculate_fibonacci(n-2) + +result = calculate_fibonacci(10) + """ + + result = run_llm_code.remote(code) + print(f"Result: {result}") + +``` + +This example locks down the container to ensure that the code is safe to execute by: + +- Restricting Modal access +- Using a fresh container for each execution +- Setting a timeout +- Blocking network access +- Catching and handling potential errors + +## Error Handling + +When a restricted Function attempts to access Modal resources, it will raise an `AuthError`: + +```python +@app.function(restrict_modal_access=True) +def restricted_function(q: modal.Queue): + try: + # This will fail because the Function is restricted + return q.get() + except modal.exception.AuthError as e: + return f"Access denied: {e}" +``` + +The error message will indicate that the operation is not permitted due to restricted Modal access. + +### Modal Sandboxes + +#### Sandboxes + +# Sandboxes + +In addition to the Function interface, Modal has a direct +interface for defining containers _at runtime_ and securely running arbitrary code +inside them. + +This can be useful if, for example, you want to: + +- Execute code generated by a language model. +- Create isolated environments for running untrusted code. +- Check out a git repository and run a command against it, like a test suite, or + `npm lint`. +- Run containers with arbitrary dependencies and setup scripts. 
+ +Each individual job is called a **Sandbox** and can be created using the +[`Sandbox.create`](https://modal.com/docs/reference/modal.Sandbox#create) constructor: + + {#snippet python()} + +```python notest +import modal + +app = modal.App.lookup("my-app", create_if_missing=True) + +sb = modal.Sandbox.create(app=app) + +p = sb.exec("python", "-c", "print('hello')", timeout=3) +print(p.stdout.read()) + +p = sb.exec("bash", "-c", "for i in {1..10}; do date +%T; sleep 0.5; done", timeout=5) +for line in p.stdout: + # Avoid double newlines by using end="". + print(line, end="") + +sb.terminate() +``` + +{/snippet} + +{#snippet javascript()} + +```javascript notest + +const modal = new ModalClient(); +const app = await modal.apps.fromName("my-app", { + createIfMissing: true, +}); +const image = modal.images.fromRegistry("python:3.13-slim"); + +const sb = await modal.sandboxes.create(app, image); + +const p = await sb.exec(["python", "-c", "print('hello')"], { + timeout: 3 * 1000, +}); +console.log(await p.stdout.readText()); + +const p2 = await sb.exec( + ["bash", "-c", "for i in {1..10}; do date +%T; sleep 0.5; done"], + { timeout: 5 * 1000 }, +); +for await (const line of p2.stdout) { + process.stdout.write(line); +} + +await sb.terminate(); +``` + +{/snippet} + +{#snippet go()} + +```go notest +package main + +import ( + "context" + "fmt" + "io" + "os" + "time" + + "github.com/modal-labs/libmodal/modal-go" +) + +func main() { + ctx := context.Background() + mc, _ := modal.NewClient() + + app, _ := mc.Apps.FromName(ctx, "my-app", &modal.AppFromNameParams{ + CreateIfMissing: true, + }) + image := mc.Images.FromRegistry("python:3.13-slim", nil) + + sb, _ := mc.Sandboxes.Create(ctx, app, image, nil) + defer sb.Terminate(context.Background()) + + p, _ := sb.Exec(ctx, []string{"python", "-c", "print('hello')"}, &modal.SandboxExecParams{ + Timeout: 3 * time.Second, + }) + stdout, _ := io.ReadAll(p.Stdout) + fmt.Println(string(stdout)) + + p2, _ := sb.Exec(ctx, []string{"bash", "-c", "for i in {1..10}; do date +%T; sleep 0.5; done"}, &modal.SandboxExecParams{ + Timeout: 5 * time.Second, + }) + io.Copy(os.Stdout, p2.Stdout) +} +``` + +{/snippet} + + +**Note:** you can run the above example as a script directly with `python my_script.py`. `modal run` is not needed here since there is no [entrypoint](https://modal.com/docs/guide/apps#entrypoints-for-ephemeral-apps). + +Sandboxes require an [`App`](https://modal.com/docs/guide/apps) to be passed when spawned from outside +of a Modal container. You may pass in a regular `App` object or look one up by name with +[`App.lookup`](https://modal.com/docs/reference/modal.App#lookup). The `create_if_missing` flag on `App.lookup` +will create an `App` with the given name if it doesn't exist. + +## Lifecycle + +### Timeouts + +Sandboxes have a default maximum lifetime of 5 minutes. You can change this by passing +a `timeout` of up to 24 hours to the `Sandbox.create(...)` function. 
+ + {#snippet python()} + +```python notest +sb = modal.Sandbox.create(app=my_app, timeout=10*60) # 10 minutes +``` + +{/snippet} + +{#snippet javascript()} + +```javascript notest +const sb = await modal.sandboxes.create(app, image, { + timeout: 10 * 60 * 1000, // 10 minutes +}); +``` + +{/snippet} + +{#snippet go()} + +```go notest +sb, err := mc.Sandboxes.Create(ctx, app, image, &modal.SandboxCreateParams{ + Timeout: 10 * time.Minute, +}) +``` + +{/snippet} + + +If you need a Sandbox to run for more than 24 hours, we recommend using +[Filesystem Snapshots](https://modal.com/docs/guide/sandbox-snapshots) to preserve its state, +and then restore from that snapshot with a subsequent Sandbox. + +### Idle Timeouts + +Sandboxes can also be automatically terminated after a period of inactivity - you can do this by setting the `idle_timeout` parameter. A Sandbox is considered active if any of the following are true: + +1. It has an active [command](https://modal.com/docs/guide/sandbox-spawn) running (via [`sb.exec(...)`](https://modal.com/docs/reference/modal.Sandbox#exec)) +2. Its stdin is being written to (via [`sb.stdin.write()`](https://modal.com/docs/reference/modal.Sandbox#stdin)) +3. It has an open TCP connection over one of its [Tunnels](https://modal.com/docs/guide/tunnels) + +## Configuration + +Sandboxes support nearly all configuration options found in regular `modal.Function`s. +Refer to [`Sandbox.create`](https://modal.com/docs/reference/modal.Sandbox#create) for further documentation +on Sandbox configs. + +For example, Images and Volumes can be used just as with functions: + + {#snippet python()} + +```python notest +sb = modal.Sandbox.create( + image=modal.Image.debian_slim().pip_install("pandas"), + volumes={"/data": modal.Volume.from_name("my-volume")}, + workdir="/repo", + app=my_app, +) +``` + +{/snippet} + +{#snippet javascript()} + +```javascript notest +const image = modal.images.fromRegistry("python:3.13-slim"); +const volume = modal.volumes.fromName("my-volume"); +const sb = await modal.sandboxes.create(app, image, { + volumes: { "/data": volume }, + workdir: "/repo", +}); +``` + +{/snippet} + +{#snippet go()} + +```go notest +image := mc.Images.FromRegistry("python:3.13-slim", nil) +volume := mc.Volumes.FromName("my-volume", nil) +sb, err := mc.Sandboxes.Create(ctx, app, image, &modal.SandboxCreateParams{ + Volumes: map[string]*modal.Volume{"/data": volume}, + Workdir: "/repo", +}) +``` + +{/snippet} + + +## Environments + +### Environment variables + +You can set environment variables using inline secrets: + + {#snippet python()} + +```python notest +secret = modal.Secret.from_dict({"MY_SECRET": "hello"}) + +sb = modal.Sandbox.create( + secrets=[secret], + app=my_app, +) +p = sb.exec("bash", "-c", "echo $MY_SECRET") +print(p.stdout.read()) +``` + +{/snippet} + +{#snippet javascript()} + +```javascript notest +const secret = modal.secrets.fromObject({ MY_SECRET: "hello" }); +const image = modal.images.fromRegistry("python:3.13-slim"); + +const sb = await modal.sandboxes.create(app, image, { + secrets: [secret], +}); +const p = await sb.exec(["bash", "-c", "echo $MY_SECRET"]); +console.log(await p.stdout.readText()); +``` + +{/snippet} + +{#snippet go()} + +```go notest +secret, err := mc.Secrets.FromMap(ctx, map[string]string{"MY_SECRET": "hello"}, nil) +image := mc.Images.FromRegistry("python:3.13-slim", nil) + +sb, err := mc.Sandboxes.Create(ctx, app, image, &modal.SandboxCreateParams{ + Secrets: []*modal.Secret{secret}, +}) +p, err := sb.Exec(ctx, []string{"bash", 
"-c", "echo $MY_SECRET"}, nil) +stdout, err := io.ReadAll(p.Stdout) +fmt.Println(string(stdout)) +``` + +{/snippet} + + +### Custom Images + +Sandboxes support custom images just as Functions do. However, while you'll typically +invoke a Modal Function with the `modal run` cli, you typically spawn a Sandbox +with a simple script call. As such, you may need to manually enable output streaming +to see your image build logs: + + {#snippet python()} + +```python notest +image = modal.Image.debian_slim().pip_install("pandas", "numpy") + +with modal.enable_output(): + sb = modal.Sandbox.create(image=image, app=my_app) +``` + +{/snippet} + +{#snippet javascript()} + +```javascript notest +const image = modal.images + .fromRegistry("python:3.13-slim") + .dockerfileCommands(["RUN pip install pandas numpy"]); + +const sb = await modal.sandboxes.create(app, image); +``` + +{/snippet} + +{#snippet go()} + +```go notest +image := mc.Images.FromRegistry("python:3.13-slim", nil). + DockerfileCommands([]string{"RUN pip install pandas numpy"}, nil) + +// Note: Image build logs are automatically streamed in Go +sb, err := mc.Sandboxes.Create(ctx, app, image, nil) +``` + +{/snippet} + + +### Dynamically defined environments + +Note that any valid `Image` or `Mount` can be used with a Sandbox, even if those +images or mounts have not previously been defined. This also means that Images and +Mounts can be built from requirements at **runtime**. For example, you could +use a language model to write some code and define your image, and then spawn a +Sandbox with it. Check out [devlooper](https://github.com/modal-labs/devlooper) +for a concrete example of this. + +## Running a Sandbox with an entrypoint + +In most cases, Sandboxes are treated as a generic container that can run arbitrary +commands. However, in some cases, you may want to run a single command or script +as the entrypoint of the Sandbox. You can do this by passing command arguments to the +Sandbox constructor: + + {#snippet python()} + +```python notest +sb = modal.Sandbox.create("python", "-m", "http.server", "8080", app=my_app, timeout=10) +for line in sb.stdout: + print(line, end="") +``` + +{/snippet} + +{#snippet javascript()} + +```javascript notest +const sb = await modal.sandboxes.create(app, image, { + entrypoint: ["python", "-m", "http.server", "8080"], + timeout: 10 * 1000, +}); +``` + +{/snippet} + +{#snippet go()} + +```go notest +sb, err := mc.Sandboxes.Create(ctx, app, image, &modal.SandboxCreateParams{ + Entrypoint: []string{"python", "-m", "http.server", "8080"}, + Timeout: 10 * time.Second, +}) +``` + +{/snippet} + + +This functionality is most useful for running long-lived services that you want +to keep running in the background. See our [Jupyter notebook example](https://modal.com/docs/examples/jupyter_sandbox) +for a more concrete example of this. + +## Referencing Sandboxes from other code + +If you have a running Sandbox, you can retrieve it using the `from_id` method. + + {#snippet python()} + +```python notest +sb = modal.Sandbox.create(app=my_app) +sb_id = sb.object_id + +# ... later in the program ... + +sb2 = modal.Sandbox.from_id(sb_id) + +p = sb2.exec("echo", "hello") +print(p.stdout.read()) +sb2.terminate() +``` + +{/snippet} + +{#snippet javascript()} + +```javascript notest +const sb = await modal.sandboxes.create(app, image); +const sbId = sb.sandboxId; + +// ... later in the program ... 
+ +const sb2 = await modal.sandboxes.fromId(sbId); + +const p = await sb2.exec(["echo", "hello"]); +console.log(await p.stdout.readText()); +await sb2.terminate(); +``` + +{/snippet} + +{#snippet go()} + +```go notest +sb, err := mc.Sandboxes.Create(ctx, app, image, nil) +sbId := sb.SandboxID + +// ... later in the program ... + +sb2, err := mc.Sandboxes.FromID(ctx, sbId) + +p, err := sb2.Exec(ctx, []string{"echo", "hello"}, nil) +stdout, err := io.ReadAll(p.Stdout) +fmt.Println(string(stdout)) +sb2.Terminate(ctx) +``` + +{/snippet} + + +A common use case for this is keeping a pool of Sandboxes available for executing tasks +as they come in. You can keep a list of `object_id`s of Sandboxes that are "open" and +reuse them, closing over the `object_id` in whatever function is using them. + +## Logging + +You can see Sandbox execution logs using the `verbose` option. For example: + + {#snippet python()} + +```python notest +sb = modal.Sandbox.create(app=my_app, verbose=True) + +p = sb.exec("python", "-c", "print('hello')") +print(p.stdout.read()) + +with sb.open("test.txt", "w") as f: + f.write("Hello World\n") +``` + +{/snippet} + +{#snippet javascript()} + +```javascript notest +const sb = await modal.sandboxes.create(app, image, { verbose: true }); +``` + +{/snippet} + +{#snippet go()} + +```go notest +sb, err := mc.Sandboxes.Create(ctx, app, image, &modal.SandboxCreateParams{ + Verbose: true, +}) +``` + +{/snippet} + + +shows Sandbox logs: + +``` +Sandbox exec started: python -c print('hello') +Opened file 'test.txt': fd-yErSQzGL9sig6WAjyNgTPR +Wrote to file: fd-yErSQzGL9sig6WAjyNgTPR +Closed file: fd-yErSQzGL9sig6WAjyNgTPR +``` + +## Named Sandboxes + +You can assign a name to a Sandbox when creating it. Each name must be unique within an app - +only one _running_ Sandbox can use a given name at a time. Note that the associated app must be +a deployed app. Once a Sandbox completely stops running, its name becomes available for reuse. +Some applications find Sandbox Names to be useful for ensuring that no more than one Sandbox is +running per resource or project. If a Sandbox with the given name is already running, `create()` +will raise an error. + + {#snippet python()} + +```python notest +sb1 = modal.Sandbox.create(app=my_app, name="my-name") +# This will raise a modal.exception.AlreadyExistsError. +sb2 = modal.Sandbox.create(app=my_app, name="my-name") +``` + +{/snippet} + +{#snippet javascript()} + +```javascript notest +const sb1 = await modal.sandboxes.create(app, image, { name: "my-name" }); +// this will raise an AlreadyExistsError +const sb2 = await modal.sandboxes.create(app, image, { name: "my-name" }); +``` + +{/snippet} + +{#snippet go()} + +```go notest +sb1, err := mc.Sandboxes.Create(ctx, app, image, &modal.SandboxCreateParams{ + Name: "my-name", +}) +// this will return an error +sb2, err := mc.Sandboxes.Create(ctx, app, image, &modal.SandboxCreateParams{ + Name: "my-name", +}) +``` + +{/snippet} + + +A named Sandbox may be fetched from a deployed app using `from_name()` _but only +if the Sandbox is currently running_. If no running Sandbox is found, `from_name()` will raise +an error. + + {#snippet python()} + +```python notest +my_app = modal.App.lookup("my-app", create_if_missing=True) +sb1 = modal.Sandbox.create(app=my_app, name="my-name") +# Returns the currently running Sandbox with the name "my-name" from the +# deployed app named "my-app". 
sb2 = modal.Sandbox.from_name("my-app", "my-name")
assert sb1.object_id == sb2.object_id  # sb1 and sb2 refer to the same Sandbox
```

{/snippet}

{#snippet javascript()}

```javascript notest
const app = await modal.apps.fromName("my-app", { createIfMissing: true });
const sb1 = await modal.sandboxes.create(app, image, { name: "my-name" });
// returns the currently running Sandbox with the name "my-name" from the
// deployed app named "my-app".
const sb2 = await modal.sandboxes.fromName("my-app", "my-name");
console.assert(sb1.sandboxId === sb2.sandboxId); // sb1 and sb2 refer to the same Sandbox
```

{/snippet}

{#snippet go()}

```go notest
app, err := mc.Apps.FromName(ctx, "my-app", &modal.AppFromNameParams{
	CreateIfMissing: true,
})
sb1, err := mc.Sandboxes.Create(ctx, app, image, &modal.SandboxCreateParams{
	Name: "my-name",
})
// returns the currently running Sandbox with the name "my-name" from the
// deployed app named "my-app".
sb2, err := mc.Sandboxes.FromName(ctx, "my-app", "my-name", nil)
// sb1 and sb2 refer to the same Sandbox
fmt.Println(sb1.SandboxID == sb2.SandboxID)
```

{/snippet}


Sandbox Names may contain only alphanumeric characters, dashes, periods, and underscores, and must
be shorter than 64 characters.

## Tagging

Sandboxes can also be tagged with arbitrary key-value pairs. These tags can be used
to filter results in `Sandbox.list`.

 {#snippet python()}

```python notest
sandbox_v1_1 = modal.Sandbox.create("sleep", "10", app=my_app)
sandbox_v1_2 = modal.Sandbox.create("sleep", "20", app=my_app)

sandbox_v1_1.set_tags({"major_version": "1", "minor_version": "1"})
sandbox_v1_2.set_tags({"major_version": "1", "minor_version": "2"})

for sandbox in modal.Sandbox.list(app_id=my_app.app_id):  # All sandboxes.
    print(sandbox.object_id)

for sandbox in modal.Sandbox.list(
    app_id=my_app.app_id,
    tags={"major_version": "1"},
):  # Also all sandboxes.
    print(sandbox.object_id)

for sandbox in modal.Sandbox.list(
    app_id=my_app.app_id,
    tags={"major_version": "1", "minor_version": "2"},
):  # Just the latest sandbox.
    print(sandbox.object_id)
```

{/snippet}

{#snippet javascript()}

```javascript notest
const sandboxV1_1 = await modal.sandboxes.create(app, image, {
  command: ["sleep", "10"],
});
const sandboxV1_2 = await modal.sandboxes.create(app, image, {
  command: ["sleep", "20"],
});

await sandboxV1_1.setTags({ major_version: "1", minor_version: "1" });
await sandboxV1_2.setTags({ major_version: "1", minor_version: "2" });

// All sandboxes.
for await (const sandbox of modal.sandboxes.list({ appId: app.appId })) {
  console.log(sandbox.sandboxId);
}

// Also all sandboxes.
for await (const sandbox of modal.sandboxes.list({
  appId: app.appId,
  tags: { major_version: "1" },
})) {
  console.log(sandbox.sandboxId);
}

// Just the latest sandbox.
+for await (const sandbox of modal.sandboxes.list({ + appId: app.appId, + tags: { major_version: "1", minor_version: "2" }, +})) { + console.log(sandbox.sandboxId); +} +``` + +{/snippet} + +{#snippet go()} + +```go notest +sandboxV1_1, err := mc.Sandboxes.Create(ctx, app, image, &modal.SandboxCreateParams{ + Command: []string{"sleep", "10"}, +}) +sandboxV1_2, err := mc.Sandboxes.Create(ctx, app, image, &modal.SandboxCreateParams{ + Command: []string{"sleep", "20"}, +}) + +sandboxV1_1.SetTags(ctx, map[string]string{"major_version": "1", "minor_version": "1"}) +sandboxV1_2.SetTags(ctx, map[string]string{"major_version": "1", "minor_version": "2"}) + +// All sandboxes. +it, _ := mc.Sandboxes.List(ctx, &modal.SandboxListParams{ + AppID: app.AppID, +}) +for sandbox := range it { + fmt.Println(sandbox.SandboxID) +} + +// Also all sandboxes. +it, _ = mc.Sandboxes.List(ctx, &modal.SandboxListParams{ + AppID: app.AppID, + Tags: map[string]string{"major_version": "1"}, +}) +for sandbox := range it { + fmt.Println(sandbox.SandboxID) +} + +// Just the latest sandbox. +it, _ = mc.Sandboxes.List(ctx, &modal.SandboxListParams{ + AppID: app.AppID, + Tags: map[string]string{"major_version": "1", "minor_version": "2"}, +}) +for sandbox := range it { + fmt.Println(sandbox.SandboxID) +} +``` + +{/snippet} + + +#### Running commands + +# Running commands in Sandboxes + +Once you have created a Sandbox, you can run commands inside it using the +[`Sandbox.exec`](https://modal.com/docs/reference/modal.Sandbox#exec) method. + +```python notest +sb = modal.Sandbox.create(app=my_app) + +process = sb.exec("echo", "hello", timeout=3) +print(process.stdout.read()) + +process = sb.exec("python", "-c", "print(1 + 1)", timeout=3) +print(process.stdout.read()) + +process = sb.exec( + "bash", + "-c", + "for i in $(seq 1 10); do echo foo $i; sleep 0.1; done", + timeout=5, +) +for line in process.stdout: + print(line, end="") + +sb.terminate() +``` + +`Sandbox.exec` returns a [`ContainerProcess`](https://modal.com/docs/reference/modal.container_process#modalcontainer_processcontainerprocess) +object, which allows access to the process's `stdout`, `stderr`, and `stdin`. +The `timeout` parameter ensures that the `exec` command will run for at most +`timeout` seconds. + +## Input + +The Sandbox and ContainerProcess `stdin` handles are [`StreamWriter`](https://modal.com/docs/reference/modal.io_streams#modalio_streamsstreamwriter) +objects. This object supports flushing writes with both synchronous and asynchronous APIs: + +```python notest +import asyncio + +sb = modal.Sandbox.create(app=my_app) + +p = sb.exec("bash", "-c", "while read line; do echo $line; done") +p.stdin.write(b"foo bar\n") +p.stdin.write_eof() +p.stdin.drain() +p.wait() +sb.terminate() + +async def run_async(): + sb = await modal.Sandbox.create.aio(app=my_app) + p = await sb.exec.aio("bash", "-c", "while read line; do echo $line; done") + p.stdin.write(b"foo bar\n") + p.stdin.write_eof() + await p.stdin.drain.aio() + await p.wait.aio() + await sb.terminate.aio() + +asyncio.run(run_async()) +``` + +## Output + +The Sandbox and ContainerProcess `stdout` and `stderr` handles are [`StreamReader`](https://modal.com/docs/reference/modal.io_streams#modalio_streamsstreamreader) +objects. These objects support reading from the stream in both synchronous and asynchronous manners. +These handles also respect the timeout given to `Sandbox.exec`. 
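
For instance, `stderr` offers the same interface as `stdout`. A minimal sketch, reusing the `my_app` placeholder from the snippets above:

```python notest
sb = modal.Sandbox.create(app=my_app)

# stderr is a StreamReader too, so you can iterate over it just like stdout;
# iteration ends when the process exits or its exec timeout expires.
p = sb.exec("bash", "-c", "echo oops >&2", timeout=5)
for line in p.stderr:
    print(line, end="")

sb.terminate()
```
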
+ +To read from a stream after the underlying process has finished, you can use the `read` +method, which blocks until the process finishes and returns the entire output stream. + +```python notest +sb = modal.Sandbox.create(app=my_app) +p = sb.exec("echo", "hello") +print(p.stdout.read()) +sb.terminate() +``` + +To stream output, take advantage of the fact that `stdout` and `stderr` are +iterable: + +```python notest +import asyncio + +sb = modal.Sandbox.create(app=my_app) + +p = sb.exec("bash", "-c", "for i in $(seq 1 10); do echo foo $i; sleep 0.1; done") + +for line in p.stdout: + # Lines preserve the trailing newline character, so use end="" to avoid double newlines. + print(line, end="") +p.wait() +sb.terminate() + +async def run_async(): + sb = await modal.Sandbox.create.aio(app=my_app) + p = await sb.exec.aio("bash", "-c", "for i in $(seq 1 10); do echo foo $i; sleep 0.1; done") + async for line in p.stdout: + # Avoid double newlines by using end="". + print(line, end="") + await p.wait.aio() + await sb.terminate.aio() + +asyncio.run(run_async()) +``` + +### Stream types + +By default, all streams are buffered in memory, waiting to be consumed by the +client. You can control this behavior with the `stdout` and `stderr` parameters. +These parameters are conceptually similar to the `stdout` and `stderr` +parameters of the [`subprocess`](https://docs.python.org/3/library/subprocess.html#subprocess.DEVNULL) module. + +```python notest +from modal.stream_type import StreamType + +sb = modal.Sandbox.create(app=my_app) + +# Default behavior: buffered in memory. +p = sb.exec( + "bash", + "-c", + "echo foo; echo bar >&2", + stdout=StreamType.PIPE, + stderr=StreamType.PIPE, +) +print(p.stdout.read()) +print(p.stderr.read()) + +# Print the stream to STDOUT as it comes in. +p = sb.exec( + "bash", + "-c", + "echo foo; echo bar >&2", + stdout=StreamType.STDOUT, + stderr=StreamType.STDOUT, +) +p.wait() + +# Discard all output. +p = sb.exec( + "bash", + "-c", + "echo foo; echo bar >&2", + stdout=StreamType.DEVNULL, + stderr=StreamType.DEVNULL, +) +p.wait() + +sb.terminate() +``` + +#### Networking and security + +# Networking and security + +Sandboxes are built to be secure-by-default, meaning that a default Sandbox has +no ability to accept incoming network connections or access your Modal resources. + +## Networking + +Since Sandboxes may run untrusted code, they have options to restrict their network access. +To block all network access, set `block_network=True` on [`Sandbox.create`](https://modal.com/docs/reference/modal.Sandbox#create). + +For more fine-grained networking control, a Sandbox's outbound network access +can be restricted using the `cidr_allowlist` parameter. This parameter takes a +list of CIDR ranges that the Sandbox is allowed to access, blocking all other +outbound traffic. + +### Connecting to Sandboxes with HTTP and WebSockets + +You can make authenticated HTTP and WebSocket requests to a Sandbox by generating +Sandbox Connect Tokens. They work like this: + +```python notest +# Start a Sandbox with a server running on port 8080. +sb = modal.Sandbox.create( + "bash", "-c", "python3 -m http.server 8080", + app=my_app, +) + +# Create a connect token, optionally including arbitrary user metadata. +creds = sb.create_connect_token(user_metadata={"user_id": "foo"}) + +# Make an HTTP request, passing the token in the Authorization header. 
+requests.get(creds.url, headers={"Authorization": f"Bearer {creds.token}"}) + +# You can also put the token in a `_modal_connect_token` query param. +url = f"{creds.url}/?_modal_connect_token={creds.token}" +ws_url = url.replace("https://", "wss://") +with websockets.connect(ws_url) as socket: + socket.send("Hello world!") +``` + +The server running on port 8080 in the container will receive an authenticated +request with an unspoofable `X-Verified-User-Data` header whose value is the +JSON-serialized Python dict that was passed as `user_metadata` to the +`create_connect_token()` function. This can be used by the application to +determine access control, for example. + +There are a few things to remember with Sandbox Connect Tokens: + +1. The server inside the container must be listening on port 8080. +2. The token may be sent in an `Authorization` header, in a `_modal_connect_token` + query param, or in a `_modal_connect_token` cookie. +3. If `_modal_connect_token` is set as a query param, the resulting response will + include a `Set-Cookie` header that sets it as a cookie. +4. The `user_metadata` must be JSON-serializable and must be less than 512 + characters after serialization. + +### Forwarding ports + +While it is recommended to use [Sandbox Connect Tokens](#connecting-to-sandboxes-with-http-and-websockets) +for HTTP requests and WebSocket connections to the container, you can also expose +raw TCP ports to the internet. This is useful if, for example, you want to run a +server inside the Sandbox that expects a raw TCP connection and handles +authentication itself. + +Use the `encrypted_ports` and `unencrypted_ports` parameters of `Sandbox.create` +to specify which ports to forward. You can then access the public URL of a tunnel +using the [`Sandbox.tunnels`](https://modal.com/docs/reference/modal.Sandbox#tunnels) method: + +```python notest +import requests +import time + +sb = modal.Sandbox.create( + "python", + "-m", + "http.server", + "12345", + encrypted_ports=[12345], + app=my_app, +) + +tunnel = sb.tunnels()[12345] + +time.sleep(1) # Wait for server to start. + +print(f"Connecting to {tunnel.url}...") +print(requests.get(tunnel.url, timeout=5).text) +``` + +It is also possible to create an encrypted port that uses `HTTP/2` rather than `HTTP/1.1` with the `h2_ports` option. This will return +a URL that you can make H2 (HTTP/2 + TLS) requests to. If you want to run an `HTTP/2` server inside a sandbox, this feature may be useful. +Here is an example: + +```python notest +import time + +port = 4359 +sb = modal.Sandbox.create( + app=my_app, + image=my_image, + h2_ports = [port], +) +p = sb.exec("python", "my_http2_server.py") + +tunnel = sb.tunnels()[port] +time.sleep(1) +print(f"Tunnel URL: {tunnel.url}") +``` + +For more details on how tunnels work, see the [tunnels guide](https://modal.com/docs/guide/tunnels). + +## Security model + +Sandboxes are built on top of [gVisor](https://gvisor.dev/), a container runtime +by Google that provides strong isolation properties. gVisor has custom logic to +prevent Sandboxes from making malicious system calls, giving you stronger isolation +than standard [runc](https://github.com/opencontainers/runc) containers. + +Additionally, Sandboxes are not authorized to access other resources in your Modal +workspace the way that Modal Functions are [by default](https://modal.com/docs/guide/restricted-access). +As a result, the blast radius of any malicious code will be limited to the Sandbox +container itself. 
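
For reference, here is a minimal sketch of the `cidr_allowlist` option described above; the address range and the `my_app` placeholder are illustrative:

```python notest
# Allow outbound connections only to addresses in 10.0.0.0/8;
# all other egress is blocked.
sb = modal.Sandbox.create(
    app=my_app,
    cidr_allowlist=["10.0.0.0/8"],
)

# A request to a public address outside the allowlist should now fail.
p = sb.exec(
    "python",
    "-c",
    "import urllib.request; urllib.request.urlopen('https://example.com', timeout=3)",
    timeout=10,
)
p.wait()
print(p.stderr.read())

sb.terminate()
```
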

#### File access

# Filesystem Access

There are multiple options for uploading files to a Sandbox and accessing them
from outside the Sandbox.

## Efficient file syncing

To efficiently upload local files to a Sandbox, you can use the
[`add_local_file`](https://modal.com/docs/reference/modal.Image#add_local_file) and
[`add_local_dir`](https://modal.com/docs/reference/modal.Image#add_local_dir) methods on the
[`Image`](https://modal.com/docs/reference/modal.Image) class:

```python notest
sb = modal.Sandbox.create(
    app=my_app,
    image=modal.Image.debian_slim().add_local_dir(
        local_path="/home/user/my_dir",
        remote_path="/app"
    )
)
p = sb.exec("ls", "/app")
print(p.stdout.read())
p.wait()
```

Alternatively, it's possible to use Modal [Volume](https://modal.com/docs/reference/modal.Volume)s or
[CloudBucketMount](https://modal.com/docs/guide/cloud-bucket-mounts)s. These have the benefit that
files created from inside the Sandbox can easily be accessed outside the
Sandbox.

To efficiently upload files to a Sandbox using a Volume, you can use the
[`batch_upload`](https://modal.com/docs/reference/modal.Volume#batch_upload) method on the
`Volume` class - for instance, using an ephemeral Volume that
will be garbage collected when the App finishes:

```python notest
with modal.Volume.ephemeral() as vol:
    import io
    with vol.batch_upload() as batch:
        batch.put_file("local-path.txt", "/remote-path.txt")
        batch.put_directory("/local/directory/", "/remote/directory")
        batch.put_file(io.BytesIO(b"some data"), "/foobar")

    sb = modal.Sandbox.create(
        volumes={"/cache": vol},
        app=my_app,
    )
    p = sb.exec("cat", "/cache/remote-path.txt")
    print(p.stdout.read())
    p.wait()
    sb.terminate()
```

The caller can also access files created in the Volume from inside the Sandbox, even after the Sandbox is terminated:

```python notest
with modal.Volume.ephemeral() as vol:
    sb = modal.Sandbox.create(
        volumes={"/cache": vol},
        app=my_app,
    )
    p = sb.exec("bash", "-c", "echo foo > /cache/a.txt")
    p.wait()
    sb.terminate()
    sb.wait(raise_on_termination=False)
    for data in vol.read_file("a.txt"):
        print(data)
```

Alternatively, if you want to persist files between Sandbox invocations (useful
if you're building a stateful code interpreter, for example), you can create
a persisted `Volume` with a dynamically assigned label:

```python notest
session_id = "example-session-id-123abc"
vol = modal.Volume.from_name(f"vol-{session_id}", create_if_missing=True)
sb = modal.Sandbox.create(
    volumes={"/cache": vol},
    app=my_app,
)
p = sb.exec("bash", "-c", "echo foo > /cache/a.txt")
p.wait()
sb.terminate()
sb.wait(raise_on_termination=False)
for data in vol.read_file("a.txt"):
    print(data)
```

File syncing behavior differs between Volumes and CloudBucketMounts. For
Volumes, files are only synced back to the Volume when the Sandbox terminates.
For CloudBucketMounts, files are synced automatically.

## Filesystem API (Alpha)

If you're less concerned with efficiency of uploads and want a convenient way
to pass data in and out of the Sandbox during execution, you can use our
filesystem API to easily read and write files. The API supports reading files
of up to 100 MiB and writing files of up to 1 GiB.

This API is currently in Alpha, and we don't recommend using it for production
workloads.
+ +```python +import modal + +app = modal.App.lookup("sandbox-fs-demo", create_if_missing=True) + +sb = modal.Sandbox.create(app=app) + +with sb.open("test.txt", "w") as f: + f.write("Hello World\n") + +f = sb.open("test.txt", "rb") +print(f.read()) +f.close() +``` + +The filesystem API is similar to Python's built-in [io.FileIO](https://docs.python.org/3/library/io.html#io.FileIO) and supports many of the same methods, including `read`, `readline`, `readlines`, `write`, `flush`, `seek`, and `close`. + +We additionally provide commands [`mkdir`](https://modal.com/docs/reference/modal.Sandbox#mkdir), [`rm`](https://modal.com/docs/reference/modal.Sandbox#rm), and [`ls`](https://modal.com/docs/reference/modal.Sandbox#ls) to make interacting with the filesystem more ergonomic. + + + + +#### Snapshots + +# Snapshots + +Sandboxes support snapshotting, allowing you to save your Sandbox's state +and restore it later. This is useful for: + +- Creating custom environments for your Sandboxes to run in +- Backing up your Sandbox's state for debugging +- Running large-scale experiments with the same initial state +- Branching your Sandbox's state to test different code changes independently + +## Filesystem Snapshots + +Filesystem Snapshots are copies of the Sandbox's filesystem at a given point in time. +These Snapshots are [Images](https://modal.com/docs/reference/modal.Image) and can be used to create +new Sandboxes. + +To create a Filesystem Snapshot, you can use the +[`Sandbox.snapshot_filesystem()`](https://modal.com/docs/reference/modal.Sandbox#snapshot_filesystem) method: + +```python notest +import modal + +app = modal.App.lookup("sandbox-fs-snapshot-test", create_if_missing=True) + +sb = modal.Sandbox.create(app=app) +p = sb.exec("bash", "-c", "echo 'test' > /test") +p.wait() +assert p.returncode == 0, "failed to write to file" +image = sb.snapshot_filesystem() +sb.terminate() + +sb2 = modal.Sandbox.create(image=image, app=app) +p2 = sb2.exec("bash", "-c", "cat /test") +assert p2.stdout.read().strip() == "test" +sb2.terminate() +``` + +Filesystem Snapshots are optimized for performance: they are calculated as the difference +from your base image, so only modified files are stored. Restoring a Filesystem Snapshot +utilizes the same infrastructure we use to get fast cold starts for your Sandboxes. + +Filesystem Snapshots will generally persist indefinitely. + +## Memory Snapshots + +[Sandboxes memory snapshots](https://modal.com/docs/guide/sandbox-memory-snapshots) are in early preview. +Contact us if this is something you're interested in! + +### Modal Notebooks + +# Modal Notebooks + +Notebooks allow you to write and execute Python code in Modal's cloud, within your browser. It's a hosted Jupyter notebook with: + +- Serverless pricing and automatic idle shutdown +- Access to Modal GPUs and compute +- Real-time collaborative editing +- Python Intellisense/LSP support and AI autocomplete +- Support for rich and interactive outputs like images, widgets, and plots + +

## Getting started

Open [modal.com/notebooks](https://modal.com/notebooks) in your browser and create a new notebook. You can also upload an `.ipynb` file from your computer.

Once you create a notebook, you can start running cells. Try a simple statement like

```python
print("Hello, Modal!")
```

Or, import a library and create a plot:

```python notest
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-20, 20, 500)
plt.plot(np.cos(x / 3.7 + 0.3), x * np.sin(x))
```

The default notebook image comes with a number of Python packages pre-installed, so you can get started right away. Popular ones include PyTorch, NumPy, Pandas, JAX, Transformers, and Matplotlib. You can find the full image definition [here](https://github.com/modal-labs/modal-client/blob/v1.1.3/modal/experimental/__init__.py#L234-L342). If you need another package, just install it:

```shell
%uv pip install [my-package]
```

All output types work out-of-the-box, including rich HTML, images, [Jupyter Widgets](https://ipywidgets.readthedocs.io/en/latest/), and interactive plots.

## Kernel resources

Just like with Modal Functions, notebooks run in serverless containers. This means you pay only for the CPU cores and memory you use.

If you need more resources, you can change kernel settings in the sidebar. This lets you set the number of CPU cores, memory, and GPU type for your notebook. You can also set a timeout for idle shutdown, which defaults to 10 minutes.

Use any GPU type available in Modal, including up to 8 Nvidia A100s or H100s. You can switch the kernel configuration in seconds!

![Compute profile tab in notebook sidebar](https://modal-cdn.com/cdnbot/compute-profilev9rvmmvw_365a1197.webp)

Note that the CPU and memory settings are _reservations_, so you can usually burst above the request. For example, if you've set the notebook to have 0.5 CPU cores, you'll be billed for that reservation continuously, but you can burst up to however many cores are available on the machine (e.g., 32 CPUs), and you'll be billed for the usage above the reservation only while you use it.

### Notebook pricing

Modal Notebooks are priced simply, by compute usage while the kernel is running. See the [pricing page](https://modal.com/pricing) for rates. Currently, CPU and memory usage is priced at the same rates as Sandboxes. They appear in your [usage dashboard](https://modal.com/settings/usage) under "Sandboxes" as well.

Inactive notebooks do not incur any cost. You are only billed for time the notebook is actively running.

## Custom images, volumes, secrets, and cloud storage

Modal Notebooks support custom images, volumes, and secrets, just like Modal Functions. You can use these to install additional packages, mount persistent storage, or access secrets.

- To use a custom image, you need to have a [deployed Modal Function](https://modal.com/docs/guide/managing-deployments) using that image. Then, search for that function in the sidebar.
- To use a Secret, simply create a [Modal Secret](https://modal.com/secrets) using our wizard and attach it to the notebook, so it can be injected as an environment variable automatically.
- To use a Volume, create a [Modal Volume](https://modal.com/docs/guide/volumes) and attach it to the notebook. This lets you mount high-performance, persistent storage that can be shared across multiple notebooks or functions. They will appear as folders in the `/mnt` directory by default (see the sketch below).
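
For example, once a Volume is attached, anything written under its mount point persists across kernel restarts. A minimal sketch, where the `my-volume` name is hypothetical:

```python notest
# Assumes a Volume was attached in the sidebar under the hypothetical
# name "my-volume", so it appears at /mnt/my-volume in the notebook.
with open("/mnt/my-volume/results.csv", "w") as f:
    f.write("epoch,loss\n1,0.25\n")
```
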

### Creating a Custom Image

If you don't have a suitable deployed Modal App already, you can set up your environment to deploy custom images in under a minute using the Modal CLI. First, run `pip install modal`, and define your image in a file like:

```python
import modal

# Image definition here:
image = (
    modal.Image.from_registry("python:3.13-slim")
    .pip_install("requests", "numpy")
    .apt_install("curl", "wget")
    .run_commands(
        "echo 'foo' > /root/hello.txt",
        # ... other commands
    )
)

app = modal.App("notebook-images")

@app.function(image=image)  # You need a Function object to reference the image.
def notebook_image():
    pass
```

Then, run this command to build and deploy the image:

```bash
modal deploy notebook_images.py
```

For more information on custom images in Modal, see our [guide on defining images](https://modal.com/docs/guide/images).

(Advanced) Note that if you use the [`add_local_file()` or `add_local_dir()` functions](https://modal.com/docs/guide/images#add-local-files-with-add_local_dir-and-add_local_file), you'll need to pass `copy=True` for them to work in Modal Notebooks. This is because they skip creating a custom image and instead mount the files into the function at startup, which won't work in notebooks.

### Creating a Secret

Secrets can be created from the dashboard at [modal.com/secrets](https://modal.com/secrets). We have templates for common credential types, and they are saved as encrypted objects until container startup.

Attached secrets become available as environment variables in your notebook.

### Creating a Volume

[Volumes](https://modal.com/docs/guide/volumes) can be created via the files panel on the filesystem tab. This panel can also be used to attach existing Volumes from your Apps or Functions, including those created via the Modal CLI.

Any volumes are attached in the `/mnt` folder in your notebook, and files saved there will be persisted across kernel startups and elsewhere on Modal.

### Mounting Cloud Buckets

Modal Notebooks now support mounting cloud storage buckets, initially S3 buckets, directly to your notebook filesystem. This allows you to easily access large datasets stored in cloud storage from your notebooks.

To mount an S3 bucket:

1. Create a [Modal Secret](https://modal.com/secrets) containing your AWS credentials (AWS Access Key ID and Secret Access Key)
2. In the notebook sidebar's Files panel, use the Cloud Buckets section to attach your bucket
3. Specify:
   - The S3 bucket name
   - Mount path (e.g., `/mnt/s3/my-data`)
   - The AWS credentials secret stored in that environment
   - Optional: A key prefix to mount only a subset of objects (e.g., `datasets/`)
   - Optional: Set the mount as read-only

Once attached, your S3 bucket will be mounted at the specified path and accessible just like any other directory in your notebook.

For more information on using cloud bucket mounts with Modal, see the [CloudBucket mounts guide](https://modal.com/docs/guide/cloud-bucket-mounts).

## Access and sharing

Need a colleague—or the whole internet—to see your work? Just click **Share** in the top‑right corner of the notebook editor.

Notebooks are editable by you and teammates in your workspace. To make the notebook view-only to collaborators, the creator of the notebook can change access settings in the "Share" menu. Workspace managers are also allowed to change this setting.
+ +You can also turn on sharing by public, unlisted link. If you toggle this, it allows _anyone with the link_ to open the notebook, even if they are not logged in. Pick **Can view** (default) or **Can view and run** based on your preference. Viewers don’t need a Modal account, so this is perfect for collaborating with stakeholders outside your workspace. + +No matter how the notebook is shared, anyone with access can fork and run their own version of it. + +## Interactive file viewer + +The panel on the left-hand side of the notebook shows a **live view of the container’s filesystem**: + +| Feature | Details | +| ----------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **Browse & preview** | Click through folders to inspect any file that your code has created or downloaded. | +| **Upload & download** | Drag-and-drop files from your desktop, or click the **⬆** / **⬇** icons to add new data sets, notebooks, or models—or to save results back to your machine. | +| **One-click refresh** | Changes made by your code (for example, writing a CSV) appear instantly; hit the refresh icon if you want to force an update. | +| **Context-aware paths** | The viewer always reflects _exactly_ what your code sees (e.g. `/root`, `/mnt/…`), so you can double-check that that file you just wrote really landed where you expected. | + +**Important:** the underlying container is **ephemeral**. Anything stored outside an attached [Volume](https://modal.com/docs/guide/volumes) disappears when the kernel shuts down (after your idle-timeout or when you hit **Stop kernel**). Mount a Volume for data you want to keep across sessions. + +The viewer itself is only active while the kernel is running—if the notebook is stopped you’ll see an “empty” state until you start it again. + +## Editor features + +Modal Notebooks bundle the same productivity tooling you’d expect from a modern IDE. + +With Pyright, you get autocomplete, signature help, and on-hover documentation for every installed library. + +We also implemented AI-powered code completion using Anthropic's **Claude 4** model. This keeps you in the flow for everything from small snippets to multi-line functions. Just press `Tab` to accept suggestions or `Esc` to dismiss them. + +Familiar Jupyter shortcuts (`A`, `B`, `X`, `Y`, `M`, etc.) all work within the notebook, so you can quickly add new cells, delete existing ones, or change cell types. + +Finally, we have real-time collaborative editing, so you can work with your team in the same notebook. You can see other users' cursors and edits in real-time, and you can see when others are running cells with you. This makes it easy to pair program or review code together. + +## Widgets + +Modal Notebooks support [Jupyter Widgets](https://ipywidgets.readthedocs.io/en/latest/), which can be used to create interactive components living in the browser. Currently, Notebooks support all the widgets in the base `ipywidgets` package, except the following: + +- Media Widgets (`Audio`, `Video`), try using `IPython.display` outputs instead. +- `Play` +- Controllers (`ControllerAxis`, `ControllerButton`, `Controller`) + +Modal Notebooks do not support custom widget packages. + +## Cell magic + +Modal Notebooks have built-in support for the `%modal` cell magic. 
This lets you run code in any [deployed Modal Function or Cls](https://modal.com/docs/guide/trigger-deployed-functions), right from your notebook.

For example, if you have previously run `modal deploy` for an app like:

```python notest
import modal

app = modal.App("my-app")

@app.function()
def my_function(s: str):
    return len(s)
```

Then you could access this function from your notebook:

```python notest
%modal from my-app import my_function

my_function.remote("hello, world!")  # returns 13
```

Run `%modal` to see all options. This works for Cls as well, and you can import from different environments or alias them with the `as` keyword.

## Roadmap

The product is in beta, and we're planning to make a lot of improvements over the coming months. Some bigger features we have in mind:

- **Modal cloud integrations**
  - Expose ports with [Tunnels](https://modal.com/docs/guide/tunnels)
  - Memory snapshots to restore from past notebook sessions
  - Create notebooks from the `modal` CLI
  - Custom image registry
- **Notebook editor**
  - Interactive outline, collapsing sections by headings
  - Reactive cell execution
  - Edit history
  - Integrated debugger (pdb and `%debug`)
- **Documents and sharing**
  - Restore recently deleted notebooks
  - Folders and tags for grouping notebooks
  - Sync with Git repositories

Let us know via [Slack](https://modal.com/slack) if you have any feedback.

### Secrets and environment variables

#### Secrets

# Secrets

Securely provide credentials and other sensitive information to your Modal Functions with Secrets.

You can create and edit Secrets via
the [dashboard](https://modal.com/secrets),
the command line interface ([`modal secret`](https://modal.com/docs/reference/cli/secret)), and
programmatically from Python code ([`modal.Secret`](https://modal.com/docs/reference/modal.Secret)).

To inject Secrets into the container running your Function, add the
`secrets=[...]` argument to your `app.function` or `app.cls` decoration.

## Deploy Secrets from the Modal Dashboard

The most common way to create a Modal Secret is to use the
[Secrets panel of the Modal dashboard](https://modal.com/secrets),
which also shows any existing Secrets.

When you create a new Secret, you'll be prompted with a number of templates to help you get started.
These templates demonstrate standard formats for credentials for everything from Postgres and MongoDB
to Weights & Biases and Hugging Face.

## Use Secrets in your Modal Apps

You can then use your Secret by constructing it `from_name` when defining a Modal App
and then accessing its contents as environment variables.
For example, if you have a Secret called `secret-keys` containing the key
`MY_PASSWORD`:

```python
@app.function(secrets=[modal.Secret.from_name("secret-keys")])
def some_function():
    import os

    secret_key = os.environ["MY_PASSWORD"]
    ...
```

Each Secret can contain multiple keys and values but you can also inject
multiple Secrets, allowing you to separate Secrets into smaller reusable units:

```python
@app.function(secrets=[
    modal.Secret.from_name("my-secret-name"),
    modal.Secret.from_name("other-secret"),
])
def other_function():
    ...
```

The Secrets are applied in order, so key-values from later `modal.Secret`
objects in the list will overwrite earlier key-values in the case of a clash.
For example, if both `modal.Secret` objects above contained the key `FOO`, then
the value from `"other-secret"` would always be present in `os.environ["FOO"]`.

## Create Secrets programmatically

In addition to defining Secrets on the web dashboard, you can
programmatically create a Secret directly in your script and send it along to
your Function using `Secret.from_dict(...)`. This can be useful if you want to
send Secrets from your local development machine to the remote Modal App.

```python
import os

if modal.is_local():
    local_secret = modal.Secret.from_dict({"FOO": os.environ["LOCAL_FOO"]})
else:
    local_secret = modal.Secret.from_dict({})

@app.function(secrets=[local_secret])
def some_function():
    import os

    print(os.environ["FOO"])
```

If you have [`python-dotenv`](https://pypi.org/project/python-dotenv/) installed,
you can also use `Secret.from_dotenv()` to create a Secret from the variables in a `.env`
file:

```python
@app.function(secrets=[modal.Secret.from_dotenv()])
def some_other_function():
    print(os.environ["USERNAME"])
```

## Interact with Secrets from the command line

You can create, list, and delete your Modal Secrets with the `modal secret` command line interface.

View your Secrets and their timestamps with

```bash
modal secret list
```

Create a new Secret by passing `{KEY}={VALUE}` pairs to `modal secret create`:

```bash
modal secret create database-secret PGHOST=uri PGPORT=5432 PGUSER=admin PGPASSWORD=hunter2
```

or using environment variables (assuming below that the `PGPASSWORD` environment variable is set
e.g. by your CI system):

```bash
modal secret create database-secret PGHOST=uri PGPORT=5432 PGUSER=admin PGPASSWORD="$PGPASSWORD"
```

Remove Secrets by passing their name to `modal secret delete`:

```bash
modal secret delete database-secret
```

#### Environment variables

# Environment variables

The Modal runtime sets several environment variables during initialization. The
keys for these environment variables are reserved and cannot be overridden by
your Function or Sandbox configuration.

These variables provide information about the container's runtime
environment.

## Container runtime environment variables

The following variables are present in every Modal container:

- **`MODAL_CLOUD_PROVIDER`** — Modal executes containers across a number of cloud
  providers ([AWS](https://aws.amazon.com/), [GCP](https://cloud.google.com/),
  [OCI](https://www.oracle.com/cloud/)). This variable specifies which cloud
  provider the Modal container is running within.
- **`MODAL_IMAGE_ID`** — The ID of the
  [`modal.Image`](https://modal.com/docs/reference/modal.Image) used by the Modal container.
- **`MODAL_REGION`** — This will correspond to a geographic area identifier from
  the cloud provider associated with the Modal container (see above). For AWS, the
  identifier is a "region". For GCP it is a "zone", and for OCI it is an
  "availability domain". Example values are `us-east-1` (AWS), `us-central1`
  (GCP), `us-ashburn-1` (OCI). See the [full list here](https://modal.com/docs/guide/region-selection#region-options).
- **`MODAL_TASK_ID`** — The ID of the container running the Modal Function or Sandbox.

## Function runtime environment variables

The following variables are present in containers running Modal Functions:

- **`MODAL_ENVIRONMENT`** — The name of the
  [Modal Environment](https://modal.com/docs/guide/environments) the container is running within.
- **`MODAL_IS_REMOTE`** — Set to `1` to indicate that Modal Function code is running in
  a remote container.
- **`MODAL_IDENTITY_TOKEN`** — An [OIDC token](https://modal.com/docs/guide/oidc-integration)
  encoding the identity of the Modal Function.

## Sandbox environment variables

The following variables are present within [`modal.Sandbox`](https://modal.com/docs/reference/modal.Sandbox) instances.

- **`MODAL_SANDBOX_ID`** — The ID of the Sandbox.

## Container image environment variables

The container image layers used by a `modal.Image` may set
environment variables. These variables will be present within your container's runtime
environment. For example, the
[`debian_slim`](https://modal.com/docs/reference/modal.Image#debian_slim) image sets the
`GPG_KEY` variable.

To override image variables or set new ones, use the
[`.env`](https://modal.com/docs/reference/modal.Image#env) method provided by
`modal.Image`.

### Scheduling and cron jobs

# Scheduling remote cron jobs

A common requirement is to perform some task at a given time every day or week
automatically. Modal facilitates this through function schedules.

## Basic scheduling

Let's say we have a Python module `heavy.py` with a function,
`perform_heavy_computation()`.

```python
# heavy.py
def perform_heavy_computation():
    ...

if __name__ == "__main__":
    perform_heavy_computation()
```

To schedule this function to run once per day, we create a Modal App and attach
our function to it with the `@app.function` decorator and a schedule parameter:

```python
# heavy.py
import modal

app = modal.App()

@app.function(schedule=modal.Period(days=1))
def perform_heavy_computation():
    ...
```

To activate the schedule, deploy your app, either through the CLI:

```shell
modal deploy --name daily_heavy heavy.py
```

Or programmatically:

```python
if __name__ == "__main__":
    app.deploy()
```

Now the function will run every day, at the time of the initial deployment,
without any further interaction on your part.

When you make changes to your function, just rerun the deploy command to
overwrite the old deployment.

Note that when you redeploy your function, the `modal.Period` schedule resets,
and the next run will take place one full period after the most recent
deployment (e.g., 24 hours later for `Period(days=1)`).

If you want to run your function at a regular schedule not disturbed by deploys,
`modal.Cron` (see below) is a better option.

## Monitoring your scheduled runs

To see past execution logs for the scheduled function, go to the
[Apps](https://modal.com/apps) section on the Modal website.

Schedules currently cannot be paused. Instead, the schedule should be removed and
the app redeployed. Schedules can be started manually on the app's dashboard
page, using the "run now" button.

## Schedule types

There are two kinds of base schedule values -
[`modal.Period`](https://modal.com/docs/reference/modal.Period) and
[`modal.Cron`](https://modal.com/docs/reference/modal.Cron).

[`modal.Period`](https://modal.com/docs/reference/modal.Period) lets you specify an interval
between function calls, e.g. `Period(days=1)` or `Period(hours=5)`:

```python
# runs once every 5 hours
@app.function(schedule=modal.Period(hours=5))
def perform_heavy_computation():
    ...
```

[`modal.Cron`](https://modal.com/docs/reference/modal.Cron) gives you finer control using
[cron](https://en.wikipedia.org/wiki/Cron) syntax:

```python
# runs at 8 am (UTC) every Monday
@app.function(schedule=modal.Cron("0 8 * * 1"))
def perform_heavy_computation():
    ...

# runs daily at 6 am (New York time)
@app.function(schedule=modal.Cron("0 6 * * *", timezone="America/New_York"))
def send_morning_report():
    ...
```

For more details, see the API reference for
[Period](https://modal.com/docs/reference/modal.Period), [Cron](https://modal.com/docs/reference/modal.Cron), and
[Function](https://modal.com/docs/reference/modal.Function).

### Web endpoints

#### Web endpoints

# Web endpoints

This guide explains how to set up web endpoints with Modal.

All deployed Modal Functions can be [invoked from any other Python application](https://modal.com/docs/guide/trigger-deployed-functions)
using the Modal client library. We additionally provide multiple ways to expose
your Functions over the web for non-Python clients.

You can [turn any Python function into a web endpoint](#simple-endpoints) with a single line
of code, you can [serve a full app](#serving-asgi-and-wsgi-apps) using
frameworks like FastAPI, Django, or Flask, or you can
[serve anything that speaks HTTP and listens on a port](#non-asgi-web-servers).

Below we walk through each method, assuming you're familiar with web applications outside of Modal.
For a detailed walkthrough of basic web endpoints on Modal aimed at developers new to web applications,
see [this tutorial](https://modal.com/docs/examples/basic_web).

## Simple endpoints

The easiest way to create a web endpoint from an existing Python function is to use the
[`@modal.fastapi_endpoint` decorator](https://modal.com/docs/reference/modal.fastapi_endpoint).

```python
image = modal.Image.debian_slim().pip_install("fastapi[standard]")

@app.function(image=image)
@modal.fastapi_endpoint()
def f():
    return "Hello world!"
```

This decorator wraps the Modal Function in a
[FastAPI application](#how-do-web-endpoints-run-in-the-cloud).

_Note: Prior to v0.73.82, this function was named `@modal.web_endpoint`_.

### Developing with `modal serve`

You can run this code as an ephemeral app by running the command

```shell
modal serve server_script.py
```

where `server_script.py` is the file name of your code. This will create an
ephemeral app for the duration of your script (until you hit Ctrl-C to stop it).
It creates a temporary URL that you can use like any other REST endpoint. This
URL is on the public internet.

The `modal serve` command will live-update an app when any of its supporting
files change.

Live updating is particularly useful when working with apps containing web
endpoints, as any changes made to web endpoint handlers will show up almost
immediately, without requiring a manual restart of the app.

### Deploying with `modal deploy`

You can also deploy your app and create a persistent web endpoint in the cloud
by running `modal deploy` on the same file.

### Passing arguments to an endpoint

When using `@modal.fastapi_endpoint`, you can add
[query parameters](https://fastapi.tiangolo.com/tutorial/query-params/) which
will be passed to your Function as arguments.
For instance + +```python +image = modal.Image.debian_slim().pip_install("fastapi[standard]") + +@app.function(image=image) +@modal.fastapi_endpoint() +def square(x: int): + return {"square": x**2} +``` + +If you hit this with a URL-encoded query string with the `x` parameter present, +the Function will receive the value as an argument: + +``` +$ curl https://modal-labs--web-endpoint-square-dev.modal.run?x=42 +{"square":1764} +``` + +If you want to use a `POST` request, you can use the `method` argument to +`@modal.fastapi_endpoint` to set the HTTP verb. To accept any valid JSON object, +[use `dict` as your type annotation](https://fastapi.tiangolo.com/tutorial/body-nested-models/?h=dict#bodies-of-arbitrary-dicts) +and FastAPI will handle the rest. + +```python +image = modal.Image.debian_slim().pip_install("fastapi[standard]") + +@app.function(image=image) +@modal.fastapi_endpoint(method="POST") +def square(item: dict): + return {"square": item['x']**2} +``` + +This now creates an endpoint that takes a JSON body: + +``` +$ curl -X POST -H 'Content-Type: application/json' --data-binary '{"x": 42}' https://modal-labs--web-endpoint-square-dev.modal.run +{"square":1764} +``` + +This is often the easiest way to get started, but note that FastAPI recommends +that you use +[typed Pydantic models](https://fastapi.tiangolo.com/tutorial/body/) in order to +get automatic validation and documentation. FastAPI also lets you pass data to +web endpoints in other ways, for instance as +[form data](https://fastapi.tiangolo.com/tutorial/request-forms/) and +[file uploads](https://fastapi.tiangolo.com/tutorial/request-files/). + +## How do web endpoints run in the cloud? + +Note that web endpoints, like everything else on Modal, only run when they need +to. When you hit the web endpoint the first time, it will boot up the container, +which might take a few seconds. Modal keeps the container alive for a short +period in case there are subsequent requests. If there are a lot of requests, +Modal might create more containers running in parallel. + +For the shortcut `@modal.fastapi_endpoint` decorator, Modal wraps your function in a +[FastAPI](https://fastapi.tiangolo.com/) application. This means that the +[Image](https://modal.com/docs/guide/images) +your Function uses must have FastAPI installed, and the Functions that you write +need to follow its request and response +[semantics](https://fastapi.tiangolo.com/tutorial). Web endpoint Functions can use +all of FastAPI's powerful features, such as Pydantic models for automatic validation, +typed query and path parameters, and response types. + +Here's everything together, combining Modal's abilities to run functions in +user-defined containers with the expressivity of FastAPI: + +```python +import modal +from fastapi.responses import HTMLResponse +from pydantic import BaseModel + +image = modal.Image.debian_slim().pip_install("fastapi[standard]", "boto3") +app = modal.App(image=image) + +class Item(BaseModel): + name: str + qty: int = 42 + +@app.function() +@modal.fastapi_endpoint(method="POST") +def f(item: Item): + import boto3 + # do things with boto3... 
    return HTMLResponse(f"Hello, {item.name}!")
```

This endpoint definition would be called like so:

```bash
curl -d '{"name": "Erik", "qty": 10}' \
    -H "Content-Type: application/json" \
    -X POST https://ecorp--web-demo-f-dev.modal.run
```

Or in Python with the [`requests`](https://pypi.org/project/requests/) library:

```python
import requests

data = {"name": "Erik", "qty": 10}
requests.post("https://ecorp--web-demo-f-dev.modal.run", json=data, timeout=10.0)
```

## Serving ASGI and WSGI apps

You can also serve any app written in an
[ASGI](https://asgi.readthedocs.io/en/latest/) or
[WSGI](https://en.wikipedia.org/wiki/Web_Server_Gateway_Interface)-compatible
web framework on Modal.

ASGI provides support for async web frameworks. WSGI provides support for
synchronous web frameworks.

### ASGI apps - FastAPI, FastHTML, Starlette

For ASGI apps, you can create a function decorated with
[`@modal.asgi_app`](https://modal.com/docs/reference/modal.asgi_app) that returns a reference to
your web app:

```python
image = modal.Image.debian_slim().pip_install("fastapi[standard]")

@app.function(image=image)
@modal.concurrent(max_inputs=100)
@modal.asgi_app()
def fastapi_app():
    from fastapi import FastAPI, Request

    web_app = FastAPI()

    @web_app.post("/echo")
    async def echo(request: Request):
        body = await request.json()
        return body

    return web_app
```

Now, as before, when you deploy this script as a Modal App, you get a URL for
your app that you can hit.

The `@modal.concurrent` decorator enables a single container
to process multiple inputs at once, taking advantage of the asynchronous
event loops in ASGI applications. See [this guide](https://modal.com/docs/guide/concurrent-inputs)
for details.

#### ASGI Lifespan

While we recommend using [`@modal.enter`](https://modal.com/docs/guide/lifecycle-functions#enter) for defining container lifecycle hooks, we also support the [ASGI lifespan protocol](https://asgi.readthedocs.io/en/latest/specs/lifespan.html). Lifespans begin when containers start, typically at the time of the first request. Here's an example using [FastAPI](https://fastapi.tiangolo.com/advanced/events/#lifespan):

```python
import modal

app = modal.App("fastapi-lifespan-app")

image = modal.Image.debian_slim().pip_install("fastapi[standard]")

@app.function(image=image)
@modal.asgi_app()
def fastapi_app_with_lifespan():
    from contextlib import asynccontextmanager

    from fastapi import FastAPI, Request

    @asynccontextmanager
    async def lifespan(wapp: FastAPI):
        print("Starting")
        yield
        print("Shutting down")

    web_app = FastAPI(lifespan=lifespan)

    @web_app.get("/")
    async def hello(request: Request):
        return "hello"

    return web_app
```

### WSGI apps - Django, Flask

You can serve WSGI apps using the
[`@modal.wsgi_app`](https://modal.com/docs/reference/modal.wsgi_app) decorator:

```python
image = modal.Image.debian_slim().pip_install("flask")

@app.function(image=image)
@modal.concurrent(max_inputs=100)
@modal.wsgi_app()
def flask_app():
    from flask import Flask, request

    web_app = Flask(__name__)

    @web_app.post("/echo")
    def echo():
        return request.json

    return web_app
```

See [Flask's docs](https://flask.palletsprojects.com/en/2.1.x/deploying/asgi/)
for more information on using Flask as a WSGI app.

Because WSGI apps are synchronous, concurrent inputs will be run on separate
threads. See [this guide](https://modal.com/docs/guide/concurrent-inputs) for details.
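
Although the example above uses Flask, Django follows the same pattern: return the
project's WSGI application object from the decorated Function. Here's a minimal
sketch, assuming a hypothetical Django project package `mysite` (containing a
`mysite/settings.py`) that has been added to the Image with `add_local_dir`:

```python
image = (
    modal.Image.debian_slim()
    .pip_install("django")
    .add_local_dir("mysite", remote_path="/root/mysite")
)

@app.function(image=image)
@modal.concurrent(max_inputs=100)
@modal.wsgi_app()
def django_app():
    import os
    import sys

    sys.path.insert(0, "/root")  # make the `mysite` package importable
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "mysite.settings")

    from django.core.wsgi import get_wsgi_application
    return get_wsgi_application()
```
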
## Non-ASGI web servers

Not all web frameworks offer an ASGI or WSGI interface. For example,
[`aiohttp`](https://docs.aiohttp.org/) and [`tornado`](https://www.tornadoweb.org/)
use their own asynchronous network binding, while others like
[`text-generation-inference`](https://github.com/huggingface/text-generation-inference)
actually expose a Rust-based HTTP server running as a subprocess.

For these cases, you can use the
[`@modal.web_server`](https://modal.com/docs/reference/modal.web_server) decorator to "expose" a
port on the container:

```python
@app.function()
@modal.concurrent(max_inputs=100)
@modal.web_server(8000)
def my_file_server():
    import subprocess
    subprocess.Popen("python -m http.server -d / 8000", shell=True)
```

Just like all web endpoints on Modal, this is only run on-demand. The function
is executed on container startup, creating a file server at the root directory.
When you hit the web endpoint URL, your request will be routed to the file
server listening on port `8000`.

For `@web_server` endpoints, you need to make sure that the application binds to
the external network interface, not just localhost. This usually means binding
to `0.0.0.0` instead of `127.0.0.1`.

See our examples of how to serve [Streamlit](https://modal.com/docs/examples/serve_streamlit) and
[ComfyUI](https://modal.com/docs/examples/comfyapp) on Modal.

## Serve many configurations with parametrized functions

Python functions that launch ASGI/WSGI apps or web servers on Modal
cannot take arguments.

One simple pattern for allowing client-side configuration of these web endpoints
is to use [parametrized functions](https://modal.com/docs/guide/parametrized-functions).
Each different choice for the values of the parameters will create a distinct
auto-scaling container pool.

```python
@app.cls()
@modal.concurrent(max_inputs=100)
class Server:
    root: str = modal.parameter(default=".")

    @modal.web_server(8000)
    def files(self):
        import subprocess
        subprocess.Popen(f"python -m http.server -d {self.root} 8000", shell=True)
```

The values are provided in URLs as query parameters:

```bash
curl https://ecorp--server-files.modal.run  # use the default value
curl https://ecorp--server-files.modal.run?root=.cache  # use a different value
curl https://ecorp--server-files.modal.run?root=%2F  # don't forget to URL encode!
```

For details, see [this guide to parametrized functions](https://modal.com/docs/guide/parametrized-functions).

## WebSockets

Functions annotated with `@web_server`, `@asgi_app`, or `@wsgi_app` also support
the WebSocket protocol. Consult your web framework's documentation for how to
use WebSockets with it.

WebSockets on Modal maintain a single function call per connection, which can be
useful for keeping state around. Most of the time, you will want to set your
handler function to [allow concurrent inputs](https://modal.com/docs/guide/concurrent-inputs),
which allows multiple simultaneous WebSocket connections to be handled by the
same container.

We support the full WebSocket protocol as per
[RFC 6455](https://www.rfc-editor.org/rfc/rfc6455), but we do not yet have
support for [RFC 8441](https://www.rfc-editor.org/rfc/rfc8441) (WebSockets over
HTTP/2) or [RFC 7692](https://datatracker.ietf.org/doc/html/rfc7692)
(`permessage-deflate` extension). WebSocket messages can be up to 2 MiB each.
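
For example, here's a minimal sketch of a WebSocket echo endpoint using FastAPI's
WebSocket support with the `@modal.asgi_app` pattern from above (the `/ws` route
name and echo behavior are illustrative, not from the Modal docs):

```python
image = modal.Image.debian_slim().pip_install("fastapi[standard]")

@app.function(image=image)
@modal.concurrent(max_inputs=100)  # allow many connections per container
@modal.asgi_app()
def websocket_app():
    from fastapi import FastAPI, WebSocket, WebSocketDisconnect

    web_app = FastAPI()

    @web_app.websocket("/ws")
    async def echo(websocket: WebSocket):
        await websocket.accept()
        try:
            # Each connection holds one function call; loop until the client leaves.
            while True:
                message = await websocket.receive_text()
                await websocket.send_text(f"echo: {message}")
        except WebSocketDisconnect:
            pass

    return web_app
```
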
## Performance and scaling

If you have no active containers when the web endpoint receives a request, it will
experience a "cold start". Consult the guide page on
[cold start performance](https://modal.com/docs/guide/cold-start) for more information on when
Functions will cold start and advice on how to mitigate the impact.

If your Function uses `@modal.concurrent`, multiple requests to the same
endpoint may be handled by the same container, up to its configured input
concurrency. Beyond that limit, additional
containers will start up to scale your App horizontally. When you reach the
Function's limit on containers, requests will queue for handling.

Each workspace on Modal has a rate limit on total operations. For a new account,
this is set to 200 function inputs or web endpoint requests per second, with a
burst multiplier of 5 seconds. If you reach the rate limit, excess requests to
web endpoints will return a
[429 status code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429),
and you'll need to [get in touch](mailto:support@modal.com) with us about
raising the limit.

Web endpoint request bodies can be up to 4 GiB, and their response bodies are
unlimited in size.

## Authentication

Modal offers first-class web endpoint protection via [proxy auth tokens](https://modal.com/docs/guide/webhook-proxy-auth).
Proxy auth tokens protect web endpoints by requiring a key and token combination to be passed
in the `Modal-Key` and `Modal-Secret` headers.
Modal works as a proxy, rejecting requests that aren't authorized to access
your endpoint.

We also support standard techniques for securing web servers.

### Token-based authentication

This is easy to implement in whichever framework you're using. For example, if
you're using `@modal.fastapi_endpoint` or `@modal.asgi_app` with FastAPI, you
can validate a Bearer token like this:

```python
from fastapi import Depends, HTTPException, status, Request
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials

import modal

image = modal.Image.debian_slim().pip_install("fastapi[standard]")
app = modal.App("auth-example", image=image)

auth_scheme = HTTPBearer()

@app.function(secrets=[modal.Secret.from_name("my-web-auth-token")])
@modal.fastapi_endpoint()
async def f(request: Request, token: HTTPAuthorizationCredentials = Depends(auth_scheme)):
    import os

    print(os.environ["AUTH_TOKEN"])

    if token.credentials != os.environ["AUTH_TOKEN"]:
        raise HTTPException(
            status_code=status.HTTP_401_UNAUTHORIZED,
            detail="Incorrect bearer token",
            headers={"WWW-Authenticate": "Bearer"},
        )

    # Function body
    return "success!"
```

This assumes you have a [Modal Secret](https://modal.com/secrets) named
`my-web-auth-token` created, with contents `{AUTH_TOKEN: secret-random-token}`.
Now, your endpoint will return a 401 status code except when you hit it with the
correct `Authorization` header set (note that you have to prefix the token with
`Bearer `):

```bash
curl --header "Authorization: Bearer secret-random-token" https://modal-labs--auth-example-f.modal.run
```
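
The same authorized request can be made from Python with the
[`requests`](https://pypi.org/project/requests/) library; a minimal sketch,
reusing the placeholder token from above:

```python
import requests

response = requests.get(
    "https://modal-labs--auth-example-f.modal.run",
    headers={"Authorization": "Bearer secret-random-token"},
    timeout=10.0,
)
response.raise_for_status()
print(response.json())
```
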
### Client IP address

You can access the IP address of the client making the request. This can be used
for geolocation, whitelists, blacklists, and rate limits.

```python
from fastapi import Request

import modal

image = modal.Image.debian_slim().pip_install("fastapi[standard]")
app = modal.App(image=image)

@app.function()
@modal.fastapi_endpoint()
def get_ip_address(request: Request):
    return f"Your IP address is {request.client.host}"
```

#### Streaming endpoints

# Streaming endpoints

Modal web endpoints support streaming responses using FastAPI's
[`StreamingResponse`](https://fastapi.tiangolo.com/advanced/custom-response/#streamingresponse)
class. This class accepts asynchronous generators, synchronous generators, or
any Python object that implements the
[_iterator protocol_](https://docs.python.org/3/library/stdtypes.html#typeiter),
and can be used with Modal Functions!

## Simple example

This simple example combines Modal's `@modal.fastapi_endpoint` decorator with a
`StreamingResponse` object to produce a real-time SSE response.

```python
import time

def fake_event_streamer():
    for i in range(10):
        yield f"data: some data {i}\n\n".encode()
        time.sleep(0.5)

@app.function(image=modal.Image.debian_slim().pip_install("fastapi[standard]"))
@modal.fastapi_endpoint()
def stream_me():
    from fastapi.responses import StreamingResponse
    return StreamingResponse(
        fake_event_streamer(), media_type="text/event-stream"
    )
```

If you serve this web endpoint and hit it with `curl`, you will see the ten SSE
events progressively appear in your terminal over a ~5 second period.

```shell
curl --no-buffer https://modal-labs--example-streaming-stream-me.modal.run
```

The MIME type of `text/event-stream` is important in this example: it tells the
downstream web server to pass responses along immediately rather than buffering
them into larger byte chunks (which is more efficient for compression but would
break real-time delivery).

You can still return other content types like large files in streams, but they
are not guaranteed to arrive as real-time events.

## Streaming responses with `.remote_gen`

A Modal Function wrapping a generator function body can have its response passed
directly into a `StreamingResponse`. This is particularly useful if you want to
do some GPU processing in one Modal Function that is called by a CPU-based web
endpoint Modal Function.

```python
@app.function(gpu="any")
def fake_video_render():
    for i in range(10):
        yield f"data: finished processing some data from GPU {i}\n\n".encode()
        time.sleep(1)

@app.function(image=modal.Image.debian_slim().pip_install("fastapi[standard]"))
@modal.fastapi_endpoint()
def hook():
    from fastapi.responses import StreamingResponse
    return StreamingResponse(
        fake_video_render.remote_gen(), media_type="text/event-stream"
    )
```

## Streaming responses with `.map` and `.starmap`

You can also combine Modal Function parallelization with streaming responses,
enabling applications to service a request by farming out to dozens of
containers and iteratively returning result chunks to the client.

```python
@app.function()
def map_me(i):
    return f"segment {i}\n"

@app.function(image=modal.Image.debian_slim().pip_install("fastapi[standard]"))
@modal.fastapi_endpoint()
def mapped():
    from fastapi.responses import StreamingResponse
    return StreamingResponse(map_me.map(range(10)), media_type="text/plain")
```

This snippet will spread the ten `map_me(i)` executions across containers, and
return each string response part as it completes. By default, the results will be
ordered; if that isn't necessary, pass `order_outputs=False` as a keyword
argument to the `.map` call.
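
`.starmap` works the same way for Functions that take multiple arguments,
unpacking each tuple from the input iterable. Here's a minimal sketch along the
same lines; the two-argument `render_segment` Function is hypothetical:

```python
@app.function()
def render_segment(prefix: str, i: int):
    return f"{prefix} {i}\n"

@app.function(image=modal.Image.debian_slim().pip_install("fastapi[standard]"))
@modal.fastapi_endpoint()
def starmapped():
    from fastapi.responses import StreamingResponse

    args = [("segment", i) for i in range(10)]
    return StreamingResponse(render_segment.starmap(args), media_type="text/plain")
```
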

### Asynchronous streaming

The example above uses a synchronous generator, which automatically runs on its
own thread. In asynchronous applications, however, a loop over a `.map` or `.starmap`
call can block the event loop. This will stop the `StreamingResponse` from
returning response parts iteratively to the client.

To avoid this, you can use the `.aio()` method to convert a synchronous `.map`
into its async version. Also, other blocking calls should be offloaded to a
separate thread with `asyncio.to_thread()`. For example:

```python
import asyncio

# A hypothetical GPU Function that transcribes one video segment;
# `split_video` below is likewise a stand-in for your own segmentation logic.
@app.function(gpu="any")
def transcribe_segment(segment):
    ...

@app.function(image=modal.Image.debian_slim().pip_install("fastapi[standard]"))
@modal.fastapi_endpoint()
async def transcribe_video(request):
    from fastapi.responses import StreamingResponse

    segments = await asyncio.to_thread(split_video, request)
    return StreamingResponse(wrapper(segments), media_type="text/event-stream")

# Notice that this is an async generator.
async def wrapper(segments):
    async for partial_result in transcribe_segment.map.aio(segments):
        yield "data: " + partial_result + "\n\n"
```

## Further examples

- Complete code for the simple examples given above is available
  [in our modal-examples GitHub repository](https://github.com/modal-labs/modal-examples/blob/main/07_web_endpoints/streaming.py).
- [An end-to-end example of streaming YouTube video transcriptions with OpenAI's Whisper model.](https://github.com/modal-labs/modal-examples/blob/main/06_gpu_and_ml/openai_whisper/streaming/main.py)

#### Web endpoint URLs

# Web endpoint URLs

This guide documents the behavior of URLs for [web endpoints](https://modal.com/docs/guide/webhooks)
on Modal: automatic generation, configuration, programmatic retrieval, and more.

## Determine the URL of a web endpoint from code

Modal Functions with the
[`fastapi_endpoint`](https://modal.com/docs/reference/modal.fastapi_endpoint),
[`asgi_app`](https://modal.com/docs/reference/modal.asgi_app),
[`wsgi_app`](https://modal.com/docs/reference/modal.wsgi_app),
or [`web_server`](https://modal.com/docs/reference/modal.web_server) decorator
are made available over the Internet when they are
[`serve`d](https://modal.com/docs/reference/cli/serve) or [`deploy`ed](https://modal.com/docs/reference/cli/deploy),
and so they have a URL.

This URL is displayed in the `modal` CLI output
and is available in the Modal [dashboard](https://modal.com/apps) for the Function.

To determine a Function's URL programmatically,
call its [`get_web_url()`](https://modal.com/docs/reference/modal.Function#get_web_url)
method:

```python
@app.function(image=modal.Image.debian_slim().pip_install("fastapi[standard]"))
@modal.fastapi_endpoint(docs=True)
def show_url() -> str:
    return show_url.get_web_url()
```

For deployed Functions, this also works from other Python code!
You just need to look the Function up with [`from_name`](https://modal.com/docs/reference/modal.Function#from_name),
using the name of the Function and its [App](https://modal.com/docs/guide/apps):

```python notest
import modal
import requests

remote_function = modal.Function.from_name("app", "show_url")
assert remote_function.get_web_url() == requests.get(remote_function.get_web_url()).json()
```

## Auto-generated URLs

By default, Modal Functions
will be served from the `modal.run` domain.
The full URL will be constructed from a number of pieces of information
to uniquely identify the endpoint.

At a high level, web endpoint URLs for deployed applications have the
following structure: `https://<source>--<label>.modal.run`.