GS-Scale: Unlocking Large-Scale 3D Gaussian Splatting Training via Host Offloading [Paper]
GS-Scale is a fast, memory-efficient, and scalable training framework for large-scale 3DGS. To reduce GPU memory usage, GS-Scale stores all Gaussian parameters and optimizer states in host memory, transferring only the subset needed for each forward and backward pass to the GPU on demand. With various system-level optimizations, GS-Scale lowers GPU memory demands by 3.3-5.6x while achieving training speeds comparable to GPU-only systems, enabling large-scale 3DGS training on consumer-grade GPUs.
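The gather-update-scatter pattern behind host offloading can be sketched schematically in NumPy. This is an illustration only, with made-up sizes and a simple momentum update; the actual system transfers the subset to the GPU (e.g. from pinned host memory with asynchronous copies) rather than updating it in place on the host.

```python
import numpy as np

# Schematic of the host-offloading pattern (illustrative, not the real implementation).
# All Gaussian parameters and optimizer states live in host memory; only the
# subset touched by the current view is gathered, updated, and scattered back.

N, D = 10_000, 59                                     # total Gaussians, params per Gaussian (illustrative)
host_params = np.zeros((N, D), dtype=np.float32)      # full parameter store (host)
host_momentum = np.zeros((N, D), dtype=np.float32)    # optimizer state (host)

def train_step(visible_ids, grads, lr=0.01, beta=0.9):
    """Gather the visible subset, apply a momentum-SGD update, scatter back."""
    params = host_params[visible_ids]     # host -> "GPU": only the subset moves
    m = host_momentum[visible_ids]
    m = beta * m + grads                  # optimizer update on the subset only
    params -= lr * m
    host_params[visible_ids] = params     # "GPU" -> host: write the subset back
    host_momentum[visible_ids] = m

visible = np.array([0, 5, 42])            # Gaussians hit by this view's frustum
train_step(visible, grads=np.ones((3, D), dtype=np.float32))
```

The key point is that per-step work and transfer volume scale with the number of visible Gaussians, not the total scene size, which is why the full model can stay resident in (much larger) host memory.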
source install.sh
Note:
We recommend using CUDA 12.x. Currently, only Intel CPUs are supported.
Download the dataset and organize image/colmap folder as follows.
├── data
│ ├── colmap_results
│ │ ├── mill19
│ │ │ ├── rubble-pixsfm
│ │ │ │ ├── images
│ │ │ │ ├── sparse
│ │ │ │ │ ├── 0
│ │ │ │ │ │ ├── cameras.bin
│ │ │ │ │ │ ├── points3D.bin
│ │ │ │ │ │ ├── images.bin
│ │ │ ├── building-pixsfm
│ │ │ │ ├── images
│ │ │ │ ├── sparse
│ │ ├── GauU_Scene
│ │ │ ├── LFLS
│ │ │ │ ├── images
│ │ │ │ ├── sparse
│ │ │ ├── SZTU
│ │ │ │ ├── images
│ │ │ │ ├── sparse
│ │ │ ├── SZIIT
│ │ │ │ ├── images
│ │ │ │ ├── sparse
│ │ ├── MatrixCity
│ │ │ ├── aerial
│ │ │ │ ├── images
│ │ │ │ ├── sparse
│ │ │ │ │ ├── 0
│ │ │ │ │ │ ├── cameras.bin
│ │ │ │ │ │ ├── points3D.bin
│ │ │ │ │ │ ├── images.bin
│ │ │ │ │ │ ├── points.ply
Download the Rubble and Building datasets. We use a downsampling rate of 4, following CityGaussian. Downsampling is performed automatically.
We downloaded the preprocessed colmap from CityGaussian. We merge the train and test sets into a single directory for both images and colmap (so that "test_every" can be used). We provide the merged colmap files via a Google Drive link.
Download the dataset and colmap from GauU Scene. We use test_every=10 and a downsampling rate of 3.4175, following CityGaussian. Downsampling is performed automatically.
Download the MatrixCity dataset. We use a downsampling rate of 1.2, following CityGaussian. Downsampling is performed automatically. We merge the images from the train and test sets to use "test_after=5620", which means that scenes with ids greater than 5620 are used as the test set. You will need to rename the image files.
For Gaussian initialization, you can either use the initial point cloud from ground-truth depth (matrixcity_point_cloud_ds20.zip link) or the colmap from the CityGaussian repo. We use the point cloud from ground-truth depth in our experiments. You may set the "init_ply" option to False to use colmap. We provide both files via a Google Drive link.
Original gsplat training script. We recommend using the original script when training small scenes; it is typically faster than the host-offloading versions as long as you do not encounter Out-of-Memory (OOM) issues.
CUDA_VISIBLE_DEVICES=0 python simple_trainer.py [rubble, building, sztu, lfls, sziit, aerial] --data_dir [path-to-dataset] --result_dir [path-to-result-dir]
A naive version of host-offloading training. This saves a substantial amount of GPU memory but is extremely slow. We do not recommend using this script.
CUDA_VISIBLE_DEVICES=0 python simple_trainer_hybrid_baseline.py [rubble, building, sztu, lfls, sziit, aerial] --data_dir [path-to-dataset] --result_dir [path-to-result-dir]
An optimized version of host-offloading training. We recommend this script for training large scenes on a single GPU. It prevents Out-of-Memory (OOM) errors while maintaining a training speed similar to GPU-only training.
Consider using split mode if the optimized version still results in OOM errors. It splits images that consume more Gaussians than split_threshold * total_gaussians to further reduce GPU memory usage. This does not affect training quality. A smaller split_threshold saves more memory but may slow down training. The default value is 0.3.
# w/o split mode
CUDA_VISIBLE_DEVICES=0 python simple_trainer_hybrid_optimized.py [rubble, building, sztu, lfls, sziit, aerial] --data_dir [path-to-dataset] --result_dir [path-to-result-dir]
# w/ split mode
CUDA_VISIBLE_DEVICES=0 python simple_trainer_hybrid_optimized_split.py [rubble, building, sztu, lfls, sziit, aerial] --data_dir [path-to-dataset] --result_dir [path-to-result-dir] --split_threshold [split-threshold]
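The split-mode criterion described above amounts to a simple threshold check per image. The helper below is a hypothetical sketch of that check (the actual logic lives in simple_trainer_hybrid_optimized_split.py, and its function names may differ):

```python
# Hypothetical sketch of the split-mode criterion: an image whose view touches
# more than split_threshold * total_gaussians Gaussians is rendered in pieces,
# so only a fraction of the scene's Gaussians is resident on the GPU at once.

def needs_split(num_visible_gaussians, total_gaussians, split_threshold=0.3):
    """Return True if this image should be rendered in split mode."""
    return num_visible_gaussians > split_threshold * total_gaussians

total = 10_000_000
print(needs_split(4_000_000, total))   # 40% of Gaussians visible -> True, split
print(needs_split(1_000_000, total))   # 10% visible -> False, render normally
```

Lowering split_threshold makes more images qualify for splitting, trading some training speed for a lower GPU memory peak.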
We define the configs for each dataset at the end of the training script. You may change hyperparameters there.
We use the default learning rate settings from the original 3DGS paper, except for position_lr and scaling_lr. Decreasing these values helps improve rendering quality. You may consider decreasing them further for larger scenes.
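As a hypothetical illustration of this per-scene tuning: keep the 3DGS defaults and divide the two sensitive learning rates by a scene-size factor. The names, base values (taken from the official 3DGS defaults), and scaling factor below are illustrative, not the repo's actual configs.

```python
# Illustrative sketch of tuning position_lr and scaling_lr for larger scenes.
# Base values follow the official 3DGS defaults; everything else here is made up.

BASE_3DGS_LR = {
    "position_lr": 1.6e-4,   # 3DGS default (position_lr_init)
    "scaling_lr": 5e-3,      # 3DGS default
}

def tuned_lrs(scene_scale_factor):
    """Divide the two sensitive learning rates by a scene-size factor."""
    return {k: v / scene_scale_factor for k, v in BASE_3DGS_LR.items()}

print(tuned_lrs(10))  # e.g. a large city-scale scene
```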
We follow Grendel's methodology to scale up or scale down the Gaussian counts for each scene. See Appendix C.3 for more details. Example densification settings are shown in our training script.
This repository is built on the gsplat library.
@inproceedings{gsscale,
  title={GS-Scale: Unlocking Large-Scale 3D Gaussian Splatting Training via Host Offloading},
  author={Donghyun Lee and Dawoon Jeong and Jae W. Lee and Hongil Yoon},
  booktitle={31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems ({ASPLOS} 26)},
  year={2026},
}