A Rust library integrated with ONNXRuntime, providing a collection of Computer Vision and Vision-Language models such as YOLO, FastVLM, and more.

jamjamjon/usls

usls


📘 API Documentation | 🌟 Examples | 📦 Model Zoo


usls is a cross-platform Rust library powered by ONNX Runtime for efficient inference of SOTA vision and vision-language models (typically under 1B parameters).

🌟 Highlights

  • ⚡ High Performance: Multi-threading, SIMD, and CUDA-accelerated processing
  • 🌐 Cross-Platform: Linux, macOS, Windows with ONNX Runtime execution providers (CUDA, TensorRT, CoreML, OpenVINO, DirectML, etc.)
  • 🏗️ Unified API: A single Model trait for inference via run()/forward()/encode_images()/encode_texts(), with a unified Y output
  • 📥 Auto-Management: Automatic model download (HuggingFace/GitHub), caching and path resolution
  • 📦 Multiple Inputs: Image, directory, video, webcam, stream and combinations
  • 🎯 Precision Support: FP32, FP16, INT8, UINT8, Q4, Q4F16, BNB4, and more
  • 🛠️ Full-Stack Suite: DataLoader, Annotator, and Viewer for complete workflows
  • 🌱 Model Ecosystem: 50+ SOTA vision and VLM models

🚀 Quick Start

Run the YOLO-Series demo to explore models with different tasks, precision and execution providers:

  • Tasks: detect, segment, pose, classify, obb
  • Versions: YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLOv9, YOLOv10, YOLO11, YOLOv12, YOLOv13, YOLO26
  • Scales: n, s, m, l, x
  • Precision: fp32, fp16, q8, q4, q4f16, bnb4
  • Execution Providers: CPU, CUDA, TensorRT, TensorRT-RTX, CoreML, OpenVINO, and more

Examples

# CPU: Object detection with YOLO26n (FP16)
cargo run -r --example yolo -- --task detect --ver 26 --scale n --dtype fp16

# CUDA model + CPU processor: Instance segmentation with YOLO11m
cargo run -r -F cuda --example yolo -- --task segment --ver 11 --scale m --device cuda:0 --processor-device cpu

# CUDA model + CUDA processor: Pose estimation with YOLOv8s
cargo run -r -F cuda-full --example yolo -- --task pose --ver 8 --scale s --device cuda:0 --processor-device cuda:0

# TensorRT model + CPU processor
cargo run -r -F tensorrt --example yolo -- --device tensorrt:0 --processor-device cpu

# TensorRT model + CUDA processor (CUDA 12.4)
cargo run -r -F tensorrt-cuda-12040 --example yolo -- --device tensorrt:0 --processor-device cuda:0

# TensorRT-RTX model + CUDA processor
cargo run -r -F nvrtx-full --example yolo -- --device nvrtx:0 --processor-device cuda:0

# TensorRT-RTX model + CPU processor
cargo run -r -F nvrtx --example yolo -- --device nvrtx:0

# Apple Silicon CoreML
cargo run -r -F coreml --example yolo -- --device coreml

# Intel OpenVINO (CPU/GPU/VPU)
cargo run -r -F openvino -F ort-load-dynamic --example yolo -- --device openvino:CPU

# Show all available options
cargo run -r --example yolo -- --help

See YOLO Examples for more details and use cases.

See Device Combination Guide for feature and device configurations.
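
When sweeping several configurations of the demo, it can help to assemble the command line programmatically. The sketch below is a hypothetical helper, not part of usls: `DemoConfig` and `demo_args` are names invented for illustration. It only uses the flags and values shown in the Quick Start list above, and it prints the command rather than running it.

```rust
/// Values listed in the Quick Start section above.
const TASKS: [&str; 5] = ["detect", "segment", "pose", "classify", "obb"];
const SCALES: [&str; 5] = ["n", "s", "m", "l", "x"];
const DTYPES: [&str; 6] = ["fp32", "fp16", "q8", "q4", "q4f16", "bnb4"];

/// Hypothetical helper type (not part of usls): one configuration of the YOLO demo.
struct DemoConfig<'a> {
    feature: Option<&'a str>, // cargo feature passed via -F, e.g. Some("cuda")
    task: &'a str,
    ver: &'a str,
    scale: &'a str,
    dtype: &'a str,
    device: &'a str,
    processor_device: &'a str,
}

/// Build the argument list for `cargo`, rejecting values not listed above.
fn demo_args(cfg: &DemoConfig) -> Result<Vec<String>, String> {
    if !TASKS.contains(&cfg.task) {
        return Err(format!("unknown task `{}`", cfg.task));
    }
    if !SCALES.contains(&cfg.scale) {
        return Err(format!("unknown scale `{}`", cfg.scale));
    }
    if !DTYPES.contains(&cfg.dtype) {
        return Err(format!("unknown dtype `{}`", cfg.dtype));
    }

    let mut args: Vec<String> = vec!["run".into(), "-r".into()];
    if let Some(feature) = cfg.feature {
        args.push("-F".into());
        args.push(feature.into());
    }
    for part in [
        "--example", "yolo", "--",
        "--task", cfg.task,
        "--ver", cfg.ver,
        "--scale", cfg.scale,
        "--dtype", cfg.dtype,
        "--device", cfg.device,
        "--processor-device", cfg.processor_device,
    ] {
        args.push(part.to_string());
    }
    Ok(args)
}

fn main() {
    let cfg = DemoConfig {
        feature: Some("cuda"),
        task: "segment",
        ver: "11",
        scale: "m",
        dtype: "fp16",
        device: "cuda:0",
        processor_device: "cpu",
    };
    match demo_args(&cfg) {
        Ok(args) => println!("cargo {}", args.join(" ")),
        Err(e) => eprintln!("invalid config: {e}"),
    }
}
```

The version string is passed through unchecked; the validation covers only the task, scale, and precision lists given above.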

Performance

Environment: NVIDIA RTX 3060Ti (TensorRT-10.11.0.33, CUDA 12.8, TensorRT-RTX-1.3.0.35) / Intel i5-12400F

Setup: YOLO26n, COCO2017 validation set (5,000 images), Resolution: 640x640, Conf thresholds: [0.35, 0.3, ..]

Results are for rough reference only.

| EP | Image Processor | DType | Batch | Preprocess | Inference | Postprocess | Total |
| --- | --- | --- | --- | --- | --- | --- | --- |
| TensorRT | CUDA | FP16 | 1 | ~233µs | ~1.3ms | ~14µs | ~1.55ms |
| TensorRT-RTX | CUDA | FP32 | 1 | ~233µs | ~2.0ms | ~10µs | ~2.24ms |
| TensorRT-RTX | CUDA | FP16 | 1 | | | | |
| CUDA | CUDA | FP32 | 1 | ~233µs | ~5.0ms | ~17µs | ~5.25ms |
| CUDA | CUDA | FP16 | 1 | ~233µs | ~3.6ms | ~17µs | ~3.85ms |
| CUDA | CPU | FP32 | 1 | ~800µs | ~6.5ms | ~14µs | ~7.31ms |
| CUDA | CPU | FP16 | 1 | ~800µs | ~5.0ms | ~14µs | ~5.81ms |
| CPU | CPU | FP32 | 1 | ~970µs | ~20.5ms | ~14µs | ~21.48ms |
| CPU | CPU | FP16 | 1 | ~970µs | ~25.0ms | ~14µs | ~25.98ms |
| TensorRT | CUDA | FP16 | 8 | ~1.2ms | ~6.0ms | ~55µs | ~7.26ms |
| TensorRT | CPU | FP16 | 8 | ~18.0ms | ~25.5ms | ~55µs | ~43.56ms |
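
The Total column is the sum of the three stage timings. The sketch below reproduces that arithmetic for a few rows and derives a rough throughput figure. It assumes the timings are per batch and that the stages run back to back, neither of which the table states; `Row` and `row` are names invented for this example.

```rust
use std::time::Duration;

/// Per-stage latency for one row of the benchmark table.
struct Row {
    label: &'static str,
    batch: u32,
    preprocess: Duration,
    inference: Duration,
    postprocess: Duration,
}

impl Row {
    /// Sum of the three stages, matching the table's Total column.
    fn total(&self) -> Duration {
        self.preprocess + self.inference + self.postprocess
    }

    /// Rough throughput, assuming the timings are per batch (an assumption).
    fn images_per_second(&self) -> f64 {
        self.batch as f64 / self.total().as_secs_f64()
    }
}

/// Convenience constructor taking stage times in microseconds.
fn row(label: &'static str, batch: u32, pre_us: u64, inf_us: u64, post_us: u64) -> Row {
    Row {
        label,
        batch,
        preprocess: Duration::from_micros(pre_us),
        inference: Duration::from_micros(inf_us),
        postprocess: Duration::from_micros(post_us),
    }
}

fn main() {
    // Stage timings copied from the table above.
    let rows = [
        row("TensorRT / CUDA / FP16 / batch 1", 1, 233, 1_300, 14),
        row("TensorRT / CUDA / FP16 / batch 8", 8, 1_200, 6_000, 55),
        row("CPU / CPU / FP32 / batch 1", 1, 970, 20_500, 14),
    ];
    for r in &rows {
        println!(
            "{:<34} total = {:>6.2} ms, ~{:.0} images/s",
            r.label,
            r.total().as_secs_f64() * 1e3,
            r.images_per_second()
        );
    }
}
```

As with the table itself, the derived throughput is only a rough reference.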

📦 Model Zoo

Status: ✅ Supported  |  ❓ Unknown  |  ❌ Not Supported For Now

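🔥 YOLO-Series

| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| YOLOv5 | Image Classification, Object Detection, Instance Segmentation | demo | | | | | | | |
| YOLOv6 | Object Detection | demo | | | | | | | |
| YOLOv7 | Object Detection | demo | | | | | | | |
| YOLOv8 | Object Detection, Instance Segmentation, Image Classification, Oriented Object Detection, Keypoint Detection | demo | | | | | | | |
| YOLO11 | Object Detection, Instance Segmentation, Image Classification, Oriented Object Detection, Keypoint Detection | demo | | | | | | | |
| YOLOv9 | Object Detection | demo | | | | | | | |
| YOLOv10 | Object Detection | demo | | | | | | | |
| YOLOv12 | Image Classification, Object Detection, Instance Segmentation | demo | | | | | | | |
| YOLOv13 | Object Detection | demo | | | | | | | |
| YOLO26 | Object Detection, Instance Segmentation, Image Classification, Oriented Object Detection, Keypoint Detection | demo | | | | | | | |

🏷️ Image Classification & Tagging

| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BEiT | Image Classification | demo | | | | | | | |
| ConvNeXt | Image Classification | demo | | | | | | | |
| FastViT | Image Classification | demo | | | | | | | |
| MobileOne | Image Classification | demo | | | | | | | |
| DeiT | Image Classification | demo | | | | | | | |
| RAM | Image Tagging | demo | | | | | | | |
| RAM++ | Image Tagging | demo | | | | | | | |

🎯 Object Detection

| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RT-DETRv1 | Object Detection | demo | | | | | | | |
| RT-DETRv2 | Object Detection | demo | | | | | | | |
| RT-DETRv4 | Object Detection | demo | | | | | | | |
| RF-DETR | Object Detection | demo | | | | | | | |
| PP-PicoDet | Object Detection | demo | | | | | | | |
| D-FINE | Object Detection | demo | | | | | | | |
| DEIM | Object Detection | demo | | | | | | | |
| DEIMv2 | Object Detection | demo | | | | | | | |

🎨 Image Segmentation

| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SAM | Segment Anything | demo | | | | | | | |
| SAM-HQ | Segment Anything | demo | | | | | | | |
| MobileSAM | Segment Anything | demo | | | | | | | |
| EdgeSAM | Segment Anything | demo | | | | | | | |
| FastSAM | Instance Segmentation | demo | | | | | | | |
| SAM2 | Segment Anything | demo | | | | | | | |
| SAM3-Tracker | Segment Anything | demo | | | | | | | |

🗺️ Open-Set Detection & Segmentation

| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GroundingDINO | Open-Set Detection With Language | demo | | | | | | | |
| MM-GDINO | Open-Set Detection With Language | demo | | | | | | | |
| LLMDet | Open-Set Detection With Language | demo | | | | | | | |
| OWLv2 | Open-Set Object Detection | demo | | | | | | | |
| YOLO-World | Open-Set Detection With Language | demo | | | | | | | |
| YOLOE | Open-Set Detection And Segmentation | demo | | | | | | | |
| SAM3-Image | Open-Set Detection And Segmentation | demo | | | | | | | |

✨ Background Removal

| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RMBG | Image Segmentation, Background Removal | demo | | | | | | | |
| BEN2 | Image Segmentation, Background Removal | demo | | | | | | | |

🏃 Multi-Object Tracking

| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ByteTrack | Multi-Object Tracking | demo | | | | | | | |

💎 Image Super-Resolution

| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Swin2SR | Image Restoration | demo | | | | | | | |
| APISR | Anime Super-Resolution | demo | | | | | | | |

✂️ Image Matting

| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MODNet | Image Matting | demo | | | | | | | |
| MediaPipe Selfie | Image Segmentation | demo | | | | | | | |

🤸 Pose Estimation

| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RTMPose | Keypoint Detection | demo | | | | | | | |
| DWPose | Keypoint Detection | demo | | | | | | | |
| RTMW | Keypoint Detection | demo | | | | | | | |
| RTMO | Keypoint Detection | demo | | | | | | | |

🔍 OCR & Document Understanding

| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DB | Text Detection | demo | | | | | | | |
| FAST | Text Detection | demo | | | | | | | |
| LinkNet | Text Detection | demo | | | | | | | |
| SVTR | Text Recognition | demo | | | | | | | |
| TrOCR | Text Recognition | demo | | | | | | | |
| SLANet | Table Recognition | demo | | | | | | | |
| DocLayout-YOLO | Object Detection | demo | | | | | | | |

🧩 Vision-Language Models (VLM)

| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BLIP | Image Captioning | demo | | | | | | | |
| Florence2 | A Variety of Vision Tasks | demo | | | | | | | |
| Moondream2 | Open-Set Object Detection, Open-Set Keypoints Detection, Image Captioning, Visual Question Answering | demo | | | | | | | |
| SmolVLM | Visual Question Answering | demo | | | | | | | |
| SmolVLM2 | Visual Question Answering | demo | | | | | | | |
| FastVLM | Vision Language Models | demo | | | | | | | |

🧬 Embedding Model

| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CLIP | Vision-Language Embedding | demo | | | | | | | |
| jina-clip-v1 | Vision-Language Embedding | demo | | | | | | | |
| jina-clip-v2 | Vision-Language Embedding | demo | | | | | | | |
| mobileclip | Vision-Language Embedding | demo | | | | | | | |
| DINOv2 | Vision Embedding | demo | | | | | | | |
| DINOv3 | Vision Embedding | demo | | | | | | | |

📐 Depth Estimation

| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DepthAnything v1 | Monocular Depth Estimation | demo | | | | | | | |
| DepthAnything v2 | Monocular Depth Estimation | demo | | | | | | | |
| DepthPro | Monocular Depth Estimation | demo | | | | | | | |
| Depth-Anything-3 | Monocular, Metric, Multi-View | demo | | | | | | | |

🌌 Others

| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Sapiens | Foundation for Human Vision Models | demo | | | | | | | |
| YOLOPv2 | Panoptic Driving | demo | | | | | | | |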

Documentation

🔧 Cargo Features

❕ Features in italics are enabled by default.

  • Core & Utilities

    • ort-download-binaries: Automatically download prebuilt ONNX Runtime binaries from pyke.
    • ort-load-dynamic: Manually link ONNX Runtime. Useful for custom builds or unsupported platforms. See Linking Guide for more details.
    • viewer: Real-time image/video visualization (similar to OpenCV imshow). Empowered by minifb.
    • video: Video I/O support for reading and writing video streams. Empowered by video-rs.
    • hf-hub: Download model files from Hugging Face Hub.
    • annotator: Annotation utilities for drawing bounding boxes, keypoints, and masks on images.
  • Image Formats

    Additional image format support (optional for faster compilation):

    • image-all-formats: Enable all additional image formats.
    • image-gif, image-bmp, image-ico, image-avif, image-tiff, image-dds, image-exr, image-ff, image-hdr, image-pnm, image-qoi, image-tga: Individual image format support.
  • Model Categories

    • vision: Core vision models (Detection, Segmentation, Classification, Pose, etc.).
    • vlm: Vision-Language Models (CLIP, BLIP, Florence2, etc.).
    • mot: Multi-Object Tracking utilities.
    • all-models: Enable all model categories.
  • Execution Providers

    Hardware acceleration for inference. Enable the one matching your hardware:

    • cuda: NVIDIA CUDA execution provider (pure model inference acceleration).
    • tensorrt: NVIDIA TensorRT execution provider (pure model inference acceleration).
    • nvrtx: NVIDIA NvTensorRT-RTX execution provider (pure model inference acceleration).
    • cuda-full: cuda + cuda-runtime-build (Model + Image Preprocessing acceleration).
    • tensorrt-full: tensorrt + cuda-runtime-build (Model + Image Preprocessing acceleration).
    • nvrtx-full: nvrtx + cuda-runtime-build (Model + Image Preprocessing acceleration).
    • coreml: Apple Silicon (macOS/iOS).
    • openvino: Intel CPU/GPU/VPU.
    • onednn: Intel Deep Neural Network Library.
    • directml: DirectML (Windows).
    • webgpu: WebGPU (Web/Chrome).
    • rocm: AMD GPU acceleration.
    • cann: Huawei Ascend NPU.
    • rknpu: Rockchip NPU.
    • xnnpack: Mobile CPU optimization.
    • acl: Arm Compute Library.
    • armnn: Arm Neural Network SDK.
    • azure: Azure ML execution provider.
    • migraphx: AMD MIGraphX.
    • nnapi: Android Neural Networks API.
    • qnn: Qualcomm AI Engine Direct (QNN).
    • tvm: Apache TVM.
    • vitis: Xilinx Vitis AI.
  • CUDA Support

    NVIDIA GPU acceleration with CUDA image processing kernels (requires cudarc):

    • cuda-full: Uses cuda-version-from-build-system (auto-detects via nvcc).
    • cuda-11040, cuda-11050, cuda-11060, cuda-11070, cuda-11080: CUDA 11.x versions (Model + Preprocess).
    • cuda-12000, cuda-12010, cuda-12020, cuda-12030, cuda-12040, cuda-12050, cuda-12060, cuda-12080, cuda-12090: CUDA 12.x versions (Model + Preprocess).
    • cuda-13000, cuda-13010: CUDA 13.x versions (Model + Preprocess).
  • TensorRT Support

    NVIDIA TensorRT execution provider with CUDA runtime libraries:

    • tensorrt-full: Uses cuda-version-from-build-system (auto-detects via nvcc).
    • tensorrt-cuda-11040, tensorrt-cuda-11050, tensorrt-cuda-11060, tensorrt-cuda-11070, tensorrt-cuda-11080: TensorRT + CUDA 11.x runtime.
    • tensorrt-cuda-12000, tensorrt-cuda-12010, tensorrt-cuda-12020, tensorrt-cuda-12030, tensorrt-cuda-12040, tensorrt-cuda-12050, tensorrt-cuda-12060, tensorrt-cuda-12080, tensorrt-cuda-12090: TensorRT + CUDA 12.x runtime.
    • tensorrt-cuda-13000, tensorrt-cuda-13010: TensorRT + CUDA 13.x runtime.

    Note: tensorrt-cuda-* features enable the TensorRT execution provider together with CUDA runtime libraries for image processing. The "cuda" in the name refers to the cudarc dependency.

  • NVRTX Support

    NVIDIA NvTensorRT-RTX execution provider with CUDA runtime libraries:

    • nvrtx-full: Uses cuda-version-from-build-system (auto-detects via nvcc).
    • nvrtx-cuda-11040, nvrtx-cuda-11050, nvrtx-cuda-11060, nvrtx-cuda-11070, nvrtx-cuda-11080: NVRTX + CUDA 11.x runtime.
    • nvrtx-cuda-12000, nvrtx-cuda-12010, nvrtx-cuda-12020, nvrtx-cuda-12030, nvrtx-cuda-12040, nvrtx-cuda-12050, nvrtx-cuda-12060, nvrtx-cuda-12080, nvrtx-cuda-12090: NVRTX + CUDA 12.x runtime.
    • nvrtx-cuda-13000, nvrtx-cuda-13010: NVRTX + CUDA 13.x runtime.

    Note: nvrtx-cuda-* features enable the NVRTX execution provider together with CUDA runtime libraries for image processing. The "cuda" in the name refers to the cudarc dependency.
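
The versioned feature names above appear to encode the CUDA version as major, two-digit minor, and a trailing zero; the Quick Start pairs tensorrt-cuda-12040 with CUDA 12.4. The sketch below builds a feature name under that assumed encoding and rejects versions that are not in the lists above. `Family`, `LISTED_CUDA`, and `versioned_feature` are names invented for this example, not part of usls.

```rust
/// CUDA versions that appear in the feature lists above, as (major, minor).
const LISTED_CUDA: [(u32, u32); 16] = [
    (11, 4), (11, 5), (11, 6), (11, 7), (11, 8),
    (12, 0), (12, 1), (12, 2), (12, 3), (12, 4),
    (12, 5), (12, 6), (12, 8), (12, 9),
    (13, 0), (13, 1),
];

/// Which feature family to build the name for.
#[derive(Clone, Copy)]
enum Family {
    Cuda,
    TensorRt,
    Nvrtx,
}

impl Family {
    /// Feature-name prefix used in the lists above.
    fn prefix(self) -> &'static str {
        match self {
            Family::Cuda => "cuda",
            Family::TensorRt => "tensorrt-cuda",
            Family::Nvrtx => "nvrtx-cuda",
        }
    }
}

/// Build e.g. "tensorrt-cuda-12040" for TensorRT + CUDA 12.4.
/// Returns None when the version is not among the listed features.
/// The digit encoding is inferred from the listed names (an assumption).
fn versioned_feature(family: Family, major: u32, minor: u32) -> Option<String> {
    if !LISTED_CUDA.contains(&(major, minor)) {
        return None;
    }
    Some(format!("{}-{}{:02}0", family.prefix(), major, minor))
}

fn main() {
    let queries = [
        (Family::Cuda, 12, 4),
        (Family::TensorRt, 11, 8),
        (Family::Nvrtx, 12, 7), // 12.7 is not in the lists above
    ];
    for (family, major, minor) in queries {
        match versioned_feature(family, major, minor) {
            Some(name) => println!("CUDA {major}.{minor}: -F {name}"),
            None => println!("CUDA {major}.{minor}: not in the listed features"),
        }
    }
}
```

When nvcc is available, the *-full features avoid this lookup entirely, since they detect the CUDA version from the build system.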


🚀 Device Combination Guide

| Scenario | Model Device (--device) | Processor Device (--processor-device) | Required Features (-F) |
| --- | --- | --- | --- |
| CPU Only | cpu | cpu | vision (default) |
| GPU Inference (Slow Preprocess) | cuda | cpu | cuda |
| GPU Inference (Fast Preprocess) | cuda | cuda | cuda-full or cuda-120xx |
| TensorRT (Slow Preprocess) | tensorrt | cpu | tensorrt |
| TensorRT (Fast Preprocess) | tensorrt | cuda | tensorrt-full or tensorrt-cuda-120xx |

⚠️ In multi-GPU environments (e.g., cuda:0, cuda:1), you MUST ensure that both --device and --processor-device use the SAME GPU ID.
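
The combinations and the same-GPU rule above can be expressed as a small lookup. This is a sketch only: `split_device` and `required_feature` are hypothetical names, the nvrtx rows are taken from the Quick Start commands rather than the guide, and treating an omitted device id as 0 is an assumption.

```rust
/// Split "cuda:0" into ("cuda", "0"). A spec without an id, such as "cpu",
/// is given the id "0" (an assumption made for this sketch).
fn split_device(spec: &str) -> (&str, &str) {
    spec.split_once(':').unwrap_or((spec, "0"))
}

/// Return the cargo feature for a --device / --processor-device pair,
/// or an error if the pair is not covered or the GPU ids differ.
fn required_feature(device: &str, processor: &str) -> Result<&'static str, String> {
    let (model_kind, model_id) = split_device(device);
    let (proc_kind, proc_id) = split_device(processor);

    let feature = match (model_kind, proc_kind) {
        ("cpu", "cpu") => "vision",
        ("cuda", "cpu") => "cuda",
        ("cuda", "cuda") => "cuda-full",
        ("tensorrt", "cpu") => "tensorrt",
        ("tensorrt", "cuda") => "tensorrt-full",
        ("nvrtx", "cpu") => "nvrtx",
        ("nvrtx", "cuda") => "nvrtx-full",
        _ => {
            return Err(format!(
                "combination `{device}` + `{processor}` is not covered by this sketch"
            ))
        }
    };

    // When preprocessing also runs on a GPU, both sides must share the GPU id.
    if proc_kind == "cuda" && model_id != proc_id {
        return Err(format!(
            "GPU id mismatch: --device {device} vs --processor-device {processor}"
        ));
    }
    Ok(feature)
}

fn main() {
    let pairs = [
        ("cpu", "cpu"),
        ("cuda:0", "cpu"),
        ("tensorrt:0", "cuda:0"),
        ("cuda:0", "cuda:1"),
    ];
    for (device, processor) in pairs {
        match required_feature(device, processor) {
            Ok(feature) => println!("{device} + {processor}: -F {feature}"),
            Err(e) => println!("{device} + {processor}: {e}"),
        }
    }
}
```

The *-full features could equally be swapped for a versioned feature such as cuda-12040 when the CUDA version should be pinned.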


❓ FAQ

  • ONNX Runtime Issues: For ONNX Runtime related errors, please check the ort issues or onnxruntime issues.
  • Other Issues: For other questions or bug reports, see issues or open a new discussion.

⚠️ Compatibility Note

If you encounter linking errors with __isoc23_strtoll or similar glibc symbols, use the dynamic loading feature:

cargo run -F ort-load-dynamic --example <EXAMPLE>

Why no LM models?

This project focuses on vision and VLM models under 1B parameters for efficient inference.

Many high-performance inference engines, such as vLLM, already exist for LM/LLM models.

Pure text embedding models may be considered in future releases.

How fast is it?

Refer to YOLO performance benchmarks in the Performance section above.

This project uses multi-threading, SIMD, and CUDA hardware acceleration for optimization.

While vision models like YOLO and RF-DETR are optimized, other models may need further interface and post-processing optimization.

🤝 Contributing

This is a personal project maintained in spare time, so progress on performance optimization and new model support may vary.

PRs for model optimization are very welcome! If you have expertise in specific models and can help optimize their interfaces or post-processing, your contributions would be invaluable. Feel free to open an issue or submit a pull request for suggestions, bug reports, or new features.

🙏 Acknowledgments

Thanks to all the open-source libraries and their maintainers that make this project possible. See Cargo.toml for a complete list of dependencies.

📜 License

This project is licensed under the terms in the LICENSE file.