Getting Started¶
Welcome to usls!
This guide will help you get up and running with the library in just a few minutes.
🚀 Start with YOLO Demo¶
Let's run the YOLO-Series demo to explore models with different tasks, precision and execution providers:
- **Tasks**: `detect`, `segment`, `pose`, `classify`, `obb`
- **Versions**: `5`, `6`, `7`, `8`, `9`, `10`, `11`, `12`, `13`, `26`
- **Scales**: `n`, `s`, `m`, `l`, `x`
- **Precision (DType)**: `fp32`, `fp16`, `q8`, `q4`, `q4f16`, `bnb4`
- **Devices**: `cpu`, `cuda:0`, `tensorrt:0`, `coreml`, `openvino:CPU`
First, clone the repository and navigate to the project root:

```shell
git clone https://github.com/jamjamjon/usls.git
cd usls
```
Then, run the demo:

```shell
# Object detection with YOLO26n (FP16)
cargo run -r --example yolo -- --task detect --ver 26 --scale n --dtype fp16

# Instance segmentation on CUDA (requires the "cuda-full" feature)
cargo run -r -F cuda-full --example yolo -- --task segment --ver 11 --scale m --device cuda:0 --processor-device cuda:0

# TensorRT (requires the "tensorrt-full" feature)
cargo run -r -F tensorrt-full --example yolo -- --device tensorrt:0 --processor-device cuda:0

# CoreML (requires the "coreml" feature)
cargo run -r -F coreml --example yolo -- --device coreml
```
For a full list of options, run:

```shell
cargo run -r --example yolo -- --help
```
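The option axes above compose freely. As an illustrative sketch (not part of the usls API), a tiny helper that assembles the demo's argument list from those axes, using exactly the flag names shown in the commands above:

```rust
/// Build the argument list passed after `cargo run -r --example yolo --`,
/// from the option axes listed above. A hypothetical helper for illustration;
/// only the flag names (--task, --ver, --scale, --dtype, --device) come from this guide.
fn yolo_args(task: &str, ver: &str, scale: &str, dtype: &str, device: &str) -> Vec<String> {
    ["--task", task, "--ver", ver, "--scale", scale, "--dtype", dtype, "--device", device]
        .iter()
        .map(|s| s.to_string())
        .collect()
}

fn main() {
    // e.g. oriented bounding boxes with YOLO11s, 8-bit weights, on CPU
    let args = yolo_args("obb", "11", "s", "q8", "cpu");
    println!("{}", args.join(" "));
    // → --task obb --ver 11 --scale s --dtype q8 --device cpu
}
```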
📊 Performance Reference¶
Environment: NVIDIA RTX 3060Ti (CUDA 12.8) / Intel i5-12400F
Setup: YOLO26n, 640x640 resolution, COCO2017 val set (5,000 images)
| EP | Image Processor | DType | Batch | Preprocess | Inference | Postprocess | Total |
|---|---|---|---|---|---|---|---|
| TensorRT | CUDA | FP16 | 1 | ~233µs | ~1.3ms | ~14µs | ~1.55ms |
| TensorRT-RTX | CUDA | FP32 | 1 | ~233µs | ~2.0ms | ~10µs | ~2.24ms |
| TensorRT-RTX | CUDA | FP16 | 1 | ❓ | ❓ | ❓ | ❓ |
| CUDA | CUDA | FP32 | 1 | ~233µs | ~5.0ms | ~17µs | ~5.25ms |
| CUDA | CUDA | FP16 | 1 | ~233µs | ~3.6ms | ~17µs | ~3.85ms |
| CUDA | CPU | FP32 | 1 | ~800µs | ~6.5ms | ~14µs | ~7.31ms |
| CUDA | CPU | FP16 | 1 | ~800µs | ~5.0ms | ~14µs | ~5.81ms |
| CPU | CPU | FP32 | 1 | ~970µs | ~20.5ms | ~14µs | ~21.48ms |
| CPU | CPU | FP16 | 1 | ~970µs | ~25.0ms | ~14µs | ~25.98ms |
| TensorRT | CUDA | FP16 | 8 | ~1.2ms | ~6.0ms | ~55µs | ~7.26ms |
| TensorRT | CPU | FP16 | 8 | ~18.0ms | ~25.5ms | ~55µs | ~43.56ms |
**Multi-Batch Performance**

With a larger batch size (e.g., batch 8), the CUDA image processor significantly improves throughput on GPUs: preprocessing 8 images takes ~1.2ms on CUDA versus ~18.0ms on CPU.
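To make the batching benefit concrete, a quick throughput calculation from the table's TensorRT + CUDA-processor FP16 rows (batch 1 at ~1.55ms total, batch 8 at ~7.26ms total):

```rust
/// Images per second for a given batch size and end-to-end latency in ms,
/// using the "Total" column from the table above.
fn throughput(batch: f64, total_ms: f64) -> f64 {
    batch / (total_ms / 1000.0)
}

fn main() {
    // TensorRT EP + CUDA image processor, FP16:
    let single = throughput(1.0, 1.55);  // ≈ 645 images/s
    let batched = throughput(8.0, 7.26); // ≈ 1102 images/s
    println!("batch 1: {single:.0} img/s, batch 8: {batched:.0} img/s");
}
```

Batching 8 images roughly 1.7× the single-image throughput here, because preprocess and inference amortize across the batch.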
Next Steps¶
- **Installation**: Install usls in your own project
- **Integration**: Learn how to integrate usls into your code