Getting Started¶
Welcome to usls!
This guide will help you get up and running with the library in just a few minutes.
🚀 Start with YOLO Demo¶
Let's run the YOLO-Series demo to explore models with different tasks, precision and execution providers:
- **Tasks**: `detect`, `segment`, `pose`, `classify`, `obb`
- **Versions**: `5`, `6`, `7`, `8`, `9`, `10`, `11`, `12`, `13`, `26`
- **Scales**: `n`, `s`, `m`, `l`, `x`
- **Precision (DType)**: `fp32`, `fp16`, `q8`, `q4`, `q4f16`, `bnb4`
- **Devices**: `cpu`, `cuda:0`, `tensorrt:0`, `coreml`, `openvino:CPU`
First, clone the repository and navigate to the project root:

```shell
git clone https://github.com/jamjamjon/usls.git
cd usls
```
Then, run the demo:

```shell
# Object detection with YOLO26n (FP16)
cargo run -r --example yolo -- --task detect --ver 26 --scale n --dtype fp16

# Instance segmentation on CUDA (requires the "cuda-full" feature)
cargo run -r -F cuda-full --example yolo -- --task segment --ver 11 --scale m --device cuda:0 --processor-device cuda:0

# TensorRT (requires the "tensorrt-full" feature)
cargo run -r -F tensorrt-full --example yolo -- --device tensorrt:0 --processor-device cuda:0

# CoreML (requires the "coreml" feature)
cargo run -r -F coreml --example yolo -- --device coreml
```
For a full list of options, run:

```shell
cargo run -r --example yolo -- --help
```
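The option axes above compose freely. As an illustrative sketch (not part of the usls API), a tiny helper that assembles the demo's argument list from those axes, using exactly the flag names shown in the commands above:

```rust
/// Build the argument list passed after `cargo run -r --example yolo --`,
/// from the option axes listed above. A hypothetical helper for illustration;
/// only the flag names (--task, --ver, --scale, --dtype, --device) come from this guide.
fn yolo_args(task: &str, ver: &str, scale: &str, dtype: &str, device: &str) -> Vec<String> {
    ["--task", task, "--ver", ver, "--scale", scale, "--dtype", dtype, "--device", device]
        .iter()
        .map(|s| s.to_string())
        .collect()
}

fn main() {
    // e.g. oriented bounding boxes with YOLO11s, 8-bit weights, on CPU
    let args = yolo_args("obb", "11", "s", "q8", "cpu");
    println!("{}", args.join(" "));
    // → --task obb --ver 11 --scale s --dtype q8 --device cpu
}
```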
📊 Performance Reference¶
Environment: NVIDIA RTX 3060Ti (CUDA 12.8) / Intel i5-12400F
Setup: YOLO26n, 640x640 resolution, COCO2017 val set (5,000 images)
| EP | Image Processor | DType | Batch | Preprocess | Inference | Postprocess | Total |
|---|---|---|---|---|---|---|---|
| TensorRT | CUDA | FP16 | 1 | ~233µs | ~1.3ms | ~14µs | ~1.55ms |
| TensorRT-RTX | CUDA | FP32 | 1 | ~233µs | ~2.0ms | ~10µs | ~2.24ms |
| TensorRT-RTX | CUDA | FP16 | 1 | ❓ | ❓ | ❓ | ❓ |
| CUDA | CUDA | FP32 | 1 | ~233µs | ~5.0ms | ~17µs | ~5.25ms |
| CUDA | CUDA | FP16 | 1 | ~233µs | ~3.6ms | ~17µs | ~3.85ms |
| CUDA | CPU | FP32 | 1 | ~800µs | ~6.5ms | ~14µs | ~7.31ms |
| CUDA | CPU | FP16 | 1 | ~800µs | ~5.0ms | ~14µs | ~5.81ms |
| CPU | CPU | FP32 | 1 | ~970µs | ~20.5ms | ~14µs | ~21.48ms |
| CPU | CPU | FP16 | 1 | ~970µs | ~25.0ms | ~14µs | ~25.98ms |
| TensorRT | CUDA | FP16 | 8 | ~1.2ms | ~6.0ms | ~55µs | ~7.26ms |
| TensorRT | CPU | FP16 | 8 | ~18.0ms | ~25.5ms | ~55µs | ~43.56ms |
**Multi-Batch Performance**

With a larger batch size (e.g., batch 8), the CUDA image processor significantly improves throughput on GPUs: preprocessing 8 images takes ~1.2ms on CUDA versus ~18.0ms on CPU.
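To make the batching benefit concrete, a quick throughput calculation from the table's TensorRT + CUDA-processor FP16 rows (batch 1 at ~1.55ms total, batch 8 at ~7.26ms total):

```rust
/// Images per second for a given batch size and end-to-end latency in ms,
/// using the "Total" column from the table above.
fn throughput(batch: f64, total_ms: f64) -> f64 {
    batch / (total_ms / 1000.0)
}

fn main() {
    // TensorRT EP + CUDA image processor, FP16:
    let single = throughput(1.0, 1.55);  // ≈ 645 images/s
    let batched = throughput(8.0, 7.26); // ≈ 1102 images/s
    println!("batch 1: {single:.0} img/s, batch 8: {batched:.0} img/s");
}
```

Batching 8 images roughly 1.7× the single-image throughput here, because preprocess and inference amortize across the batch.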
Next Steps¶
- **Installation**: Install usls in your own project
- **Integration**: Learn how to integrate usls into your code