
Data Types (DType) and Quantization

usls provides ONNX models at multiple precisions for each algorithm. The DType enum selects which model variant to load and how the execution provider (EP) is configured.

Quick Reference

| DType | Precision | Description |
|-------|-----------|-------------|
| fp32 | 32-bit float | Original precision, exported from PyTorch/TensorFlow |
| fp16 | 16-bit float | Mixed precision fp16/fp32 |
| q8 | 8-bit | Dynamic quantization via onnxruntime.quantization.quantize_dynamic. Auto-selects int8/uint8 by operator (Conv, GroupQueryAttention, or MultiHeadAttention present → uint8; otherwise int8) |
| int8 | 8-bit | Static quantization for the TensorRT EP via NVIDIA Model-Optimizer. Note: may fail on non-TensorRT EPs |
| s8s8 | 8-bit | Static quantization (signed weights, signed activations) via onnxruntime.quantization.quantize_static |
| s8u8 | 8-bit | Static quantization (signed weights, unsigned activations) |
| u8u8 | 8-bit | Static quantization (unsigned weights, unsigned activations) |
| q4 | 4-bit | Float32 to 4-bit int via onnxruntime.quantization.matmul_nbits_quantizer.MatMulNBitsQuantizer |
| q4f16 | 4-bit + 16-bit float | q4 with a cast to fp16 |
| bnb4 | 4-bit | 4-bit FP4/NF4 quantization via onnxruntime.quantization.matmul_bnb4_quantizer.MatMulBnb4Quantizer |
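The q8 and s8s8/s8u8/u8u8 rows correspond to ONNX Runtime's dynamic and static quantization APIs. A minimal sketch of both calls follows; the file paths and the calibration reader are placeholders you must supply for your own model, not names defined by usls:

```python
# Sketch of the quantization calls named in the table above.
# File paths and the calibration samples are placeholders.
from onnxruntime.quantization import (
    CalibrationDataReader,
    QuantType,
    quantize_dynamic,
    quantize_static,
)

# q8: dynamic quantization -- weights are quantized offline,
# activations are quantized on the fly at inference time.
quantize_dynamic(
    model_input="model-fp32.onnx",
    model_output="model-q8.onnx",
)

# Static quantization needs a reader that yields representative
# model inputs for calibration (left as a stub here).
class MyCalibrationReader(CalibrationDataReader):
    def __init__(self, samples):
        self._iter = iter(samples)

    def get_next(self):
        # Return a dict of {input_name: ndarray}, or None when done.
        return next(self._iter, None)

# s8u8: signed (int8) weights, unsigned (uint8) activations.
quantize_static(
    model_input="model-fp32.onnx",
    model_output="model-s8u8.onnx",
    calibration_data_reader=MyCalibrationReader(samples=[]),
    weight_type=QuantType.QInt8,
    activation_type=QuantType.QUInt8,
)
```

Swapping both types to QuantType.QInt8 yields the s8s8 scheme, and both to QuantType.QUInt8 yields u8u8.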

Basic Usage

Global (All Modules)

let config = Config::sam3()
    .with_dtype_all(DType::Fp16)
    .commit()?;

Per-Module

let config = Config::clip()
    // Fast visual encoding
    .with_visual_dtype(DType::Fp16)
    // Accurate text encoding
    .with_textual_dtype(DType::Fp32)
    .commit()?;

Model Availability

usls does not provide all quantization variants for every model. Refer to the quantization schemes in this guide to create your own quantized models as needed.
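For instance, a q4 variant can be produced with the MatMulNBitsQuantizer mentioned in the table. This is a sketch assuming a recent onnxruntime that exposes that module; file names and the block-size choice are illustrative, not usls conventions:

```python
# Sketch: 4-bit weight-only quantization of MatMul nodes (the "q4" scheme).
# File names are placeholders; requires an onnxruntime version that
# ships onnxruntime.quantization.matmul_nbits_quantizer.
import onnx
from onnxruntime.quantization.matmul_nbits_quantizer import MatMulNBitsQuantizer

model = onnx.load("model-fp32.onnx")
quantizer = MatMulNBitsQuantizer(
    model,
    block_size=32,      # weights quantized in blocks of 32 values
    is_symmetric=True,  # symmetric 4-bit quantization
)
quantizer.process()
quantizer.model.save_model_to_file(
    "model-q4.onnx",
    use_external_data_format=True,  # keep large weights outside the .onnx file
)
```

Casting the result to fp16 afterwards gives the q4f16 variant described above.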


TensorRT EP

Use FP16

The TensorRT EP converts FP32 models to FP16 automatically at engine-build time, so for optimal speed load the FP32 model rather than a pre-converted fp16 one:

Config::default().with_model_file("<fp32-model>.onnx")

Or use --dtype fp32 in usls examples.

Use int8/fp8/int4

Method 1: NVIDIA Model-Optimizer

Method 2: ONNX Runtime Quantization

  • Step 1: Provide an FP32 ONNX model and a calibration table
  • Step 2: Enable INT8 calibration on the TensorRT EP:
Config::default()
    .with_<module>_int8(true)
    .with_<module>_int8_calibration_table_name()
    .with_<module>_int8_use_native_calibration_table()

Data Type Selection & Method Selection

References