# RTMO-s (body7) — acaua mirror (pure-PyTorch port)
This is a pure-PyTorch port of RTMO-s hosted under CondadosAI/ for use with the acaua computer vision library.
RTMO (Lu et al., CVPR 2024) is a one-stage real-time multi-person pose estimator that integrates coordinate classification into a YOLO-style architecture. This variant was trained on the body7 composite dataset (COCO + AI Challenger + CrowdPose + MPII + sub-JHMDB + Halpe + PoseTrack18), producing a 17-keypoint COCO-schema skeleton.
The architecture has been re-implemented in pure PyTorch under acaua.adapters.rtmo — no mmcv, no mmengine, no mmpose, no trust_remote_code. The model.safetensors in this mirror is converted from the upstream .pth checkpoint to safetensors with the acaua adapter's state_dict key naming. It is NOT drop-in compatible with mmpose — weights are laid out to load cleanly into our nn.Module tree via load_state_dict(strict=True).
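As an illustration of what the conversion step produces, here is a minimal sketch of a `.pth`-to-adapter state_dict remap. The specific key handling (dropping EMA shadow weights, keeping `backbone`/`neck`/`head` prefixes) is an assumption for illustration; the real mapping lives in `scripts/convert_rtmo.py`.

```python
# Illustrative sketch of the checkpoint conversion. The key remapping rules
# below are ASSUMPTIONS; the authoritative logic is scripts/convert_rtmo.py.
def remap_state_dict(upstream: dict) -> dict:
    """Map mmpose checkpoint keys onto the acaua adapter's module tree."""
    remapped = {}
    for key, value in upstream.items():
        if key.startswith("ema_"):  # assumption: drop EMA shadow weights
            continue
        # assumption: backbone/neck/head prefixes carry over unchanged
        remapped[key] = value
    return remapped

# mmengine checkpoints wrap the weights under a "state_dict" entry:
checkpoint = {
    "state_dict": {
        "backbone.stem.conv.weight": "w",
        "ema_backbone.stem.conv.weight": "w",
    }
}
state = remap_state_dict(checkpoint["state_dict"])
print(sorted(state))  # ['backbone.stem.conv.weight']
```

The remapped dict is then written out with safetensors so it loads via `load_state_dict(strict=True)` without any mmpose code on the import path.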
## Provenance

| Field | Value |
|---|---|
| Upstream code | open-mmlab/mmpose @ 759b39c13fea6ba094afc1fa932f51dc1b11cbf9 (Apache-2.0) |
| Upstream weights URL | https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-s_8xb32-600e_body7-640x640-dac2bf74_20231211.pth |
| Upstream weights SHA256 | `dac2bf749bbfb51e69ca577ca0327dff4433e3be9a56b782f0b7ef94fb45247e` |
| Conversion script | scripts/convert_rtmo.py |
| Paper | Lu et al., "RTMO: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation", CVPR 2024, arXiv:2312.07526 |
| Mirrored on | 2026-04-22 |
| Mirrored by | CondadosAI/acaua |
## Usage

```python
import acaua

model = acaua.Model.from_pretrained("CondadosAI/rtmo_s_body7")
result = model.predict("image.jpg")

# result is a PoseResult:
#   result.boxes           -> (N, 4) float32, xyxy
#   result.labels          -> (N,) int64 (person = 0)
#   result.scores          -> (N,) float32
#   result.keypoints       -> (N, 17, 2) float32, xy in image pixels
#   result.keypoint_scores -> (N, 17) float32

# Skeleton edges + keypoint names live on the adapter:
import supervision as sv

kp = result.to_supervision()
sv.EdgeAnnotator(edges=model.skeleton).annotate(image, kp)
```
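The documented shapes compose directly downstream of `predict`. A self-contained NumPy sketch (dummy arrays standing in for a real `PoseResult`, so no model is needed) of thresholding keypoints by confidence before drawing:

```python
import numpy as np

# Dummy arrays with the PoseResult shapes documented above (NOT real output).
rng = np.random.default_rng(0)
N = 3                                                  # detected people
keypoints = rng.uniform(0, 640, (N, 17, 2)).astype(np.float32)
keypoint_scores = rng.uniform(0, 1, (N, 17)).astype(np.float32)

KPT_THR = 0.5                                          # confidence cutoff
visible = keypoint_scores >= KPT_THR                   # (N, 17) bool mask
xy_confident = keypoints[visible]                      # (M, 2) kept points
print(visible.shape, xy_confident.shape)
```

Boolean-mask indexing flattens the person/keypoint axes, which is convenient for drawing all confident points in one pass; keep the `(N, 17)` mask around if you need per-person grouping.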
## Architecture

- Backbone: CSPDarknet (YOLOX lineage), widen_factor=0.5, deepen_factor=0.33
- Neck: HybridEncoder (RT-DETR–style transformer encoder + FPN/PAN fusion), hidden_dim=256
- Head: RTMOHead with per-level YOLO-style box + visibility predictions and a Dynamic Coordinate Classifier (DCC) decoded via softmax expectation over 192 × 256 coordinate bins
- Parameters: ~9.87M
- Input: 640 × 640 letterboxed, RGB raw pixel values (no mean/std normalization, per upstream PoseDataPreprocessor)
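The coordinate-classification decoding named above can be sketched in a few lines: each keypoint coordinate is recovered as the softmax expectation over a 1-D set of bins. Bin count and bin placement here are illustrative only, not the exact DCC configuration (which predicts bins dynamically per instance).

```python
import torch

def decode_axis(logits: torch.Tensor, bin_centers: torch.Tensor) -> torch.Tensor:
    """Softmax-expectation decoding: logits (..., num_bins) -> coordinates (...)."""
    probs = logits.softmax(dim=-1)          # per-keypoint distribution over bins
    return (probs * bin_centers).sum(dim=-1)  # expected bin position, in pixels

num_bins = 192                               # illustrative bin count
bin_centers = torch.linspace(0.0, 640.0, num_bins)  # bin positions across width
logits = torch.zeros(17, num_bins)           # uniform logits for demonstration
x = decode_axis(logits, bin_centers)         # (17,); uniform -> mean of centers
print(x.shape, float(x[0]))
```

Because the expectation interpolates between bin centers, this decoding yields sub-bin (sub-pixel) precision, which is the main motivation for coordinate classification over direct regression.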
## Reported performance (upstream)
| Variant | Dataset | COCO val AP | COCO val AR | V100 FPS |
|---|---|---|---|---|
| RTMO-s | body7 | 68.6 | 74.3 | ~141 |
## License and attribution
Redistributed under Apache-2.0, consistent with the upstream code and weights declarations. The acaua adapter is itself a derivative work of the upstream PyTorch implementation — see NOTICE for the required attribution chain (code AND weights).
## Citation

```bibtex
@misc{lu2023rtmo,
  title={{RTMO}: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation},
  author={Peng Lu and Tao Jiang and Yining Li and Xiangtai Li and Kai Chen and Wenming Yang},
  year={2023},
  eprint={2312.07526},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```