Figure 1. Comparison of our method vs Puppeteer. Our method (top, blue) yields a complete, temporally consistent skeleton with smooth, coherent skinning weights, whereas Puppeteer (bottom, red) produces an incomplete skeleton with missing hand rigging and unstable, blocky skinning.
TL;DR
Because an animated sequence depicts the same underlying object, we impose a consistency prior while fine-tuning existing rigging models, enabling them to learn robust, pose-invariant rigs from abundant unlabeled data.
Abstract
State-of-the-art rigging methods assume a canonical rest pose—an assumption that fails for sequential data (e.g., animal motion capture or AIGC/video-derived mesh sequences) where no T-pose is available. Applied frame by frame, these methods are not pose-invariant and produce topological inconsistencies across frames.
We therefore propose SPRig, a general fine-tuning framework that enforces cross-frame consistency losses on top of existing models to learn pose-invariant rigs. We validate our approach using a new permutation-invariant stability protocol. Experiments demonstrate state-of-the-art temporal stability: our method produces coherent rigs from challenging sequences and dramatically reduces the artifacts that plague baseline methods.
Method Overview
Our key insight builds on a fundamental assumption in computer graphics: an animated sequence depicts a single object, which should therefore have a single, pose-invariant rig. We operationalize this assumption as a self-supervised signal: a canonical rig derived from an anchor frame serves as the target for fine-tuning the pre-trained model on every other frame.
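The geometry-space part of this self-supervised signal can be sketched minimally as a loss between joints predicted on a perturbed frame and the canonical anchor-frame skeleton. This is a hypothetical NumPy sketch: the joint count, the L2 form, and the omission of the token-space term are our assumptions, not the exact SPRig objective.

```python
import numpy as np

def geometry_consistency_loss(pred_joints, anchor_joints):
    """Mean L2 distance between joints predicted on a perturbed frame
    and the canonical anchor-frame skeleton (geometry-space term only;
    SPRig additionally uses a token-space term -- see Fig. 2)."""
    return float(np.mean(np.linalg.norm(pred_joints - anchor_joints, axis=-1)))

# toy example: 22-joint anchor skeleton vs. a uniformly jittered prediction
anchor = np.zeros((22, 3))
pred = anchor + 0.01 * np.ones((22, 3))   # hypothetical perturbed-frame output
loss = geometry_consistency_loss(pred, anchor)
```

Minimizing this loss over all frames of a sequence pushes the generator toward the same skeleton regardless of pose.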
Figure 2. Skeleton generation overview. Point clouds sampled from mesh sequences are fed to a Transformer-based skeleton generator. An anchor skeleton from the original generator defines a canonical target; token-space and geometry-space consistency losses fine-tune the model so that decoded tokens yield temporally consistent skeletons.
Figure 3. Skinning generation pipeline overview. A high-quality anchor teacher is first generated using a pretrained generator on the anchor frame. A skinning generator is then fine-tuned: its predictions on all query points are compared against the single anchor teacher using our articulation-invariant consistency loss, forcing the model to learn a pose-invariant mapping and produce temporally consistent skinning.
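Since skinning weights are per-vertex distributions over joints, the comparison against the anchor teacher can be sketched with an L1 term plus a symmetrized KL term (the two quantities we later report in Table 2). This is a hypothetical sketch; the weighting and normalization of the actual articulation-invariant loss are assumptions.

```python
import numpy as np

def skinning_consistency(pred_w, teacher_w, eps=1e-8):
    """Compare per-vertex skinning distributions predicted on a query frame
    against the anchor-frame teacher. Hypothetical sketch: L1 plus
    symmetrized KL over the joint-weight simplex."""
    l1 = float(np.abs(pred_w - teacher_w).sum())
    p = np.clip(pred_w, eps, 1.0)
    q = np.clip(teacher_w, eps, 1.0)
    sym_kl = float((p * np.log(p / q)).sum() + (q * np.log(q / p)).sum())
    return l1, sym_kl

# toy example: 4 vertices, 3 joints; identical predictions incur zero loss
teacher = np.array([[0.7, 0.2, 0.1]] * 4)
l1_same, kl_same = skinning_consistency(teacher, teacher)

# a "flickering" prediction (vertices reassigned to another joint) is penalized
flicker = np.array([[0.1, 0.2, 0.7]] * 4)
l1_flip, kl_flip = skinning_consistency(flicker, teacher)
```

Because the teacher is fixed to a single anchor frame, the fine-tuned generator is driven toward the same weight assignment for every pose of the sequence.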
Experimental Results
We evaluate SPRig on challenging animated sequences from the DeformingThings4D dataset, demonstrating superior temporal stability compared to the state-of-the-art baseline (Puppeteer).
Skeleton Generation Consistency
Our method produces complete and topologically consistent skeletons across frames, whereas the baseline often suffers from flickering topology and missing joints.
Figure 4. Qualitative comparison of skeleton predictions. Our method produces temporally stable and complete skeletons.
Skinning Temporal Stability
We visualize the temporal inconsistency error. Our method effectively suppresses the “flickering” artifacts seen in baseline methods.
Figure 5. L1 Error Heatmap of temporal consistency. We visualize the per-vertex L1 error between the prediction on a perturbed frame and the static anchor teacher. Blue indicates zero error; red indicates high error. The baseline (left) shows large high-error regions on limbs, while our method (right) almost completely eliminates this inconsistency.
Skinning Generation Quality
Our method ensures that skinning weights (visualized by color) remain consistent on the same body parts throughout the motion sequence.
Figure 6. Qualitative comparison of skinning generation. The visualization colors each point based on the joint with maximum influence. Ours (top) produces consistent skinning assignments for arms and legs across all frames. In contrast, the baseline (bottom) exhibits significant instability, with joint assignments flickering between frames (e.g., arm changing colors).
Quantitative Analysis
Our method achieves state-of-the-art temporal stability while maintaining or improving static generation quality.
Table 1. Temporal Stability & Static Quality (Skeleton). Our method reduces geometric jitter (PJDD) by over 25×.
| Model | PJDD (↓) | BLRD (↓) | GSD (↓) | JAD | MPJPE@Anchor |
|---|---|---|---|---|---|
| Puppeteer | 17.46 | 34.37 | 0.062 | 0.343 | 0.592 |
| Ours | 0.68 | 17.74 | 0.056 | 0.380 | 0.731 |
Table 2. Temporal Consistency (Skinning). We achieve a 30.3% reduction in L1 error and a 51.3% reduction in SymKL divergence.
| Method | L1 (B,C→A) ↓ | SymKL (B,C↔A) ↓ | Entropy ↓ |
|---|---|---|---|
| Puppeteer | 1328.80 | 2226.63 | 1368.45 |
| Ours | 925.77 | 1084.71 | 1396.95 |
| Improvement | 30.3% | 51.3% | - |
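For clarity, the Improvement row reports relative reductions versus Puppeteer; a quick check against the values in Table 2:

```python
def reduction(baseline, ours):
    """Relative reduction (%) of our score vs. the baseline's."""
    return 100.0 * (baseline - ours) / baseline

l1_gain = reduction(1328.80, 925.77)   # L1 (B,C -> A), Table 2
kl_gain = reduction(2226.63, 1084.71)  # SymKL (B,C <-> A), Table 2
```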
BibTeX citation
@misc{wang2026sprigselfsupervisedposeinvariantrigging,
  title={SPRig: Self-Supervised Pose-Invariant Rigging from Mesh Sequences},
  author={Ruipeng Wang and Langkun Zhong and Miaowei Wang},
  year={2026},
  eprint={2602.12740},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.12740},
}