PACT

Self-Evolving Physical Safety Alignment for Diffusion Policies in Embodied Manipulation

Lingxuan Wu, Zijian Zhu, Lizhong Wang, Chengyang Ying, Huayu Chen, Xiao Yang, Fangming Liu, Jun Zhu
TSAIL, Tsinghua University
ICML 2026 · Spotlight
Abstract

Physical safety Alignment for Constrained Trajectories

Diffusion policies have achieved remarkable success in robotic manipulation, yet they often fail to satisfy strict physical constraints required for safe deployment. Existing approaches impose safety either prematurely during training or reactively via external guardrails at test time, limiting policy expressivity and overall scalability. We propose Physical safety Alignment for Constrained Trajectories (PACT), a self-evolving post-training framework that projects pretrained diffusion policies onto constraint-feasible regions without accessing demonstration data or task rewards. PACT distills constraint gradients into the diffusion model through a reverse-KL objective with dense supervision across timesteps. It incorporates a curriculum that progressively tightens constraints while maintaining theoretically bounded policy shift and monotone improvement, mitigating the safety-performance trade-off from catastrophic forgetting. On simulated and real-world embodied manipulation benchmarks, PACT reduces safety violations by 31.0% on average while improving task success by 30.7%.

TL;DR

PACT enables aligning your diffusion policies solely from self-rollouts

  • Self-Evolving: uses self-rollouts without demonstrations, interventions, rewards, or outcome annotations.
  • Efficient Optimization: distills constraint gradients into the policy, providing dense supervision across all diffusion timesteps.
  • Curriculum alignment: progressively tightens constraints to preserve task competence while improving safety.
  • Foundation-model compatible: plug-and-play with any diffusion-based policy, including flow policies, VLAs, and WAMs, without architectural modifications.
  • Theoretical control: bounds policy shift and ensures monotone improvement.
  • Proven in practice: improves both task success and safety across simulation and real-world manipulation.
Method

A Scalable Path to Safe Robot Policies

The Problem

Diffusion/flow policies have become a dominant paradigm for robot manipulation, but they can still produce unsafe behaviors such as unintended collisions, unsafe grasps, or physically risky motions.

What is the most scalable way to align robot policies for safety?

Our Key Insight

Safety should be enforced during post-training, similar to how LLMs are aligned after pretraining.

  • Preserves capabilities learned from large-scale pre-training, vs. training/development-time methods that require inductive biases or training-pipeline changes.
  • Introduces zero additional cost at deployment time, vs. test-time guardrails, whose cost grows with fleet size.
Pretraining capabilities plus post-training safety alignment

A Scalable Formulation

Most existing approaches require:

  • Safety-specific demonstrations, which are expensive to collect.
  • Preference or outcome annotations, which require human annotators.

Instead, we formulate safety alignment as a regularized projection of a pretrained policy $\mu_{\phi}$ onto a safe policy set $\Pi_{\text{safe}}$ defined by physical constraints $\mathcal{C}=\{x \in \mathbb{R}^d \mid c_k(s, a) \leq 0\}$:

$$ \begin{aligned} \Pi_{\mathrm{safe}} = \Big\{ \pi: \mathbb{E}_{(s, a) \sim d^\pi}c_k(s, a) \le d_k,\; \forall k \Big\}, \end{aligned} $$

which can be reduced to solving:

$$ \min_{\theta} \underbrace{\mathbb{E}_{(s, a) \sim d^{\pi_{\theta}}}}_{\text{on-policy}} \Big[\underbrace{D_{\text{KL}}\left( \pi_{\theta}(\cdot \mid s) \,\|\, \mu_{\phi}(\cdot \mid s) \right)}_{\begin{array}{@{}c@{}}\text{avoid forgetting given no}\\\text{pre-trained datasets and utility signals}\end{array}} + \underbrace{\sum_{k=1}^m \lambda_k (c_k(s, a) - d_k)}_{\text{reduce constraint violation}} \Big]. $$
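To build intuition for the objective above, the following toy sketch evaluates it for a single state with a small discrete action set and a single constraint. This is a simplification made here for illustration, not the setup used in PACT: it ignores the dependence of the state distribution on the policy, and all names and numbers (`mu`, `costs`, `lam`, `d`) are hypothetical. Under these assumptions, the standard result for KL-regularized objectives applies: the minimizer is the base policy reweighted by `exp(-lam * cost)`.

```python
import math

def kl(p, q):
    """KL divergence D_KL(p || q) between two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def objective(pi, mu, costs, lam, d):
    """Per-state objective: KL(pi || mu) + E_{a~pi}[lam * (c(a) - d)]."""
    penalty = sum(p * lam * (c - d) for p, c in zip(pi, costs))
    return kl(pi, mu) + penalty

def tilted_policy(mu, costs, lam):
    """Closed-form minimizer for this toy case: pi*(a) proportional to mu(a) * exp(-lam * c(a))."""
    weights = [m * math.exp(-lam * c) for m, c in zip(mu, costs)]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical numbers: three actions; the first violates the constraint (c > 0).
mu = [0.5, 0.3, 0.2]      # base (pretrained) policy over actions
costs = [1.0, 0.0, -0.5]  # constraint values c(a) for each action
lam, d = 2.0, 0.0         # multiplier and threshold

pi_star = tilted_policy(mu, costs, lam)
print("aligned policy:", pi_star)
print("objective(base)   :", objective(mu, mu, costs, lam, d))
print("objective(aligned):", objective(pi_star, mu, costs, lam, d))
```

The aligned policy shifts probability mass away from the violating action while staying close to the base policy, which is the trade-off the two terms of the objective encode.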

PACT is then nearly self-evolving, requiring only two ingredients:

  • Self-rollouts, i.e., learning from the policy's own experience.
  • Physical constraints, computed automatically from environment states.
PACT scalable formulation figure
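As an illustration of what a physical constraint computed from environment states might look like, here is a minimal sketch of a clearance constraint following the convention $c_k(s, a) \le 0$ for safe state-action pairs. The function names, the spherical-obstacle model, and the margin value are all assumptions made for this example; the actual constraints used by PACT are not specified in this section.

```python
import math

def clearance_constraint(obstacle_xyz, obstacle_radius, target_xyz, margin=0.01):
    """Hypothetical constraint c(s, a): positive when the commanded end-effector
    target (the action) comes closer to a spherical obstacle (part of the state)
    than its radius plus a safety margin. Units are meters."""
    distance = math.dist(obstacle_xyz, target_xyz)
    return (obstacle_radius + margin) - distance

def is_safe(constraint_value):
    """A state-action pair satisfies the constraint when c(s, a) <= 0."""
    return constraint_value <= 0.0

# Hypothetical scene: a 5 cm obstacle at the origin, two candidate targets.
far_target = (0.20, 0.0, 0.0)
near_target = (0.03, 0.0, 0.0)
print(is_safe(clearance_constraint((0.0, 0.0, 0.0), 0.05, far_target)))
print(is_safe(clearance_constraint((0.0, 0.0, 0.0), 0.05, near_target)))
```

Because such a function depends only on quantities available in the environment state and the action, it needs no human labels, which is the property the formulation relies on.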

Distillation for Efficient Optimization

A key observation is that the safety alignment objective admits an analytical optimal score function $\epsilon^*(a_t, s, t)$, namely the implicit safe teacher. PACT distills directly from this implicit teacher.

$$ \begin{aligned} \min_{\theta}\;& \mathbb{E}_{t, \epsilon, (s, a) \sim d^{\pi_{\theta}}} \| \epsilon_{\theta}(a_t, s, t) - \epsilon^*(a_t, s, t) \|^2, \\ \text{with}\;& \epsilon^*(a_t, s, t) \triangleq \epsilon_{\phi}(a_t, s, t) - \sum_{k=1}^m \lambda_k \nabla_{a_t} c_{k,t}(s, a_t) / \sigma_t \quad \text{(implicit safe teacher)}. \end{aligned} $$
  • Dense Supervision at every denoising step, achieving ~5x higher optimization efficiency than PPO-style diffusion RL.
  • Solver Flexibility enables efficient sampling with any solver during rollout.
Analytical teacher distillation versus PPO-style diffusion RL
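The distillation target can be sketched for a single noisy action $a_t$ as follows. This is a minimal illustration that applies the teacher formula exactly as written above; the function names are hypothetical, the constraint gradients $\nabla_{a_t} c_{k,t}$ are assumed to be supplied by the caller, and vectors are plain Python lists rather than the tensors a real implementation would use.

```python
def teacher_noise(eps_base, constraint_grads, lambdas, sigma_t):
    """Implicit safe teacher for one sample:
    eps* = eps_phi - (sum_k lambda_k * grad_k) / sigma_t.
    eps_base: base-policy noise prediction eps_phi(a_t, s, t), one value per action dim.
    constraint_grads: one gradient vector per constraint (assumed given)."""
    dim = len(eps_base)
    correction = [
        sum(lam * grad[i] for lam, grad in zip(lambdas, constraint_grads)) / sigma_t
        for i in range(dim)
    ]
    return [e - c for e, c in zip(eps_base, correction)]

def distillation_loss(eps_student, eps_teacher):
    """Squared L2 distance between student and teacher noise predictions."""
    return sum((a - b) ** 2 for a, b in zip(eps_student, eps_teacher))

# Hypothetical 2-D action with two constraints.
eps_base = [0.5, -0.2]
grads = [[1.0, 0.0], [0.0, 2.0]]
lambdas = [0.1, 0.2]
sigma_t = 0.5

eps_star = teacher_noise(eps_base, grads, lambdas, sigma_t)
# Before any training the student equals the base policy, so the loss
# measures how far the base prediction is from the safe teacher.
print(distillation_loss(eps_base, eps_star))
```

Since the target is available at every sampled timestep $t$, each rollout sample yields a regression signal at any noise level, which is the sense in which the supervision is dense.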
Results

Safer Policies, Better Task Success

31.0%
Average safety violation reduction

Across simulated and real-world robotic manipulation.

30.7%
Average task success improvement

PACT resolves the usual safety-performance tradeoff.

0
External annotations, demonstrations, or interventions

Alignment is nearly self-evolving.

Real-World Case: GPU Assembly

In GPU assembly, which requires millimeter-level precision, PACT enables emergent fine-grained correction behaviors that adjust trajectories into feasible regions, ensuring accurate and safe insertions.

Detailed Results
PACT success rate results
PACT safe rate results

Training efficiency comparison with on-policy baselines. Success Rate (left) and Safe Rate (right) are averaged over four tasks across training iterations. Our method demonstrates the highest training efficiency and stability.

RoboTwin detailed result table
Task success and safety across bimanual manipulation benchmarks in RoboTwin. We report Success Rate (succ.) and Safe Rate (safe) for each base policy before and after post-training with PACT.
Real-world qualitative results
Qualitative results of real-world evaluation: base policy (top) vs. policy aligned by PACT (bottom) across four manipulation tasks. PACT reduces unsafe contacts and improves task completion by correcting key failure modes: avoiding poking so that the egg is grasped securely (Transfer Egg); aligning the gripper with the nail head to prevent lateral or tilted insertion (Nail Insertion); eliminating bottle poking and cup-rim misalignment (Pour Water); and avoiding poking the heat sink while correcting installation misalignment (GPU Assembly).
Citation

BibTeX

@inproceedings{wu2026pact,
  title={PACT: Self-Evolving Physical Safety Alignment for Diffusion Policies in Embodied Manipulation},
  author={Wu, Lingxuan and Zhu, Zijian and Wang, Lizhong and Ying, Chengyang and Chen, Huayu and Yang, Xiao and Liu, Fangming and Zhu, Jun},
  booktitle={Proceedings of the 43rd International Conference on Machine Learning},
  year={2026},
  url={https://github.com/thu-ml/PACT}
}