Overview
We propose RLRC, a three-stage method: (1) we apply structured pruning to the VLA model, specifically targeting the LLM component, to remove redundant structures in a hardware-friendly manner; (2) we employ a performance recovery stage that combines supervised fine-tuning (SFT) with reinforcement learning (RL) to restore the model's effectiveness on downstream tasks; (3) we introduce optional quantization to further reduce the memory footprint, enabling efficient deployment on resource-constrained robotic platforms.
Stage 1: Structured Pruning
We adopt LLM-Pruner as an off-the-shelf structured pruning framework, applying pruning at the block-wise level with the Taylor importance criterion. To preserve the model's representational capacity and stability, we retain the first and last decoder layers and prune only the intermediate ones. Guided by empirical observations from our earlier experiments, we adopt an aggressive 90% overall pruning ratio to substantially reduce the model size.
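To make the pruning criterion concrete, the following is a minimal NumPy sketch of first-order Taylor importance scoring and of the layer-selection rule described above. The function names and array shapes are illustrative, not LLM-Pruner's actual API.

```python
import numpy as np

def taylor_importance(weights, grads):
    # First-order Taylor criterion: removing a parameter group changes the
    # loss by approximately |sum_i g_i * w_i| over the group's parameters.
    # Here axis 0 indexes groups (e.g., heads or channels); the remaining
    # axes hold each group's parameters.
    return np.abs((weights * grads).sum(axis=tuple(range(1, weights.ndim))))

def select_layers_to_prune(num_layers):
    # Keep the first and last decoder layers intact; only the
    # intermediate layers are candidates for structured pruning.
    return list(range(1, num_layers - 1))
```

Groups with the lowest importance scores are removed until the target ratio is reached; pruning whole groups (rather than individual weights) is what keeps the resulting model dense and hardware-friendly.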
Stage 2: Performance Recovery
Compared to unstructured pruning, structured pruning causes greater performance degradation, especially at high pruning ratios such as 90%. To mitigate this, we first apply supervised fine-tuning (SFT) on task-specific data, allowing the pruned VLA model to adapt to its reduced architecture. However, SFT alone cannot fully recover performance, particularly once aggressive 4-bit quantization is applied.
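The SFT objective is the standard token-level negative log-likelihood on the action tokens. A minimal NumPy sketch of that loss, with illustrative shapes (batch of token positions × vocabulary), is:

```python
import numpy as np

def token_cross_entropy(logits, targets):
    # Mean negative log-likelihood of the target action tokens --
    # the loss minimized during supervised fine-tuning (SFT).
    # logits: (num_tokens, vocab_size); targets: (num_tokens,)
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()
```

Minimizing this loss on task-specific demonstrations lets the pruned model re-learn to emit the same action-token distributions as before pruning.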
To address this, we incorporate reinforcement learning using Proximal Policy Optimization (PPO), which dynamically adjusts model parameters to recover fine-grained decision-making capabilities. Following Liu et al., we design a shared Transformer backbone for the actor-critic setup, where the critic estimates value from the hidden state of the first action token via a lightweight MLP. Sparse rewards and strong SFT initialization enable efficient and scalable RL fine-tuning (RLFT), enhancing performance in both seen and unseen tasks.
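The two pieces of this setup can be sketched in a few lines of NumPy: the PPO clipped surrogate objective, and a lightweight MLP value head reading the hidden state of the first action token. Weight shapes and names here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    # PPO clipped surrogate: bounds how far the updated policy can
    # move from the policy that collected the rollouts.
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -np.minimum(unclipped, clipped).mean()

def critic_value(h_first_action, w1, b1, w2, b2):
    # Lightweight MLP value head on the hidden state of the first
    # action token, as in the shared-backbone actor-critic design.
    h = np.maximum(h_first_action @ w1 + b1, 0.0)  # ReLU hidden layer
    return h @ w2 + b2
```

Because the actor and critic share the Transformer backbone, the critic adds only this small MLP head on top of existing features, keeping RL fine-tuning cheap relative to the SFT stage.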
Stage 3: Quantization
After applying SFT and RL, the pruned VLA achieves task execution performance comparable to, or even surpassing, that of the original VLA. Building upon this strong foundation, we further explore 4-bit quantization to achieve extreme memory compression.
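As an illustration of how 4-bit compression works, the sketch below implements symmetric per-block 4-bit quantization in NumPy: each block of weights stores one floating-point scale plus integer codes in [-8, 7]. This is a simplified stand-in; the actual quantization scheme used for deployment (e.g., its block size and code format) may differ.

```python
import numpy as np

def quantize_4bit(w, block=64):
    # Symmetric per-block 4-bit quantization: one fp32 scale per block,
    # weights mapped to signed 4-bit integer codes in [-8, 7].
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale, shape):
    # Reconstruct approximate fp32 weights from codes and per-block scales.
    return (q.astype(np.float32) * scale).reshape(shape)
```

Since every weight is reduced to 4 bits plus a small per-block scale overhead, this yields roughly a 4x memory reduction over fp16 storage, on top of the size reduction already obtained by pruning.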