OpenAI has developed a new method called Rule-Based Rewards (RBRs) to improve the safety behaviour of AI models without relying on extensive human data collection.

The method is designed to align AI models with desired safety behaviours more efficiently than traditional reinforcement learning from human feedback (RLHF).

The RBR approach uses clear, simple rules to evaluate model outputs against safety standards. This new method integrates with standard RLHF pipelines to balance helpfulness and safety, categorising desired model behaviour into hard refusals, soft refusals, and compliance. One of the key advantages of RBRs is their flexibility, allowing for quick updates to safety guidelines without the need for extensive retraining.
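To illustrate the general idea, the sketch below shows how a rule-based reward might work in principle: a set of binary propositions is checked against a model response, combined into a weighted score, and added to the standard RLHF reward. The proposition names, weights, and string checks are purely illustrative assumptions, not OpenAI's actual rules or implementation (which uses a grader model rather than keyword matching).

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical propositions: binary checks applied to a model response.
# In practice these would be graded by a judge model against natural-language rules.
Proposition = Callable[[str], bool]


@dataclass
class RuleBasedReward:
    """Illustrative rule-based reward: a weighted sum of graded propositions."""
    propositions: Dict[str, Proposition]
    weights: Dict[str, float]  # weights could be tuned on a small synthetic dataset

    def score(self, response: str) -> float:
        return sum(
            self.weights[name] * float(check(response))
            for name, check in self.propositions.items()
        )


# Example rules for a prompt that calls for a hard refusal (illustrative only).
hard_refusal_rbr = RuleBasedReward(
    propositions={
        "contains_apology": lambda r: "sorry" in r.lower(),
        "states_inability": lambda r: "can't help" in r.lower() or "cannot help" in r.lower(),
        "judgemental_tone": lambda r: "you should be ashamed" in r.lower(),
        "provides_disallowed_content": lambda r: "here is how to" in r.lower(),
    },
    weights={
        "contains_apology": 1.0,
        "states_inability": 1.0,
        "judgemental_tone": -2.0,             # penalise lecturing the user
        "provides_disallowed_content": -5.0,  # heavily penalise unsafe compliance
    },
)


def total_reward(helpfulness_reward: float, response: str) -> float:
    """Combine the helpfulness score from the RLHF reward model with the RBR score."""
    return helpfulness_reward + hard_refusal_rbr.score(response)


print(total_reward(0.3, "I'm sorry, but I can't help with that request."))
```

Because the rules are written in plain, checkable terms, updating a safety guideline amounts to editing or reweighting propositions rather than collecting new human preference data and retraining a reward model.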

OpenAI reports that RBR-trained models have demonstrated safety performance comparable to those trained with human feedback, while reducing cases where safe requests are incorrectly refused. The company has already applied this method to its models since the GPT-4 launch, including the recent GPT-4o mini.

While emphasising the potential of RBRs, OpenAI acknowledges that the approach can be combined with human feedback for tasks requiring more nuanced judgement. RBRs could accelerate the development of AI models that better align with human values and safety requirements. However, ethical concerns must also be considered, such as reduced human oversight and the risk of amplifying biases if the rules are not carefully designed.


