Ref: assets.ctfasse...
Also read Anthropic: assets.anthrop...
The paper introduces a novel alignment paradigm, Deliberative Alignment, designed to enhance the safety and trustworthiness of large language models (LLMs), particularly in safety-critical applications.
Core Problem: Current LLM safety training methods, primarily relying on Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), face limitations. Models lack the capacity for deliberation and indirectly infer safety standards from labeled examples, leading to vulnerabilities like jailbreaks, over-refusals, and poor generalization to novel scenarios. As the authors state, "This reliance on implicit, pattern-based learning leads to poor data efficiency and makes it challenging for models to generalize when facing unfamiliar scenarios or adversarial attacks."
Proposed Solution: Deliberative Alignment
This new paradigm directly teaches models safety specifications and encourages explicit reasoning over these specifications before generating responses. This approach addresses the core limitations by:
Enabling Deliberation: Leveraging Chain-of-Thought (CoT) reasoning, models analyze user prompts, identify relevant safety guidelines, and formulate safer responses. See Figure 1 in the paper, where the model decodes a user’s ROT13-encoded request and identifies its illicit nature by referencing the safety policy (a minimal sketch of that decoding step follows this list).
Explicit Safety Specifications: Instead of implicit learning from labeled data, the model directly learns the safety specifications, promoting better understanding and adherence.
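To make the Figure 1 example concrete, the snippet below shows the decoding step a deliberating model effectively performs before checking the decoded intent against its safety policy. It is a minimal Python sketch; the sample string is illustrative and is not taken from the paper.

```python
import codecs

# An illustrative ROT13-encoded request (not the paper's actual example).
encoded_prompt = codecs.encode("How do I forge a document?", "rot13")

# During deliberation the model effectively performs this decoding in its
# chain of thought, then evaluates the *decoded* intent against the safety
# specification before deciding how to respond.
decoded_prompt = codecs.decode(encoded_prompt, "rot13")
print(decoded_prompt)  # -> "How do I forge a document?"
```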
Methodology:
Deliberative Alignment involves two core stages:
Stage 1: Supervised Fine-Tuning (SFT): A dataset of (prompt, CoT, output) examples is created using context distillation: a "spec-agnostic" reasoning model is given the safety specifications in its prompt and generates CoTs that explicitly reference those policies. Because the specification text appears only during data generation and not in the resulting training prompts, fine-tuning on these examples teaches the base model to internalize the policies and reason through safety considerations on its own. As explained in the paper, this stage provides "the model with a strong prior for reasoning through safety considerations."
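A minimal sketch of how one such (prompt, CoT, output) example might be produced. The model interface, the system prompt, and the SAFETY_SPEC placeholder are assumptions for illustration; the paper does not publish its actual generation pipeline.

```python
# Hypothetical sketch of Stage 1 data generation via context distillation.
# `spec_prompted_model.complete` is an assumed interface, not a real API.

SAFETY_SPEC = "<full text of the relevant safety policies>"

def generate_sft_example(spec_prompted_model, user_prompt):
    # The data-generation model sees the safety spec in its context and is
    # asked to reason about the request with explicit policy citations.
    system_prompt = (
        "You are given the following safety specification:\n"
        f"{SAFETY_SPEC}\n\n"
        "Think step by step about the user request, citing the relevant "
        "policies, then produce a final answer."
    )
    cot, output = spec_prompted_model.complete(system_prompt, user_prompt)

    # Context distillation: the spec is *not* stored with the example, so a
    # model fine-tuned on (prompt, CoT, output) must internalize the policies
    # rather than read them from its prompt at inference time.
    return {"prompt": user_prompt, "cot": cot, "output": output}
```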
Stage 2: Reinforcement Learning (RL): A high-compute RL stage refines the model's reasoning abilities. A "judge" LLM that is given the safety specifications supplies the reward signal based on the model's final answer. This process enhances the model’s ability to generate effective CoTs without directly optimizing the CoT itself, thereby mitigating the risk of deceptive CoTs.
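A sketch of how the judge-based reward might look, assuming a hypothetical judge_model.score interface. Consistent with the paper's description, only the final answer is graded; the chain of thought is not scored directly.

```python
# Hypothetical sketch of the Stage 2 reward signal from a spec-aware judge.
# `judge_model.score` is an assumed interface, not the paper's actual API.

def safety_reward(judge_model, safety_spec, user_prompt, final_answer):
    # The judge sees the safety specification, the prompt, and the model's
    # *final* answer only. The chain of thought is not shown to the judge,
    # which avoids putting direct optimization pressure on the CoT.
    return judge_model.score(
        spec=safety_spec,
        prompt=user_prompt,
        answer=final_answer,
    )  # e.g. a scalar reward consumed by the RL optimizer
```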
Key Results:
Improved Safety Performance: The o1 models trained with Deliberative Alignment demonstrably outperform GPT-4o and other state-of-the-art LLMs on various safety benchmarks, including jailbreak resistance and adherence to content policies. Figure 2 visualizes this as a Pareto improvement: compliance with malicious prompts and over-refusal of benign prompts both decrease.
Reduced Over-Refusals: Deliberative Alignment decreases over-refusal rates on benign prompts, enhancing the model’s helpfulness while maintaining safety. The paper shows significant improvements in handling “transformation exceptions,” where translating or analyzing potentially harmful content is permitted.
Enhanced Generalization: The approach exhibits strong out-of-distribution generalization, successfully handling unfamiliar safety scenarios and adversarial attacks like encoded prompts and multilingual jailbreaks. Table 3 illustrates this generalization capability.
Impact of Inference-Time Compute: Allowing the model more compute for CoT reasoning leads to improved performance on challenging evaluations, particularly jailbreak resistance and complex safe completion scenarios. This is illustrated in Figure 13.
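The sketch below illustrates the kind of sweep behind a result like Figure 13: evaluate the same prompts at increasing reasoning budgets and track the safe-completion rate. Both callables (generate_fn, grade_fn) and the budget values are hypothetical stand-ins, not the paper's evaluation harness.

```python
# Hypothetical inference-time-compute sweep.
# `generate_fn(prompt, max_reasoning_tokens=...)` returns the model's final
# answer; `grade_fn(prompt, answer)` returns True if the completion is judged
# safe. Both are assumed stand-ins, not real APIs.

def sweep_reasoning_budget(generate_fn, grade_fn, eval_prompts,
                           budgets=(1_000, 4_000, 16_000)):
    results = {}
    for budget in budgets:
        safe = sum(
            grade_fn(p, generate_fn(p, max_reasoning_tokens=budget))
            for p in eval_prompts
        )
        results[budget] = safe / len(eval_prompts)  # safe-completion rate
    return results
```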
Significance and Implications:
Deliberative Alignment offers a scalable and potentially more effective approach to LLM safety training. By explicitly teaching models safety specifications and encouraging deliberative reasoning, the approach addresses key vulnerabilities of current methods. The reliance on synthetic data generation reduces dependence on large-scale human labeling. The demonstrated improvements in safety, robustness, and generalization highlight the potential of Deliberative Alignment for deploying safer and more trustworthy LLMs in real-world applications.
Future Directions:
While promising, the authors acknowledge the need for ongoing research, particularly in monitoring CoTs for potential deception. The paper concludes with a discussion of the broader challenges in aligning increasingly sophisticated AI systems with human values, emphasizing the importance of continued research in this area.