Deliberative Alignment Safer Language Models through Reasoning

  Рет қаралды 34

Srikanth Bhakthan

Srikanth Bhakthan

Күн бұрын

Ref: assets.ctfasse...
Also read Anthropic: assets.anthrop...
The paper introduces a novel alignment paradigm, Deliberative Alignment, designed to enhance the safety and trustworthiness of large language models (LLMs), particularly in safety-critical applications.
Core Problem: Current LLM safety training methods, primarily relying on Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), face limitations. Models lack the capacity for deliberation and indirectly infer safety standards from labeled examples, leading to vulnerabilities like jailbreaks, over-refusals, and poor generalization to novel scenarios. As the authors state, "This reliance on implicit, pattern-based learning leads to poor data efficiency and makes it challenging for models to generalize when facing unfamiliar scenarios or adversarial attacks."
Proposed Solution: Deliberative Alignment
This new paradigm directly teaches models safety specifications and encourages explicit reasoning over these specifications before generating responses. This approach addresses the core limitations by:
Enabling Deliberation: Leveraging Chain-of-Thought (CoT) reasoning, models analyze user prompts, identify relevant safety guidelines, and formulate safer responses. See Figure 1 in the paper, where the model decodes a user’s ROT13 encoded request and identifies its illicit nature by referencing the safety policy.
Explicit Safety Specifications: Instead of implicit learning from labeled data, the model directly learns the safety specifications, promoting better understanding and adherence.
Methodology:
Deliberative Alignment involves two core stages:
Stage 1: Supervised Fine-Tuning (SFT): A dataset of (prompt, CoT, output) examples is created using context distillation. A "spec-agnostic" reasoning model is prompted with safety specifications and generates CoTs that reference these policies. These CoTs are then used to fine-tune the base model, teaching it to reason through safety considerations. As explained in the paper, this stage provides "the model with a strong prior for reasoning through safety considerations."
Stage 2: Reinforcement Learning (RL): A high-compute RL stage refines the model's reasoning abilities. A "judge" LLM, provided with the safety specifications, provides reward signals. This process enhances the model’s ability to generate effective CoTs without directly optimizing the CoT itself, thereby mitigating the risk of deceptive CoTs.
Key Results:
Improved Safety Performance: The o1 models trained with Deliberative Alignment demonstrably outperform GPT-4o and other state-of-the-art LLMs on various safety benchmarks, including jailbreak resistance and adherence to content policies. This is visualized in Figure 2, demonstrating a Pareto improvement in reducing both malicious prompt compliance and benign prompt over-refusal.
Reduced Over-Refusals: Deliberative Alignment decreases over-refusal rates on benign prompts, enhancing the model’s helpfulness while maintaining safety. The paper shows significant improvements in handling “transformation exceptions,” where translating or analyzing potentially harmful content is permitted.
Enhanced Generalization: The approach exhibits strong out-of-distribution generalization, successfully handling unfamiliar safety scenarios and adversarial attacks like encoded prompts and multilingual jailbreaks. Table 3 illustrates this generalization capability.
Impact of Inference-Time Compute: Allowing the model more compute for CoT reasoning leads to improved performance on challenging evaluations, particularly jailbreak resistance and complex safe completion scenarios. This is illustrated in Figure 13.
Significance and Implications:
Deliberative Alignment offers a scalable and potentially more effective approach to LLM safety training. By explicitly teaching models safety specifications and encouraging deliberative reasoning, the approach addresses key vulnerabilities of current methods. The reliance on synthetic data generation reduces dependence on large-scale human labeling. The demonstrated improvements in safety, robustness, and generalization highlight the potential of Deliberative Alignment for deploying safer and more trustworthy LLMs in real-world applications.
Future Directions:
While promising, the authors acknowledge the need for ongoing research, particularly in monitoring CoTs for potential deception. The paper concludes with a discussion of the broader challenges in aligning increasingly sophisticated AI systems with human values, emphasizing the importance of continued research in this area.
Created with NotebookLM

Пікірлер
Build something great with Open Source AI Stack
23:54
Srikanth Bhakthan
Рет қаралды 34
Neural and Non-Neural AI, Reasoning, Transformers, and LSTMs
1:39:39
Machine Learning Street Talk
Рет қаралды 82 М.
How Strong Is Tape?
00:24
Stokes Twins
Рет қаралды 96 МЛН
coco在求救? #小丑 #天使 #shorts
00:29
好人小丑
Рет қаралды 120 МЛН
Сестра обхитрила!
00:17
Victoria Portfolio
Рет қаралды 958 М.
To Brawl AND BEYOND!
00:51
Brawl Stars
Рет қаралды 17 МЛН
Do you want the GOOD news or the BAD news?
9:00
Just Have a Think
Рет қаралды 124 М.
Learning Dominant Dynamics for Continuum Robot Control (John Alora, PhD Defense)
1:02:36
Stanford Autonomous Systems Laboratory
Рет қаралды 686
Attention in transformers, visually explained | DL6
26:10
3Blue1Brown
Рет қаралды 1,9 МЛН
Noam Chomsky: On China, Artificial Intelligence, & The 2024 Presidential Election.
1:03:24
Through Conversations Podcast
Рет қаралды 1 МЛН
What Is (Almost) Everything Made Of?
1:25:49
History of the Universe
Рет қаралды 3 МЛН
Pattern Recognition vs True Intelligence - Francois Chollet
2:42:55
Machine Learning Street Talk
Рет қаралды 53 М.
Generative AI in a Nutshell - how to survive and thrive in the age of AI
17:57
Transformers (how LLMs work) explained visually | DL5
27:14
3Blue1Brown
Рет қаралды 4,1 МЛН
Alexandra Daisy Ginsberg - BETTER NATURE
1:38:52
Architecture of Territory | Prof. Milica Topalovic
Рет қаралды 274
Scouting Frontiers in AI for Biology: Dynamics, Diffusion, and Design, with Amelie Schreiber
1:45:55
Cognitive Revolution "How AI Changes Everything"
Рет қаралды 10 М.
How Strong Is Tape?
00:24
Stokes Twins
Рет қаралды 96 МЛН