Invited talk by Adam Gleave on September 16, 2024 at UCL DARK.
Title:
Adversarial Robustness of Superhuman AI Systems
Abstract:
A combination of algorithmic advances and increases in model size, dataset size, and training compute has produced increasingly capable models in the average case, even achieving superhuman performance on a wide variety of tasks. However, safety-critical tasks demand not just good average-case performance but worst-case guarantees. We will start by sharing vulnerabilities we discovered in superhuman Go AIs and our attempts to defend against them. We will then turn our attention to jailbreaks in LLMs, comparing scaling trends in capabilities and robustness. Our results suggest that model scale alone does little to improve robustness, but that defences such as adversarial training are more sample-efficient in larger models.