EPFL AI Center - Adversarial attacks as a baby version of A(G)I alignment - Stanislav Fort

This talk is part of the AI Fundamentals seminar series organized by the EPFL AI Center.
Title
Adversarial attacks as a baby version of A(G)I alignment
Abstract
Adversarial attacks pose a significant challenge to the robustness, reliability and alignment of deep neural networks, from simple computer vision models to hundred-billion-parameter language models. Despite their ubiquity, our theoretical understanding of their character and ultimate causes, as well as our ability to defend against them, is noticeably lacking. This talk examines the robustness of modern deep learning methods and the surprising scaling of attacks on them, and showcases several practical examples of transferable attacks on the largest closed-source vision-language models. Building on biological insights and new empirical evidence, I will introduce the solution we proposed in [1], in which we take a step towards aligning the implicit human and the explicit machine vision representations, closely connecting interpretability and robustness. I will conclude with a direct analogy between the problem of adversarial examples and the much larger task of general AI alignment.
[1] Ensemble everything everywhere: Multi-scale aggregation for adversarial robustness. Stanislav Fort, Balaji Lakshminarayanan
Bio
Stanislav Fort is a senior research scientist at Google DeepMind, specializing in robustness, interpretability and safety. He received his PhD in 2022 from Stanford University with Prof. Surya Ganguli. In the past, Stanislav spent time at Google Brain as an AI Resident, worked on the Claude model at Anthropic, and led the language model team at Stability AI. He received his Bachelor's and Master's degrees in theoretical physics from the University of Cambridge.
Academic publications: scholar.google...
Personal website: stanislavfort....