Tokenized and Continuous Embedding Compressions of Protein Sequence and Structure (CHEAP)


ML for protein engineering seminar series


Tuesday, October 1st, 4-5pm EST | Amy Lu, PhD student (UC Berkeley)
Existing protein machine learning representations typically model either the sequence or the structure distribution, leaving the other modality implicit. The latent space of sequence-to-structure prediction models such as ESMFold represents the joint distribution of sequence and structure; however, we find that these embeddings exhibit massive activations, whereby some channels have values 3000x higher than others, regardless of the input. Further, with continuous compression schemes, ESMFold embeddings can be reduced by a factor of 128x along the channel dimension and 8x along the length, while retaining structure information at 2 Å scale accuracy and performing competitively on protein function and localization benchmarks. With discrete compression schemes, we construct a tokenized all-atom structure vocabulary that retains high reconstruction accuracy, thus introducing a tokenized representation of all-atom structure that can be obtained from sequence alone. We term this series of embeddings CHEAP (Compressed Hourglass Embedding Adaptations of Proteins) embeddings, obtained via the HPCT (Hourglass Protein Compression Transformer) architecture. CHEAP is a compact representation of both protein structure and sequence, sheds light on information content asymmetries between sequence and structure, and democratizes representations captured by large models.
Preprint: www.biorxiv.or...
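
To make the two numbers in the abstract concrete, here is a minimal NumPy sketch, not the HPCT architecture or any code from the preprint. It runs on synthetic data only: the 1024-channel width, the sequence length of 256, and the helper names `channel_magnitude_ratio` and `compressed_shape` are assumptions chosen for illustration. The first helper shows one simple way to spot a "massive activation" channel; the second only does the shape arithmetic implied by 8x length and 128x channel reduction.

```python
from typing import Tuple

import numpy as np


def channel_magnitude_ratio(emb: np.ndarray) -> float:
    """Ratio of the largest to the median per-channel mean absolute value.

    emb has shape (length, channels). A ratio far above 1 suggests that a
    few channels dominate the embedding regardless of position.
    """
    per_channel = np.abs(emb).mean(axis=0)
    return float(per_channel.max() / np.median(per_channel))


def compressed_shape(length: int, channels: int,
                     length_factor: int = 8,
                     channel_factor: int = 128) -> Tuple[int, int]:
    """Shape after shortening and narrowing by the given factors.

    The length is rounded up, assuming the sequence is padded to a
    multiple of length_factor (an assumption of this sketch).
    """
    return (-(-length // length_factor), channels // channel_factor)


# Synthetic stand-in for a per-residue embedding (NOT real ESMFold output).
rng = np.random.default_rng(0)
emb = rng.normal(size=(256, 1024))   # assumed: 256 residues, 1024 channels
emb[:, 7] *= 3000.0                  # inject one artificially massive channel

ratio = channel_magnitude_ratio(emb)
shape = compressed_shape(256, 1024)
print(f"max/median channel magnitude: {ratio:.0f}")
print(f"compressed shape: {shape}")
```

Under these assumed sizes, a 256 x 1024 embedding would shrink to 32 x 8, i.e. 1024 times fewer values per protein.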

Comments: 2
@patrickjiang402 (6 days ago)
Wonderful talk! Thank you so much for sharing again!
@hannespeter1484 (5 days ago)
Do you think that can be applied to generative protein structure decoders as well?