UoA ML Seminar: William Stafford Noble - DL applications in MS proteomics and single-cell genomics

859 views

Machine Learning Group - University of Auckland

2 months ago

William Stafford Noble - Deep learning applications in mass spectrometry proteomics and single-cell genomics
William Stafford Noble is a Professor in the Department of Genome Sciences and in the Paul G. Allen School of Computer Science and Engineering at the University of Washington. He received his Ph.D. in computer science and cognitive science from the University of California, San Diego in 1998. Dr. Noble’s research applies statistical and machine learning methods to the analysis of complex biological data sets. He is the author of more than 300 peer-reviewed publications and has advised 34 postdoctoral fellows and 24 PhD students. William is the recipient of the International Society for Computational Biology Innovator award, is on the Clarivate Analytics list of “Highly cited researchers,” and is a Fellow and former member of the Board of Directors of the ISCB.
In this talk, I will describe several recent and ongoing projects that apply deep neural networks to the analysis of large compendia of biological data. The first two projects operate on protein tandem mass spectrometry data. We first show how a Siamese architecture can be trained in a supervised fashion to embed individual mass spectra into a 32-dimensional space, yielding a compact representation that enables large-scale, highly accurate clustering of the spectra and significantly enhances our ability to assign observed spectra to their corresponding peptide sequences. The second project uses a model with a transformer architecture to perform de novo peptide sequencing by translating directly from a mass spectrum (a sequence of peaks) to a peptide (a sequence of amino acids). The resulting model, trained on 30 million spectra, outperforms existing methods and enhances our ability to interpret various types of mass spectrometry data. Finally, the third project aims to jointly analyze two types of single-cell genomics data, one measuring gene expression (scRNA-seq) and the other measuring local chromatin structure (scATAC-seq). The model is trained using a combination of co-assay data and traditional “single-assay” data, first learning an autoencoder for each data modality and then using just the co-assay data to train a translator between the embedded representations learned by the autoencoders. The resulting model is able to translate between modalities with improved accuracy relative to state-of-the-art translation techniques and also produces a matching of cells across modalities.
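
The two-stage training scheme of the third project (one autoencoder per modality, then a translator fitted on co-assay cells only) can be illustrated with a small sketch. This is not the model from the talk: it swaps the deep autoencoders for linear, PCA-style ones and the translator for a least-squares map, and it runs on simulated data. All function names, dimensions, and noise levels below are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_linear_autoencoder(X, k):
    """Rank-k linear autoencoder (PCA): a stand-in for a deep autoencoder."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]  # components have shape (k, n_features)

def encode(X, model):
    mean, components = model
    return (X - mean) @ components.T

def decode(Z, model):
    mean, components = model
    return Z @ components + mean

def add_bias(Z):
    return np.hstack([Z, np.ones((Z.shape[0], 1))])

def fit_translator(Z_src, Z_dst):
    """Least-squares linear map (with bias) between two embedding spaces."""
    W, *_ = np.linalg.lstsq(add_bias(Z_src), Z_dst, rcond=None)
    return W

# Simulated cells: both modalities are noisy linear views of a shared
# hidden state (an assumption made purely for this illustration).
n_latent, n_genes, n_peaks = 3, 20, 30
A_rna = rng.normal(size=(n_latent, n_genes))
A_atac = rng.normal(size=(n_latent, n_peaks))

def simulate(n_cells):
    latent = rng.normal(size=(n_cells, n_latent))
    rna = latent @ A_rna + 0.05 * rng.normal(size=(n_cells, n_genes))
    atac = latent @ A_atac + 0.05 * rng.normal(size=(n_cells, n_peaks))
    return rna, atac

rna_only, _ = simulate(500)          # single-assay "scRNA-seq" cells
_, atac_only = simulate(500)         # single-assay "scATAC-seq" cells
rna_pair, atac_pair = simulate(100)  # co-assay cells (both measured)
rna_test, atac_test = simulate(50)   # held-out co-assay cells

# Stage 1: one autoencoder per modality, using all data for that modality.
rna_ae = fit_linear_autoencoder(np.vstack([rna_only, rna_pair]), n_latent)
atac_ae = fit_linear_autoencoder(np.vstack([atac_only, atac_pair]), n_latent)

# Stage 2: translator between embeddings, fitted on co-assay cells only.
W = fit_translator(encode(rna_pair, rna_ae), encode(atac_pair, atac_ae))

# Translate held-out cells: encode, map across embeddings, decode.
atac_pred = decode(add_bias(encode(rna_test, rna_ae)) @ W, atac_ae)
mse_translated = float(np.mean((atac_pred - atac_test) ** 2))
mse_baseline = float(np.mean((atac_only.mean(axis=0) - atac_test) ** 2))
print(f"translated MSE: {mse_translated:.4f}, mean baseline MSE: {mse_baseline:.4f}")
```

The point of the split is that the plentiful single-assay cells shape each embedding, while the scarcer co-assay cells are needed only to link the two embedding spaces.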