Pseudo-bulk analysis for single-cell RNA-Seq data | Detailed workflow tutorial

Views: 26,888

Bioinformagician

A day ago

Comments: 76

@surfer101ist 1 year ago

Super helpful as a biologist with some CS training. Thank you!!!

@jimmylao349 1 year ago

Very good explanation, I got the final script to try. Thanks

@prasadchaskar8542 2 years ago

Thanks a lot for the tutorial. Could you please add a tutorial on trajectory analysis?

@Bioinformagician 2 years ago

Working on that. Please stay tuned! :)

@prasadchaskar8542 2 years ago

Thanks a lot.

@learningtime1367 2 years ago

Thanks so much! Can you please do a video on GO/KEGG analysis for bulk RNA-seq data? Thanks again.

@Bioinformagician 2 years ago

Thanks for the suggestion. I have plans to make a video covering this topic. Please stay tuned :)

@pegahhejazi8399 1 year ago

Hello, thank you for the super helpful tutorial. I have a question regarding my own dataset. I have 3 groups (3 replicates each): young, old+treatment, and old w/ treatment. Does this tutorial apply to comparing 3 groups? If not, do you have other tutorials for that kind of dataset?

@raghavsharma4347 1 year ago

Why do you have a young dataset? Is it meant to be a control? You will need to set your design formula to ~ age + treatment, and your contrasts will need to compare treatment against no treatment.
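As a rough illustration of what a `~ age + treatment` design encodes, here is a minimal sketch in plain Python (not the R used in the video). The sample names, the group labels, and the assumption that the three groups are young untreated, old untreated and old treated are all made up for illustration.

```python
# Hypothetical sample sheet: names and labels are assumptions for illustration.
samples = [
    {"sample": "s1", "age": "young", "treatment": "no"},
    {"sample": "s2", "age": "old", "treatment": "no"},
    {"sample": "s3", "age": "old", "treatment": "yes"},
]

def design_row(meta, age_ref="young", trt_ref="no"):
    """Dummy-code one sample for ~ age + treatment (reference levels get 0)."""
    return [
        1,                                  # intercept
        int(meta["age"] != age_ref),        # age effect (old vs young)
        int(meta["treatment"] != trt_ref),  # treatment effect (yes vs no)
    ]

design = [design_row(m) for m in samples]
for meta, row in zip(samples, design):
    print(meta["sample"], row)
```

In this coding, the third column carries the treatment effect with age held fixed, which is the comparison the reply describes.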

@aravindsundar4968 1 year ago

Great tutorial! Thanks for sharing.

@bondjams8084 2 years ago

Thank you so much! Your videos are so good!

@jakobhansen5477 11 months ago

Thank you for a great video! What if I have very different cell counts in the clusters I want to compare? I would expect very different expression just due to the different cell counts. Will a normalization step in DESeq2 cancel out this difference?

@davidepasini3807 2 years ago

Hi, thanks for the video and the nice explanation. This video comes at the right time; I had been thinking of trying this kind of analysis these days. I watched and tried your tutorial, and I wondered how much the number of cells per sample matters. For example, in your case (looking at B cells) there are 864 cells for ind 1015 and 81 for ind 1039. Does this affect the analysis?

@Bioinformagician 2 years ago

If I understand you correctly, you are asking whether the number of cells per sample affects the analysis. I would think not, because we are aggregating instead of averaging the counts across all cells to the sample level. So the number of cells should not affect the count values.
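To make the aggregating-versus-averaging distinction concrete, here is a small sketch in plain Python (not the R code from the video). The per-cell counts are invented and only the two individual IDs are taken from the question above.

```python
from collections import defaultdict

# Toy per-cell raw counts for one gene: (sample_id, count). Values are invented.
cells = [("ind1015", 2), ("ind1015", 0), ("ind1015", 3), ("ind1015", 1),
         ("ind1039", 2), ("ind1039", 1)]

def pseudobulk(cells):
    """Return per-sample summed counts and the number of cells behind each sum."""
    totals, n_cells = defaultdict(int), defaultdict(int)
    for sample, count in cells:
        totals[sample] += count
        n_cells[sample] += 1
    return dict(totals), dict(n_cells)

totals, n_cells = pseudobulk(cells)
means = {s: totals[s] / n_cells[s] for s in totals}
print(totals)  # summed counts, the pseudo-bulk input
print(means)   # per-cell averages, shown only for comparison
```

One caveat worth keeping in mind: the summed totals do grow with the number of cells in a sample, so samples with very different cell numbers end up with different library sizes. DESeq2 estimates per-sample size factors, which is the step that would be expected to account for such depth differences.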

@anguscampbell3020 1 year ago

@Bioinformagician A number of methods argue that dropout in scRNA-seq data needs to be accounted for. It would be great if you could do a tutorial on MAST, which is supposed to be able to account for this and to differentiate between biological and technical variability in cell-specific UMI counts.

@saraalidadiani5881 6 months ago

Thank you for the nice video. Just a question: how do I account for two covariates, such as sex and age, in differential gene expression analysis of single-cell RNA-seq data? Thanks!

@subhasen2611 2 years ago

Thanks for the nice tutorials. Will you be adding any tutorial on trajectory analysis / cell fate decisions?

@Bioinformagician 2 years ago

Yes, I will be making a video covering these topics. Thanks for the suggestion! :)

@tushardhyani3931 2 years ago

Thank you for this video!!

@mayconmarcao4554 2 years ago

Graceful tutorial! I wonder which would be better as input for modeling a phenotype prediction: i) pseudobulk or ii) single-cell expression levels? Thanks for your existence =].

@Bioinformagician 2 years ago

What is the outcome that you are hoping to predict? I do not have experience with statistical modeling, so I am afraid I might not have useful input.

@mayconmarcao4554 2 years ago

@Bioinformagician I think I misunderstood the pseudobulk concept. Pseudobulk turns a single-cell matrix into a patient-based matrix (as in bulk RNA-seq). What I thought pseudobulk was: I thought that with pseudobulk I'd be able to concatenate similar cells within a cell cluster to increase gene expression signals. But in that case pseudobulk would not represent patients but subclusters. Do you know if I can adapt the pseudobulk strategy to aggregate subclusters?

@熊飞-b5k 2 years ago

Hi, thanks for the video, this is very helpful. Will you be adding any tutorial for monocle3? Thank you again for these wonderful videos.

@Bioinformagician 2 years ago

Yes, I definitely have plans to make videos using monocle3.

@熊飞-b5k 2 years ago

@Bioinformagician Hi there. In my study I face another problem: is it possible to compare two conditions without repetition within a certain cell type? Which analysis method or package could be used? Hoping for your reply.

@Bioinformagician 2 years ago

@熊飞-b5k Can you explain what you mean by "compare two conditions without repetition within a certain cell type?" Do you mean you want to restrict the comparison between two conditions to only certain clusters?

@zahraabdi1613 2 years ago

@熊飞-b5k I have the same problem. If you have found a solution, would you mind explaining it to me, please?

@wi1lhunting 3 months ago

Very good videos, but why is my 'cts' not a table? Is it because of Seurat v5? Can you tell me the answer? Thank you!!!

@ncedilemankahla9758 4 months ago

Excellent video

@singhh5050 2 years ago

Hi! Do you think that pseudobulk analysis or GSEA is better for downstream analysis of scRNA-seq data, especially considering that there may be two different conditions (experimental and control)? What are the advantages and disadvantages of using each method?

@Bioinformagician 2 years ago

Pseudobulking and GSEA are completely different methods serving different purposes. Either downstream analysis can make sense, depending on the goal of your analysis. Typically, pseudobulking is performed to find differentially expressed genes, after which we use enrichment methods to find which pathways/GO terms are enriched.

@singhh5050 2 years ago

@Bioinformagician Okay, that makes sense!! Thanks so much :)

@rosaicelalunaramirez1284 2 years ago

Thank you for the great tutorials, they've helped a lot with my research. I am currently working with my own single-cell data that I obtained from 6 samples (3 controls and 3 experimental). I have tried your tutorial but I get stuck on the part where you include the ind, the individual identification. Cell Ranger only gives me the cell barcode followed by a -1, so I tried that and added the condition. It looked like this: CONTROL_ACCAACAGTGCATTAC-1. But when I use the aggregate expression function it gives me 12,972 columns, as if it were treating each cell as an individual sample. How can I perform your analysis without an identification number? Or how can I assign one? Thank you!!

@Bioinformagician 2 years ago

The goal is to aggregate counts at the sample level. In my case, each sample belongs to an individual, hence counts are aggregated to the ind level. In your case, you might not need ind information. You could simply add a 'sample' column to your metadata, merge all samples and aggregate counts to the sample level.
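The difference between grouping on a per-cell label and grouping on a sample column can be sketched as follows. This is plain Python rather than Seurat code, and the barcodes (apart from the one quoted in the question), sample names and counts are hypothetical.

```python
# Hypothetical cell metadata; sample names and counts are made up.
cells = [
    {"barcode": "ACCAACAGTGCATTAC-1", "condition": "CONTROL", "sample": "ctrl_1", "count": 4},
    {"barcode": "TTGCATGGTCAAGCGA-1", "condition": "CONTROL", "sample": "ctrl_1", "count": 1},
    {"barcode": "GGACATTCAGTCGATT-1", "condition": "CONTROL", "sample": "ctrl_2", "count": 2},
    {"barcode": "CATCGGGAGTTACGGG-1", "condition": "TREATED", "sample": "trt_1", "count": 5},
]

def aggregate(cells, key):
    """Sum counts over cells sharing the same value of key(cell)."""
    out = {}
    for cell in cells:
        out[key(cell)] = out.get(key(cell), 0) + cell["count"]
    return out

# Grouping on condition + barcode yields one column per cell.
per_cell = aggregate(cells, lambda c: f'{c["condition"]}_{c["barcode"]}')
# Grouping on the sample column yields one column per sample.
per_sample = aggregate(cells, lambda c: c["sample"])
print(len(per_cell), per_sample)
```

Because every barcode is unique, any grouping key that contains the barcode produces as many columns as there are cells, which matches the 12,972-column result described in the question.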

@urmom.com629 1 year ago

@Bioinformagician How do you "merge all samples"?

@wanisajad785 9 months ago

@Bioinformagician: Are you suggesting using raw counts (slot=counts) for un-integrated data and normalized counts (slot=data) for an integrated Seurat object?

@koushikponnanna831 25 days ago

Even with integrated data, the RNA assay with slot=counts can be used.

@blackmatti86 2 years ago

Your videos have been truly instrumental in helping me grasp the concept of bioinformatic data analysis, especially for single-cell RNA-seq. As far as I understand, scRNA-seq (or scATAC-seq) can be divided into droplet-based (e.g. 10X) and plate-based approaches, e.g. SMART-seq2. I have noticed that there seems to be a fair number of help guides and instructions for the former method but not so many for the latter. Is there a resource you know of that can guide a novice through a single-cell (or single-nucleus) RNA-seq experiment performed using a plate approach (e.g. single cells FACS-sorted into 384 WPs)? Thank you! xx

@Bioinformagician 2 years ago

To get an overall idea of the pipeline, check this out: www2.stat.duke.edu/~sayan/Sta613/2018/singlecellrnaseq-170131050320.pdf
This paper performs a comparative analysis of 10X and Smart-seq2: www.sciencedirect.com/science/article/pii/S1672022921000486#s0055
Seurat also provides a vignette on integrating multiple datasets across different technologies (which includes Smart-seq2): satijalab.org/seurat/archive/v3.1/integration.html
This can give you an idea of how these datasets are processed before integration. Hope this helps!

@blackmatti86 2 years ago

@Bioinformagician Thank you ❤️

@xiaosajackxu4242 1 year ago

If I have 4 conditions, how do I modify the code to find DEGs that are enriched/depleted in at least one condition?

@zahraabdi1613 2 years ago

It was great! Thanks so much ❤ What should I do if my Seurat object doesn't have an 'ind' column? I mean, each cell just has information about its cluster and condition but not about the individual.

@Bioinformagician 2 years ago

Can you tell me where you downloaded your data from?

@baymin4827 1 year ago

Your videos have been very helpful to me! What should I do if my Seurat object doesn't have an 'ind' column? I am analyzing my own dataset. Thanks in advance.

@maytelopez-cascales6113 1 year ago

Very nice tutorial. I have a question: how could I do a differential expression analysis contrasting counts that come from different experiments? I have already done the pseudobulk with the single-cell experiments, and I want to compare them with the counts from my RNA-seq. Could I make a matrix with the data coming from the two different techniques? Will you make a tutorial about that? Thanks.

@raghavsharma4347 1 year ago

You can add the counts from your RNA-seq as another sample, then adjust your contrasts so that the comparison is your RNA-seq data minus your single-cell datasets.

@张凯-z4w 2 years ago

good video! thank you sooooooo much!!!!

@abassohilebo2213 2 years ago

Thank you for the video. Can you organize a workshop?

@Bioinformagician 2 years ago

I haven't given any thought to organizing one yet. I shall think about it.

@abassohilebo2213 2 years ago

@Bioinformagician Please do. People tend to love workshops more, and it will double if not triple your subscribers.

@bigteeth5644 2 years ago

Hey there! First of all, I'd love to express my thanks to you! Your videos are helpful for our analysis. However, I ran into some problems trying to follow your tutorial. Our dataset is an aggregated snRNA-seq dataset from six samples. We performed doublet removal, SoupX, SCTransform normalization and integration. Some of the values in the 'RNA' assay are not integers. When I was searching for a solution, I read in the DESeq2 vignette that we should use un-normalized data. Do you have any suggestions on this issue? Thank you!

@Bioinformagician 2 years ago

Which slot in the 'RNA' assay are you referring to, i.e. the counts, data or scale.data slot? For the demonstration here, we used the 'counts' slot, which stores un-normalized raw counts, to aggregate across samples.

@khr1138 2 years ago

This is because SoupX turns the raw counts into non-integer numbers. Use the round() function on the counts you pass to DESeqDataSetFromMatrix!
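The rounding step suggested here can be sketched as below. This is a plain-Python illustration with invented values, not the R call itself; the gene names and the helper `to_integer_counts` are hypothetical.

```python
# Toy aggregated matrix with fractional values, as ambient-RNA correction can produce.
agg_counts = {
    "geneA": [10.3, 0.0, 7.8],
    "geneB": [2.2, 5.9, 0.4],
}

def to_integer_counts(matrix):
    """Round every entry to the nearest integer and check it is non-negative."""
    rounded = {}
    for gene, row in matrix.items():
        ints = [int(round(v)) for v in row]
        if any(v < 0 for v in ints):
            raise ValueError(f"negative count for {gene}")
        rounded[gene] = ints
    return rounded

int_counts = to_integer_counts(agg_counts)
print(int_counts)
```

Rounding after aggregation, rather than per cell, keeps more of the fractional signal, although which of the two was intended in the comment is not stated.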

@Iman_1987 2 years ago

Could you please demonstrate isoform analysis with Nanopore data? Thanks.

@thwoals456 2 years ago

Hello, thank you so much for your video! I have one question. I followed your single-cell tutorial video using my own single-cell data. However, there is no 'ind' column in my Seurat object. Could you tell me how to make that column? Additionally, I did scRNA-seq for one control sample and two treatment samples (3 in total). Is it possible to make an 'ind' column for the control sample? And can the ratio of control to treatment samples (1:2) affect the downstream analysis? Sorry for my many questions.

@Bioinformagician 2 years ago

The 'ind' column was already present in the dataset, I did not create it. Did you download the data the same way I did it in the tutorial? Are the two treatment samples replicates or separate samples?

@thwoals456 2 years ago

@Bioinformagician Instead of using the data in your video, I used my own scRNA-seq data for pseudo-bulk analysis. So I was asking how to make a column similar to the 'ind' column. And "the former" is my reply to your second question: I have two replicates of the treatment sample.

@Bioinformagician 2 years ago

@thwoals456 Oh, I get it now. So basically the "ind" column is nothing but information about the samples in my dataset (ind stands for individuals). If your dataset has sample information, you could use that column to aggregate your counts to the sample level.

@mischmuuu 2 years ago

Thank you for this great tutorial! Is it possible to do a pseudo-bulk DE analysis with only one single-cell sample per condition? How would the statistics work?

@akundiraghukiranvydhyanath9939 2 years ago

I'm afraid that won't be possible. DESeq2 requires at least 2 biological sample replicates. The alternative would be edgeR, but you have to supply a dispersion value.

@Bioinformagician 2 years ago

DESeq2 is not designed to work without replicates.
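A quick pre-flight check for the replicate requirement discussed in this thread might look like the following. It is a plain-Python sketch, the sample sheet is hypothetical, and the threshold of two samples per condition follows the reply above rather than any DESeq2 setting.

```python
from collections import Counter

# Hypothetical sample sheet: one entry per pseudo-bulk sample.
sample_conditions = {"s1": "control", "s2": "treated", "s3": "treated"}

def conditions_lacking_replicates(sample_conditions, min_reps=2):
    """Return conditions that have fewer than min_reps samples."""
    counts = Counter(sample_conditions.values())
    return sorted(c for c, n in counts.items() if n < min_reps)

print(conditions_lacking_replicates(sample_conditions))
```

An empty list means every condition has at least `min_reps` samples; anything else names the conditions that would leave within-group variability unestimable.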

@albanaisai3429 2 years ago

Hi there, great video. Do you know how to use Kallisto?

@SerorONG 2 years ago

Hey there, great tutorial! May I ask how you got so proficient with RegEx (regular expressions)? I feel that it's one of the few core skills that would help immensely and is highly transferable, especially during the initial stages of data processing. I just want to know if you could recommend any resources for learning RegEx.

@Bioinformagician 2 years ago

I first learnt regex when I learnt Perl. The more I kept using regex, the more it started to make sense. I use regexr (regexr.com/) often to practice and build my regex. Here are a few resources that could help you practice it more:
1. regexone.com/
2. regexlearn.com/
3. www.hackerrank.com/domains/regex
Hope this helps!
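As a small practice example in the spirit of this reply, the sketch below parses a cell label of the form quoted elsewhere in this thread (CONTROL_ACCAACAGTGCATTAC-1). It is written in Python rather than R or Perl, and the assumed label layout `<CONDITION>_<BARCODE>-<N>` is an illustration, not a fixed standard.

```python
import re

# Assumed label format "<CONDITION>_<BARCODE>-<N>"; adjust the pattern to your own data.
LABEL = re.compile(r"^(?P<condition>[A-Za-z]+)_(?P<barcode>[ACGT]+)-(?P<suffix>\d+)$")

def parse_label(label):
    """Split a cell label into its parts, or return None if it does not match."""
    m = LABEL.match(label)
    return m.groupdict() if m else None

print(parse_label("CONTROL_ACCAACAGTGCATTAC-1"))
print(parse_label("not a label"))
```

Named groups keep the pattern readable and make it obvious which piece is which when the label format changes.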

@bumpingbell 2 years ago

Hi, I am analyzing differentially expressed genes in a snRNA-seq dataset (GSE159812), for subsequent pathway analysis. Using FindMarkers, I get extremely small p-values for differentially expressed genes. However when I aggregate the counts by cell type & sample and perform pseudo-bulk analysis, less than 0.1% genes are significant (p

@Bioinformagician 2 years ago

FindMarkers tends to inflate significance (its p-values come out too small) because each cell is treated as a sample, even though cells within a sample are not truly independent of each other. In pseudo-bulk, by contrast, counts are aggregated at the sample level. The two methods will not give you the same differentially expressed genes, as single-cell methods tend to identify variation between cells while pseudo-bulking identifies variation among samples (between populations). Single-cell methods also tend to identify highly expressed genes as differentially expressed and exhibit low sensitivity for genes with low expression. Did you aggregate counts by samples only, or by both samples and cell types?
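The effect of treating cells as independent samples can be illustrated with a toy calculation. The sketch below uses a plain Welch t statistic in Python as a stand-in; it is not what FindMarkers or DESeq2 compute, and all expression values are invented and deterministic.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two groups of observations."""
    return (mean(b) - mean(a)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

def simulate_sample(offset, n_cells=50):
    """Deterministic toy cells: a sample-level offset plus small within-sample noise."""
    return [offset + ((i % 5) - 2) * 0.1 for i in range(n_cells)]

# Invented sample-level expression offsets; three samples per condition.
control = [simulate_sample(o) for o in (10, 12, 14)]
case = [simulate_sample(o) for o in (11, 13, 15)]

# Cell-level test: every cell counted as an independent observation.
t_cells = welch_t([x for s in control for x in s], [x for s in case for x in s])
# Sample-level test: one summary value per sample, as in pseudo-bulk.
t_samples = welch_t([mean(s) for s in control], [mean(s) for s in case])
print(round(t_cells, 2), round(t_samples, 2))
```

The group means are identical in both tests, but the cell-level statistic is several times larger because 150 correlated cells per condition are counted as 150 independent observations, while the sample-level test sees only three per condition.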

@bumpingbell 2 years ago

@Bioinformagician I meant that I aggregated counts across the cells to the sample level, and for each cell type I made comparisons between 8 case and 8 control samples. I think pseudo-bulk is closer to my expectations, but as mentioned, only a few genes have significant adjusted p-values with this method (which is surprising to me). This makes comparing single genes barely possible. If we use pathway analysis, where we can just input the log2FC of each gene from pseudo-bulk, we may not need to care about the statistical significance of each gene, but we still need to filter by adjusted p-values for the input to be valid. Am I right in this sense? We’re using Ingenuity Pathway Analysis. (Sorry if this is a bit off-topic in any way.)

@smachead 2 years ago

Hi there! I was wondering why you are using the normalised and scaled data to generate the aggregate counts - should we not use the raw data?

@Bioinformagician 2 years ago

I am using the "counts" slot, which stores raw counts, to generate the aggregate counts. cts

@NicholasJohnson-m9l 1 year ago

Aren't we not supposed to normalize first? DESeq requires raw read counts. Or is the counts slot raw?

@raghavsharma4347 1 year ago

When she aggregates counts, the function pulls data from the raw counts. Normalized counts are only used for the Seurat pipeline, but not used for differential expression analysis.