Gene set enrichment analysis in R

28,345 views

In this workshop, we introduce gene set analysis for RNA-sequencing data. We cover:
- Broad Molecular Signatures Database (MSigDB) gene sets
- hypergeometric enrichment with clusterProfiler
- gene set enrichment analysis (GSEA) with fgsea
Materials
github.com/hawn-lab/workshops...

Comments: 18

@azure-hawk 2 years ago

Great video! I learned about msigdbr and the tidyr::separate function. I just want to mention a few things.
1. The GSEA ranking metric doesn’t have to be fold-change. I use the gene-wise average moderated t-statistic from limma or the signed -log10-transformed p-value. There are a ton of ranking metrics to choose from. Both of these are very similar, and we can compare their density plots to get an idea of how they would alter the GSEA results.
2. Over-representation analysis is not great as a follow-up to differential analysis because of the arbitrary significance threshold that you mentioned and the fact that there may be duplicates at the gene level. Also, we lose information about the direction of change, since ORA only tells us which sets are more present in the significant group than what we expect by chance. However, it is great when genes uniquely map to discrete clusters, so it is good as a follow-up to WGCNA or K-means clustering.
3. The figures you use to introduce GSEA show the phenotype permutation approach, but most R implementations (including fgsea) use the gene permutation approach, which is much faster but has a slightly different interpretation.
4. For ORA, it may be useful to plot the ratio of the number of significant genes in the gene sets to the total number of significant genes along the x-axis and change the bars to points scaled according to the -log10(adjusted p-value). Gene sets that include all significant genes (ratio of 1) may be interesting to look at, even if their adjusted p-values are hovering near 0.05.
5. The fora function in fgsea can be used for ORA as well. Personally, I find it easier than dealing with the bulkier clusterProfiler results objects.
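To make the ranking metric from point 1 concrete, here is a minimal base R sketch of the signed -log10 p-value. The data frame `de`, its gene names, and all values are made up for illustration. The column names only imitate the style of limma's `topTable` output, so adapt them to whatever your model returns.

```r
# Hypothetical differential expression results (values invented for illustration)
de <- data.frame(
  gene    = c("GENE_A", "GENE_B", "GENE_C", "GENE_D"),
  logFC   = c(2.1, -1.5, 0.3, -0.2),
  P.Value = c(1e-5, 2e-4, 0.43, 0.62)
)

# Signed -log10 p-value: magnitude comes from significance,
# sign comes from the direction of change
de$signed_logp <- sign(de$logFC) * -log10(de$P.Value)

# Named numeric vector sorted from most up- to most down-regulated.
# A named vector of gene-level statistics is the form that
# fgsea's `stats` argument takes.
ranks <- sort(setNames(de$signed_logp, de$gene), decreasing = TRUE)
print(ranks)
```

Swapping `signed_logp` for a moderated t-statistic column would give the other metric mentioned above, and plotting the densities of the two vectors is one way to compare them.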

@CorruptedSon 1 year ago

Thank you! I spent a while trying to figure out how to do pathway analysis in R, and most guides expect you to already have some sort of GO or KEGG library to refer to. They don't go into the specifics of how these libraries work or what to do when they do not work. This step-by-step guide was enough to get me from DEG lists into proper pathway analysis, and I even understood why and what I am doing in each step! I am working with rat sequencing data and some columns I had were very different from the example data you had here, but after checking specific points a few times I managed to filter and re-format all the necessary information from my data.

@Stop-and-listen 1 year ago

I really enjoyed your presentation. I learned quite a bit. Thank you!

@tonkatsuburger3531 2 years ago

Thank you so much this was so helpful!

@jajaja20703 2 years ago

Very clear explanation, thanks for this amazing content! Would you have any additional bio-inf analysis tutorials?

@kdillmcfarland 2 years ago

Thanks! I have other R workshop videos kzbin.info/aero/PL_Oo8UFoIb007lGeg78awOu44Ido35zsY with materials for those and other workshops that don't have videos at github.com/BIGslu/workshops and github.com/hawn-lab/workshops_UW_Seattle

@hamidnikbakht1295 2 years ago

Thank you for the very clear explanations. One question: for enrichment analysis (either simple enrichment or GSEA), what type of count normalization should one use? Or does it even matter? If so, how would it be different between the two methods? Thank you!

@kdillmcfarland 2 years ago

For RNA-seq GSEA, we use fold changes calculated from TMM-normalized log2 counts per million (see the limma package tutorial) or estimates output by whatever linear model we ran. In essence, whatever data normalization needs to be done for stats should also be done before calculating fold changes for GSEA. For simple enrichment, it's similar. Treat the data however is best for statistical tests. Then find significant genes from those tests and input those gene lists into enrichment.
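As a rough illustration of the simple enrichment step, the test behind it for a single gene set reduces to a hypergeometric tail probability, which base R provides as `phyper`. The gene universe, gene set, and significant gene list below are invented for the example. This is a sketch of the underlying calculation, not the exact clusterProfiler code path.

```r
# Hypothetical inputs (identifiers invented for illustration)
universe  <- paste0("GENE_", 1:1000)                # all genes that were tested
gene_set  <- universe[1:50]                         # one gene set of 50 genes
sig_genes <- c(universe[1:10], universe[501:540])   # 50 significant genes

k <- length(intersect(sig_genes, gene_set))  # significant genes in the set
m <- length(gene_set)                        # genes in the set
n <- length(universe) - m                    # genes not in the set
N <- length(sig_genes)                       # number of significant genes

# Probability of seeing k or more set members among N significant genes
# if significant genes were drawn at random from the universe
p_value <- phyper(k - 1, m, n, N, lower.tail = FALSE)
print(p_value)
```

In practice this is repeated for every gene set and the p-values are corrected for multiple testing, for example with `p.adjust(p_values, method = "BH")`.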

@yt.abhibhav 1 year ago

Thank you! I was just wondering which paper to cite when performing the hypergeometric "simple" enrichment?

@joeyoviedo5202 10 months ago

Hi, great video and clarification of the types of enrichment analyses. I have a question: what is the best way to create a ranked list of genes for 3 treatment and 3 control samples in one data frame using just normalized read counts? I want to rank all genes, not just DEGs, and then do enrichment analysis. Thank you!

@vaibhavsunkaria7291 2 years ago

Hi, how can we do GSEA for DNA methylation genes? I have beta values of samples and a logFC cutoff of the same. Thank you.

@jessehines4044 1 year ago

I'm new to this and I'm wondering why you need to see how much your significant genes overlap with a larger or other gene set. Is that to elucidate what transcriptional regulation network controls the significant genes and/or to discover other genes similar to the genes of interest?

@naveedkhan-fi6ux 1 year ago

Can we do gene set enrichment analysis for rice using the same code and databases?

@pragatigupta8999 2 years ago

Hello, how can we add gene count and p-value to the same histogram using the clusterProfiler R package?

@kdillmcfarland 2 years ago

Do you mean the FDR by count histograms around 35 min? You can add the total number of genes (count) to the top of each bar in a histogram with stat_bin(geom="text", aes(label=..count..)). To plot the p-value, I would make a new plot with x=Pval instead of x=FDR.
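A small self-contained sketch of the labelling trick above, assuming ggplot2 is installed. The data frame `res` and its `FDR` values are randomly generated stand-ins for real enrichment results. Recent ggplot2 releases spell `..count..` as `after_stat(count)`, which is what the sketch uses.

```r
library(ggplot2)

# Hypothetical enrichment results: one FDR value per gene set (random data)
set.seed(1)
res <- data.frame(FDR = runif(200))

p <- ggplot(res, aes(x = FDR)) +
  geom_histogram(bins = 20) +
  # Same binning as the histogram layer, drawn as text above each bar;
  # after_stat(count) is the newer spelling of ..count..
  stat_bin(geom = "text", bins = 20,
           aes(label = after_stat(count)), vjust = -0.5)

print(p)
```

Giving both layers the same `bins` value keeps the labels aligned with the bars. For the p-value version, the same code should work with a p-value column mapped to `x` instead of `FDR`.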