Integrate single-cell RNA-Seq datasets in R using Seurat (CCA) | Detailed Seurat Workflow Tutorial

51,025 views

Bioinformagician

Days ago

Comments: 107

@shrabantimazumder3039 1 year ago

Thank you so much for the excellent video. It is very helpful to understand, why and how to remove the batch effect.

@kitdordkhar4964 2 years ago

You are awesome! I can understand the group.by command better now. Thanks!

@Bioinformagician 2 years ago

That's awesome! I am glad this video was helpful! :)

@mostafamalmir3621 2 years ago

Your tutorials are very useful for me!!! Thank you a lot.

@mrbarakgut 10 months ago

You the best. The vid is really thoughtful and clear. Thanks!

@yz2652 2 years ago

Jesus, I am desperately struggling with integration. You saved my life. Your video is very clear and understandable. Thank you so much.

@Bioinformagician 2 years ago

Thank you, I am happy to hear my video was helpful!

@qbaseathens 2 years ago

@@Bioinformagician you are truly amazing

@mahimabose 2 years ago

Hi, this was indeed very useful. You are a lifesaver. Could you also make tutorials on pseudotime analysis and RNA velocity analysis packages like Monocle 3, Velocyto etc in the future? Thanks

@Bioinformagician 2 years ago

Thank you, I am happy to hear my video was informative! I have plans to make videos on the topics you mentioned, hopefully, will be able to post them soon. Thanks for the suggestion! :)

@nayande2151 2 years ago

@@Bioinformagician did you make a video on pseudotime analysis and RNA velocity determination from single-cell analysis?

@theresahutchins2035 1 year ago

Your videos are so amazing wowie!

@navyav.b8572 6 months ago

anchors

@dannyk938 5 months ago

Thank you for the walkthrough, your videos have helped me a lot so far as someone with no programming or data science background. I am trying to do a horizontal integration of a KO and a rescue sample - I made it through to the very end, but when I run the IntegrateData step I am getting the following error: "Error in .subscript.2ary(x, i, , drop = TRUE) : subscript out of bounds". I get that this error occurs when trying to access an index that doesn't exist, but I'm not sure what's causing that issue in my data. Any ideas on where to look would be much appreciated. Thanks!

@MsZhang666 1 year ago

Thank you very much again!!! So great

@chiranjitdas3959 1 year ago

Just a question, when you initially ran the basic Seurat pipeline for normalization, scaling, etc you merged the datasets from different patients and tissue types. But later while you run the integration steps, you split the data based on patients and again ran the normalization and variable feature steps. So is it necessary to run the second normalization and variable feature step before integration and if so why? (Since we had run those steps initially)

@明明-v1y 1 year ago

thanks! good course!

@mahamoussa5712 1 year ago

Actually, you are the best bioinformatician! Do you use your laptop for data integration or do you use a supercomputer? I cannot run this on my laptop.

@Bioinformagician 1 year ago

I performed the demo on my laptop. CCA can be very slow and computationally intensive. Try the rpca method; it runs significantly faster.
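The switch from CCA to reciprocal PCA suggested above might look roughly like the sketch below. It assumes the Seurat package with its v4-style integration functions, and a list of Seurat objects (hypothetically called `obj_list`) that have each already been through `NormalizeData()` and `FindVariableFeatures()`. The helper name `integrate_rpca` is made up for illustration.

```r
# Sketch only: assumes the Seurat package (v4-style API) is installed and that
# `obj_list` is a list of Seurat objects that have each been normalized and had
# variable features identified. `integrate_rpca` is a made-up helper name.
integrate_rpca <- function(obj_list) {
  # Genes that are repeatedly variable across the datasets
  features <- Seurat::SelectIntegrationFeatures(object.list = obj_list)

  # rpca needs a PCA on every object, computed on the shared features
  obj_list <- lapply(obj_list, function(obj) {
    obj <- Seurat::ScaleData(obj, features = features, verbose = FALSE)
    Seurat::RunPCA(obj, features = features, verbose = FALSE)
  })

  anchors <- Seurat::FindIntegrationAnchors(
    object.list = obj_list,
    anchor.features = features,
    reduction = "rpca"  # instead of the default "cca"
  )
  Seurat::IntegrateData(anchorset = anchors)
}
```

Compared with the CCA route, the extra work is the per-object `ScaleData()` and `RunPCA()` calls; everything after `IntegrateData()` would stay the same.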

@tahadinc1302 2 years ago

I love your tutorials. They helped me understand the fundamentals of single-cell RNA seq. You're a great teacher. Thank you!

@harishnarasimhan7367 2 years ago

These tutorials are great, looking forward to the next ones. Could the gzip function be used instead to unzip the files?

@Bioinformagician 2 years ago

These are two different compression methods and hence gunzip cannot be used to uncompress .zip files and vice-versa.
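For readers who would rather stay inside R than use a terminal, base R ships `untar()` for tar archives (including gzip-compressed ones) and `unzip()` for .zip files. The snippet below is a self-contained sketch: it builds a tiny `.tar.gz` in a temporary folder and extracts it again. All file and folder names are placeholders, not the files from the video.

```r
# Hypothetical demo: build a small .tar.gz in a temp dir, then extract it from R.
work_dir <- file.path(tempdir(), "untar_demo")
dir.create(file.path(work_dir, "sample1"), recursive = TRUE, showWarnings = FALSE)
writeLines("gene\tcount", file.path(work_dir, "sample1", "features.tsv"))

# Create the archive from inside work_dir so paths in the archive stay relative
old_wd <- setwd(work_dir)
tar("demo.tar.gz", files = "sample1", compression = "gzip", tar = "internal")
setwd(old_wd)

# Extract into a separate folder; untar() detects the gzip compression itself
out_dir <- file.path(work_dir, "extracted")
untar(file.path(work_dir, "demo.tar.gz"), exdir = out_dir, tar = "internal")
extracted <- list.files(out_dir, recursive = TRUE)
print(extracted)
```

For a real download, the same `untar()` call would be pointed at the archive path and a chosen `exdir`; a .zip file would go through `unzip()` instead.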

@kimayatekade5267 1 month ago

Hey, great video thanks! Can one use SelectIntegrationFeatures on log normalized data? I thought it is only for SCTransformed data. Please correct me if I am wrong :)

@priyaamadhukaran4745 1 year ago

Hi, very helpful video. I am a beginner and would like you to do a detailed video on where the downloaded file is supposed to be saved, how to open it in Seurat, etc. At 5:52 in the video you use a screen; what is it, and how did you get it (data bash 80x24)? It is used to convert the tar file to a normal file without tar. I see you bringing up the screen as you work on Seurat.

@张凯-z4w 2 years ago

Your tutorials are great! I have a question. I wanna find the differences between primary tumor and metastatic tumor using scrna data, do I need to integrate these two datasets? Thank you!

@Bioinformagician 2 years ago

The goal of integration is to align cell types from one condition/tumor type with the same cell types in another condition/tumor type. This can aid in cell type identification and comparison across specific cell types across conditions/tumor types. If that is what you are hoping to do, then yes, you should integrate your data.

@mdnaveedkn 8 months ago

Hi, please make a video on vertical integration scRNA-seq and ScATAC-seq from same cell❤

@kowshicroy1418 11 months ago

Thank you so much

@riyarakshit1166 6 months ago

Hi... I am using R v4.3.3 and Seurat v5 and was trying to do an analysis of scRNA-seq data, but the commands for unzipping the .tar.gz files are not working on my laptop. I find your way of explaining very easy to understand. Please help.

@seakayaker20 1 year ago

Really nice tutorial. Admittedly I was lost between 24:33-25:19 when you say 'clearly see'. I've played the section back many times and still don't follow. I'd love to see a more detailed explanation for this. Many thanks and keep up the great work!

@hathormaat8078 2 years ago

Thanks for the amazing tutorial. However I have one question: when running: seurat.integrated

@Bioinformagician 2 years ago

How large are the datasets you are trying to integrate and how much memory are you using? Also, are you using CCA method to integrate? If yes, try 'rpca', it is computationally less intensive. Also check out this thread: github.com/satijalab/seurat/issues/1355

@iheartmcyrus 2 years ago

you are a life saviour!! could u do tutorials on how to run integration via harmony and pseudotime analysis in the future please

@Bioinformagician 2 years ago

Those are definitely in the pipeline, will put out videos on these topics soon :)

@naVn1111 3 months ago

I could not find the link to the video for QC, could you please put that in description. Thanks.

@xiaosajackxu4242 2 years ago

Great Job! I have a quick question: Let's say we integrate single-cell datasets "object_A" and "object_B" into "object_AB". In the integrated "object_AB", we have the 10 clusters with cluster labels as "AB-1, AB-2, AB-3.....AB-10". If I want to transfer these clusters labels to a UMAP projection in the original "object_A" based on corresponding cells' names (or barcode IDs), what kind of code can I use? Note that the cell names (or barcode IDs) of object_A did not change in "object_AB". Thanks!

@Bioinformagician 2 years ago

Create a separate data.frame with cell barcodes and corresponding cluster labels from the integrated object_AB like this - cell_cluster_mapping
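The barcode-to-cluster lookup described in this reply can be illustrated in base R. This is only a sketch with toy data: the barcodes, cluster labels, and the `barcodes_A` vector are invented stand-ins for what would come out of `object_AB` and `object_A`.

```r
# Toy stand-in for the integrated object's cells and their cluster labels.
# With real Seurat objects, the barcodes would be the cell names of object_AB
# and the labels its cluster identities; everything here is hypothetical.
cell_cluster_mapping <- data.frame(
  barcode = c("A_AAAC", "A_CCGT", "B_GGTA", "A_TTAG"),
  cluster = c("AB-1", "AB-2", "AB-1", "AB-3"),
  stringsAsFactors = FALSE
)

# Cell names of the original object_A, in that object's own order
barcodes_A <- c("A_TTAG", "A_AAAC", "A_CCGT")

# Look up each object_A cell in the mapping; unmatched cells would get NA
idx <- match(barcodes_A, cell_cluster_mapping$barcode)
labels_for_A <- cell_cluster_mapping$cluster[idx]

print(data.frame(barcode = barcodes_A, integrated_cluster = labels_for_A))
```

The resulting vector is ordered like the cells of `object_A`, so it could then be stored as a new metadata column on that object and used as the grouping variable when plotting its UMAP.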

@chintanbhavsar5681 2 years ago

I'm trying to see if it is possible to use seurat for proteomics data. By using this seurat object, I plan to use cell - cell communication pipelines like NATMI, LIANA or CellCall for analysing my proteomics dataset. Any insight in this would be very helpful as I'm just getting started.

@Bioinformagician 2 years ago

Unfortunately, my experience with proteomics is very limited and I do not want to mislead you by giving suggestions that I am not confident about. Perhaps digging up some papers for proteomics data can be resourceful.

@juliabalewska6846 8 months ago

you are a hero, explaining things in a very clear way. big thanks!

@joshuagrant4569 2 years ago

Really useful tutorial, thank you! By any chance do you know if it is possible to merge two h5 files to run this analysis on the merged matrix?

@Bioinformagician 2 years ago

You could read each h5 run into a Seurat object and then merge two Seurat objects.
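A minimal sketch of that suggestion, assuming Seurat (and the hdf5r package it relies on for .h5 files) is installed. The helper name `merge_two_h5`, its arguments, and the sample IDs are placeholders rather than anything shown in the video.

```r
# Sketch only: assumes Seurat plus hdf5r are installed. File paths and the
# helper name are placeholders.
merge_two_h5 <- function(h5_a, h5_b, ids = c("A", "B")) {
  # Read10X_h5() returns a sparse count matrix (or a list of matrices for
  # multi-modal runs, in which case the gene expression element is needed)
  counts_a <- Seurat::Read10X_h5(h5_a)
  counts_b <- Seurat::Read10X_h5(h5_b)

  obj_a <- Seurat::CreateSeuratObject(counts = counts_a, project = ids[1])
  obj_b <- Seurat::CreateSeuratObject(counts = counts_b, project = ids[2])

  # add.cell.ids prefixes barcodes so cells from the two runs stay distinguishable
  merge(obj_a, y = obj_b, add.cell.ids = ids)
}
```

Setting `project` per object also fills `orig.ident` with the sample ID, which makes it easy to color plots by sample after merging.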

@tahadinc1302 2 years ago

I see that the ram usage on your Rstudio is pretty low. How do you keep the ram usage that low although you have all those data structures in the environment?

@Bioinformagician 2 years ago

I think that's because I am not running memory intensive processes on all those data structures at once. I am sure my RAM usage must be going up when I am running memory intensive Seurat functions. It must be coming back down when I am wrangling or just visualizing my data.

@tahadinc1302 2 years ago

@@Bioinformagician Thank you for letting me know and I am looking forward to future episodes!

@jaskaransingh2813 7 months ago

Hi, Great Tutorial....I have one fundamental question: If individual Seurat objects that are merged have already been through the Normalization, find variable, scaling, and Run PCA... Do we have to again run these parameters after merging (The ones that we run before they are integrated)?

@surinderpal9498 1 year ago

Hello Ma'am, I am facing this error from last 4 days, can't resolve it, please help me to solve it, Thanks.. > merged_seurat_filtered

@germanovicente4616 1 year ago

Hi! I have performed quality control individually for each data set in my analysis, but when I try and merge the seurat objects, I get an error saying: Error in `.rowNamesDF

@ahmedadelelbaz1694 1 year ago

Is this different than RUNCCA ?

@purplepandaoverlord7780 1 year ago

Hi. Non-computational, struggling lab person here. I am trying to do a snRNAseq analysis but I am using just one sample. How can I make a Seurat object from a Seurat list again so I can skip the Integration step (which makes a new Seurat object by default). I need to use the object rather than a list for subsequent steps and I can't find anywhere online an answer to this. Thank you!

@alpr1864 2 years ago

Hey! Thank you for this informative video! I have a question. The goal of these steps is to integrate/merge multiple datasets into one unified Seurat object. After performing these steps, I guess in order to proceed with the standard workflow of single-cell RNA sequencing, I need to normalize the "seurat.integrated" via the function of NormalizeData() in Seurat package, and then find the "Variable gene" via the function of the FindVariableFeatures() in Seurat package. I think that after scaling, dim reduction, clustering, and identifying the cluster name, I am ready to present a UMAP which represents the cells of those samples. Am I right or not?? Thank you in advance!

@Bioinformagician 2 years ago

Yes, the approach seems sensible, merge the datasets first, visualize and determine whether integration is really required. Also, make sure there is no unwanted biological variation like cell cycle effects. If you do find such unwanted variation, then you will have to regress it out. Check this article out which explains how to check for it and regress out the variation - github.com/hbctraining/scRNA-seq_online/blob/master/lessons/06_SC_SCT_normalization.md Once data is integrated, then the standard workflow steps you mentioned above make sense.

@alpr1864 2 years ago

@@Bioinformagician Thanks for your response. I followed the standard workflow. However, RStudio refuses to normalize the data after the integrated Seurat object has been created. P.S. I think that we had already normalized our data during the creation of the integrated Seurat object. I guess that is the reason for R's error; however, I am not sure!

@shubhamoyghosh6005 2 years ago

Hi It was very useful. Wondering whether scanpy has similar methods for integration.

@Bioinformagician 2 years ago

I am sure there must be...

@meetukaur0909 1 year ago

I have made the Seurat objects just like you said, but on doing the merged_seurat process i am getting an error. It says the said seurat object is not found. IDK why

@maanasss 11 months ago

Hey BioinforMAGICIAN.. the tutorials are truly amazing and extremely easy to understand for a novice like me. I am a dentist learning to conduct scRNAseq on dental tissues. And strictly following the steps performed here in the videos. I had a query regarding merging datasets. I have merged 3 datasets from 3 different patients; but when I view the metadata - the orig.ident does not show whether that row is from patient 1, 2 or 3... all of them show "SeuratProject". I am unable to detect which rows are from which sample; and so, I cannot have different colors in the UMAPs. Can you please let me know how can I address this, Thanks in advance.

@cats_like_felix 1 year ago

Hi, thanks so much for the videos. Can I ask please, I'm trying to merge datasets where one dataset is missing a prefix on the rownames that was added to separate features between samples from different species run at the same time. Is there a way to add or remove a prefix from all the rownames or features of one dataset? Thanks again!

@kuldeepmakwana7242 1 year ago

Hi! I have a slightly different question to ask. How can I create an AnnData object file from a Seurat object to then run RNA velocity estimation?

@sreejas1302 1 year ago

Hi, after integrating the dataset by CCA analysis how we can extract the correlation coefficients of the integrated dataset?

@treponema6977 2 years ago

Thank you for making these tutorials, they are very helpful. Can you provide information about the computational resources that you used for this dataset? Thanks in advance.

@Bioinformagician 2 years ago

I have mentioned the software/tools/packages that have been used to perform this analysis in the video. Hardware wise, I have a MacBook Pro with Apple M1 pro chip and 16 gigs of RAM.

@treponema6977 2 years ago

@@Bioinformagician I have 16 gigs of ram too but I couldn't finish the tutorial cuz ram issues

@Bioinformagician 2 years ago

@@treponema6977 Are you using the same dataset?

@treponema6977 2 years ago

@@Bioinformagician yes exactly the same data set, running on Ubuntu 22.04 R version 4.2.1 R Studio 2022.07.1 Build 554, idk what is causing the high use of Ram finally I had to create a 32gigs swapfile to finish the tutorial

@Bioinformagician 2 years ago

@@treponema6977 Wow, that's strange! I cannot think of anything that could be causing you memory issue if you have the exact same config.

@abdullahugurlu2622 1 year ago

Didn't we already use filtered data in the beginning? Why did we do QC and filtering again?

@MrQiushenfeng 2 years ago

The first for loop, i am seeing "Error in url(description = uri) : URL scheme unsupported by this method"

@Bioinformagician 2 years ago

Can you send me the command you are trying to run? Also, you are sure the paths to matrix, feature and barcode files you are providing are correct?

@Bioinformagician 2 years ago

Apparently, another user encountered the same issue. The user could solve the issue - quoting the user (@Alp R): Solved! The problem was from the new version of R. For windows users, you can install R version 4.0.5 (2021-03-31). For more info: github.com/satijalab/seurat/issues/5687 Hope this helps you as well!

@johnreddy1817 2 years ago

Changing R version to 4.0.5 didn't work. You can also use Read10X function to solve the above issue. for(x in dirs) { name

@zkzhang4131 10 months ago

Thank you!

@josyulavijaysai2223 1 year ago

Hi, I really like the information and thanks a lot. I was wondering if there is a way I can perform differential expression between the samples in each cluster rather than between the clusters?

@lisaszmolyan4381 1 year ago

Wow thank you so much, this is exactly what i looked for and so clearly explained! Keep up the great work! Thanks a lot :)

@amitrupani9898 2 years ago

There is always something new to learn from your videos. Keep it up and coming. :) Cheers!

@amitrupani9898 2 years ago

I didn't really understand the need for "re" normalizing samples here 26:36 (we already performed normalization once here 21:00). Just curious. Also, you should be able to see the plots side by side (or up/down) just by using pipe operator for instance, p1 | p2 (side by side ) p1/p2 (up and down).

@Bioinformagician 2 years ago

@@amitrupani9898 As far as my understanding goes, NormalizeData() works off of the counts slot and overwrites the data slot. After splitting the objects based on patients, it should not affect the normalized counts, as normalization depends on library size and not number of samples (in our case cells). Hence, you are right, we would not require a second normalization after splitting. However, my thought process behind doing it this way is - 1. This is a good practice, before integrating your data, ensuring you have normalization performed on each object separately. Let’s say you read in objects separately, perform QC and filtering steps, and performed integration without merging and performing normalization as a part of standard workflow steps (when you know for sure your data has batch effects or have data from different conditions or modalities you want to integrate). Then it is important to normalize and find variable features for each object individually. 2. Running log normalization twice or a couple of times BEFORE INTEGRATION (I want to emphasize this point) will not necessarily change your normalization values, it would simply use raw counts and overwrite data slot over and over again. Long story short, was it necessary to “re-normalize” after we have performed normalization of merged seurat object? - No. However, I wanted this code to reflect the best practice for integrating data and be applicable to scenarios where a prior normalization may not be performed as in our case. Thank you for pointing this out to me and also for showing me how pipe operator and forward slash can be used to arrange plots. This is so cool, I am definitely using this henceforward!
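The point that log-normalization depends only on each cell's own library size can be checked with a toy example in base R. The sketch assumes the usual log-normalization recipe (counts divided by the cell's total, multiplied by a scale factor of 10,000, then `log1p`); the matrix and patient labels are invented.

```r
# Log-normalization written out in base R: per cell, divide by the cell's
# total counts, multiply by a scale factor, then take log1p.
log_normalize <- function(counts, scale_factor = 10000) {
  log1p(sweep(counts, 2, colSums(counts), "/") * scale_factor)
}

# Toy count matrix: 3 genes x 4 cells, cells 1-2 from patient P1, 3-4 from P2
counts <- matrix(
  c(5, 0, 5,
    1, 1, 8,
    0, 3, 7,
    2, 2, 6),
  nrow = 3,
  dimnames = list(c("g1", "g2", "g3"), c("c1", "c2", "c3", "c4"))
)
patient <- c("P1", "P1", "P2", "P2")

# Route 1: normalize the merged matrix, then take patient P1's cells
merged_then_split <- log_normalize(counts)[, patient == "P1"]

# Route 2: split first, then normalize patient P1 on its own
split_then_normalized <- log_normalize(counts[, patient == "P1"])

# Both routes give the same values because each column is scaled independently
print(all.equal(merged_then_split, split_then_normalized))
```

Because no quantity is shared across cells, the order of merging, splitting, and log-normalizing does not change the normalized values, which is why the second normalization in the video is harmless.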

@ljing65 1 year ago

After I run the create Seurat object code, I get this: "Warning: path[1]="GSE180665_RAW/HB17_background_filtered_feature_bc_matrix/matrix.mtx.gz": No such file or directory Error: Cannot find expression matrix at GSE180665_RAW/HB17_background_filtered_feature_bc_matrix/matrix.mtx.gz" Any idea? Thank you very much.

@ljing65 1 year ago

solved.

@saafvaaf3286 10 months ago

Thank you for uploading these, they are very useful.

@PranavKatragadda-w4r 2 months ago

this is so good, i was paralyzed and stood up to turn it up

@manjushagovindh4527 1 year ago

Hi, I have a doubt, for single-cell RNA data taken from GEO there will be 3 raw data (count matrix, barcodes, and gene expression ) so should we take all 3 data or only the count matrix? or load all 3 raw data into R and do the analysis??

@Bioinformagician 1 year ago

I had previously created a video that would answer your question: kzbin.info/www/bejne/aanGhaOnht-IrbM

@kylereese6463 1 year ago

Hi, your videos are incredibly helpful for my ugrad research. At 18:05, you use the function PercentageFeatureSet with the pattern set to '^MT-'. I looked through the data that we're using in the video, and I couldn't find any kind of variable with the substring 'MT' in it. Where exactly is the regex expression pulling that pattern from? Thank you! Additionally, do you know of a way to get the gene expression values for each ensembl gene for each sample in this example?

@Bioinformagician 1 year ago

The PercentageFeatureSet function calculates the percentage of all counts belonging to a subset of features (i.e. genes). So here we are calculating the percentage of counts corresponding to mitochondrial genes, whose names start with MT-. I have explained these single-cell RNA-Seq basics in this video: kzbin.info/www/bejne/a3mlq5qpr52kr80
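What the pattern does can be reproduced by hand in base R. The sketch below uses an invented count matrix; the gene names are examples only, and the calculation is written out manually rather than calling Seurat.

```r
# Toy counts: genes in rows, cells in columns (gene names are examples only)
counts <- matrix(
  c(10, 5, 80, 5,
     0, 20, 60, 20),
  nrow = 4,
  dimnames = list(c("MT-CO1", "MT-ND1", "ACTB", "GAPDH"), c("cell1", "cell2"))
)

# The pattern is matched against the gene (row) names, not against metadata
mito_genes <- grep("^MT-", rownames(counts), value = TRUE)

# Share of each cell's counts that comes from those genes, in percent
percent_mt <- colSums(counts[mito_genes, , drop = FALSE]) / colSums(counts) * 100
print(percent_mt)
```

If the row names are Ensembl IDs, or the data are from mouse (where mitochondrial gene symbols start with lowercase `mt-`), `^MT-` would match nothing, so it is worth checking `mito_genes` before trusting the percentage.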

@tushardhyani3931 2 years ago

Thank you for this video !!

@arianescajeda639 1 year ago

Great videos. I am having this error: "Error in validityMethod(as(object, superClass)) : object 'CsparseMatrix_validate' not found" each time I try running: marged_seurat.s

@Bioinformagician 1 year ago

It seems the issue is stemming from "Matrix" package. Can you try to re-install or update the package and see if you still get the error?

@arianescajeda639 1 year ago

@@Bioinformagician You are so nice for answering. I uninstalled and reinstalled it but it did not work. Is it possible to skip this part or use another method?

@kritisen 2 years ago

Wow this was a superb tutorial! Many thanks for putting this together

@bigteeth5644 2 years ago

Thank you so much for putting together all these tutorials! They are super helpful! I used to use the findIntegrationAnchors method to integrate data until I got some questions about some of the downstream analysis. Some bioinformaticians suggested me to use SCTransform to normalize data and Harmony to integrate data. Do you have any comments on this? Thank you!

@Bioinformagician 2 years ago

SCTransform performs more effective normalization and effectively removes technical effects from the data. SCTransform replaces NormalizeData(), ScaleData(), and FindVariableFeatures(), so I would recommend using it over standard log-normalization. In terms of choosing an integration method, I don't have a strong opinion. I guess, if I need batch-corrected expression values to be returned I would choose CCA (more computationally intensive), and if not then I might go with Harmony.

@faisalaziz8411 6 months ago

Great work.

@anamikapandey4769 2 years ago

Thank you for this video. I have one question: if there is no tar.gz file under the provided GEO accession number, how should I start? I am quite puzzled by this; the files provided under the accession number are peak tables. My aim is to identify the expression of a particular gene in a particular cell. Please suggest how to proceed. Thank you.

@Bioinformagician 2 years ago

Can you confirm the data you are looking at is a RNA-Seq dataset?

@anamikapandey3613 1 year ago

@@Bioinformagician yes it is RNA seq dataset ma'am

@alyaahessin7784 2 years ago

Thank you so much for such useful tutorials. Every time I download the files from GEO, they do not show up directly in R like yours do. Could you advise how I can transfer them from Downloads to RStudio so I can follow the tutorial with you?

@Bioinformagician 2 years ago

After downloading files from GEO, I load the data into R using commands to read the files in. What commands are you using to read your files into R?

@alyaahessin7784 2 years ago

@@Bioinformagician Thank you for replying, would you please share with me the command you used to read files in?

@chrisdoan3210 2 years ago

Hi @Bioinformagician. I have 2 data from a healthy and a diseased person and I would like to compare 2 data sets and see differently regulated genes. Could I use integrate workflow? Thank you so much!

@Bioinformagician 2 years ago

Yes, you can.

@chrisdoan3210 2 years ago

@@Bioinformagician This advice made me confused: "Integration is more complicated where it is attempting to find cells with similar expression profiles and uses them as anchors, but it is only appropriate in certain situations. Merging is just putting 2 data sets in the same Seurat object, so is a lot simpler." What do you think about this?

@ziqifu2232 2 years ago

fantastic introduction!

@ravimore5786 2 years ago

I like your ScRNA-Seq session to explain the basics and logic behind each and every step in detail. This is really helpful for the beginner in this field. Thank you very much for educating the Bioinformatics community with your expertise.

@dotheneedful55 1 year ago

Thank you so much for this information. 5:25 , I am having trouble with the ReadMtx command. I continue to receive an Error: Cannot find expression matrix at ....Rproj.usermatrix.mtx.gz. I've tried a variety of solutions. Do you have any hints?

@dotheneedful55 1 year ago

I figured it out. I simply had the wrong working directory
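A wrong working directory is a common cause of "Cannot find expression matrix" errors, since relative paths are resolved against `getwd()`. A small base R check like the sketch below can confirm the files are where R is looking before any read function is called. The helper name `check_10x_dir` is hypothetical, and the three file names are assumed to follow the usual 10x naming.

```r
# Hypothetical helper: report which of the three 10x files are missing from a
# sample directory, resolved against the current working directory.
check_10x_dir <- function(sample_dir,
                          files = c("matrix.mtx.gz", "features.tsv.gz", "barcodes.tsv.gz")) {
  paths <- file.path(sample_dir, files)
  missing <- files[!file.exists(paths)]
  if (length(missing) > 0) {
    message("Working directory: ", getwd())
    message("Missing in ", sample_dir, ": ", paste(missing, collapse = ", "))
  }
  invisible(missing)
}

# Demo with a temporary directory that holds only one of the expected files
demo_dir <- file.path(tempdir(), "sample_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, "matrix.mtx.gz"))
missing_files <- check_10x_dir(demo_dir)
```

If files are reported missing, either move to the right folder with `setwd()` or pass full paths to the read function.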