Pangenomics: a comparative genomics approach

Views: 7,363

Comments: 48

@mmars4eva 4 years ago

Would k-mer frequencies be expected to be consistent across a pan-genome or can individual genomes within a pan-genome be differentiated with k-mer frequencies? (Within a species)

@merenbey 4 years ago

Depending on the genomes included in a pangenome, the underlying k-mer frequencies across genomes may be identical or variable. Both are quite possible, and in some cases the differences may be completely invisible to the pangenome even if individual genomes have vastly different k-mer frequencies. Why? Well, since gene clusters in pangenomes are typically calculated from amino acid sequences, differences in DNA sequences (especially those that influence k-mer frequencies but yield identical amino acid sequences) may go completely unnoticed by the pangenome :) I hope this clarifies.
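
To make the point above concrete, here is a minimal Python sketch. The two sequences and the tiny codon table are made up for illustration (they are not from the seminar or from anvi'o); they only show how two genes can translate to the same protein while having different k-mer counts.

```python
from collections import Counter

# Toy codon table covering only the codons used below (an assumption for
# this example; a real analysis would use a full genetic code table).
CODON_TO_AA = {
    "ATG": "M",
    "GCT": "A", "GCG": "A",
    "CTT": "L", "CTG": "L",
    "AAA": "K", "AAG": "K",
}


def translate(dna):
    """Translate a DNA string codon by codon using the toy table above."""
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna), 3))


def kmer_counts(dna, k=4):
    """Count every overlapping k-mer in a DNA string."""
    return Counter(dna[i:i + k] for i in range(len(dna) - k + 1))


# Two hypothetical genes that differ only at synonymous codon positions.
gene_a = "ATGGCTCTTAAA"
gene_b = "ATGGCGCTGAAG"

same_protein = translate(gene_a) == translate(gene_b)
same_kmers = kmer_counts(gene_a) == kmer_counts(gene_b)
print("identical protein:", same_protein)
print("identical 4-mer counts:", same_kmers)
```

A gene clustering step that only sees the translated sequences would treat these two genes as identical, whatever their k-mer profiles look like.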

@MerenLab 4 years ago

[Zoom question paraphrased from Fabrizio]: What algorithm is used to cluster genes/what does MCL stand for?

@MerenLab 4 years ago

[Zoom answer paraphrased from Iva]: MCL is used to cluster genes. MCL stands for Markov Cluster Algorithm and it is a clustering strategy for graphs.
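
For readers who want to see the idea behind MCL, below is a rough pure-Python sketch of its two alternating steps (expansion and inflation) on a small similarity graph. It is not the `mcl` program itself and not what anvi'o runs internally; the self-loops, the inflation value of 2.0, the 1e-3 cutoff for reading clusters, and the toy similarity matrix are all assumptions made for this example.

```python
def normalize_columns(matrix):
    """Scale each column so it sums to 1 (column-stochastic matrix)."""
    n = len(matrix)
    col_sums = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
    return [[matrix[r][c] / col_sums[c] for c in range(n)] for r in range(n)]


def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(similarity, inflation=2.0, iterations=50, tol=1e-6):
    """Cluster the nodes of a weighted graph with a simplified MCL loop."""
    n = len(similarity)
    # Add self-loops so every column has a non-zero sum.
    m = [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        expanded = matmul(m, m)  # expansion: random-walk flow spreads out
        inflated = [[v ** inflation for v in row] for row in expanded]
        new_m = normalize_columns(inflated)  # inflation: strong flow wins
        delta = max(abs(new_m[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = new_m
        if delta < tol:
            break
    # Each row that still carries flow lists the members of one cluster.
    clusters = set()
    for row in m:
        members = frozenset(j for j, v in enumerate(row) if v > 1e-3)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Hypothetical genes: a1-a3 are similar to each other, b1-b3 likewise,
# and a weak similarity (0.1) links a3 and b1.
names = ["a1", "a2", "a3", "b1", "b2", "b3"]
similarity = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
for cluster in mcl(similarity):
    print([names[i] for i in cluster])
```

In this sketch a higher inflation value favors tighter, smaller clusters, and a lower one favors larger clusters.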

@MerenLab 4 years ago

[Zoom question paraphrased from Clotilde, crusley and Aiswarya]: Can you input synteny information into a pangenome? What programs are able to analyze synteny? Are there examples of papers where the authors incorporated synteny information into their pangenomic analysis?

@ivaveseli3867 4 years ago

Definitely. Synteny information is great to pair with pangenomics (for instance, you could take a set of gene clusters - perhaps one that represents an operon - and see if the orthologs always end up in the same order in each genome in your pangenome). Shameless self-advertisement here: anvi'o has a program, anvi-analyze-synteny (merenlab.org/software/anvio/help/programs/anvi-analyze-synteny/) for synteny analysis, written by our very own Matt Schechter. There is also the MCScanX toolkit (pubmed.ncbi.nlm.nih.gov/22217600/) and a eukaryote-specific tool called Pangloss (www.ncbi.nlm.nih.gov/pmc/articles/PMC6678930/), but full disclaimer: I found those by googling and do not actually have experience in using them. :) Finally, here is one paper I found that seems to have done some synteny analysis in conjunction with pangenomics: www.frontiersin.org/articles/10.3389/fcimb.2017.00459/full
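
As a rough illustration of the operon example above, the sketch below checks whether a set of gene clusters shows up adjacent and in the same order (on either strand) in each genome. This is not how anvi-analyze-synteny, MCScanX, or Pangloss work; the function name, the genome names, and the gene cluster IDs are hypothetical.

```python
def is_syntenic(genome_order, operon_clusters):
    """Return True if the operon's gene clusters are adjacent in this genome
    and appear in the reference order or its reverse (opposite strand)."""
    if not operon_clusters:
        return True
    wanted = set(operon_clusters)
    positions = [i for i, gc in enumerate(genome_order) if gc in wanted]
    if len(positions) != len(operon_clusters):
        return False  # a gene cluster is missing or duplicated here
    contiguous = positions[-1] - positions[0] == len(positions) - 1
    observed = [genome_order[i] for i in positions]
    reference = list(operon_clusters)
    return contiguous and observed in (reference, reference[::-1])


# Hypothetical gene cluster order along each genome.
genomes = {
    "genome_A": ["GC_1", "GC_7", "GC_8", "GC_9", "GC_3"],
    "genome_B": ["GC_3", "GC_9", "GC_8", "GC_7", "GC_2"],  # reversed
    "genome_C": ["GC_7", "GC_4", "GC_8", "GC_9"],          # interrupted
}
operon = ["GC_7", "GC_8", "GC_9"]

for name, order in genomes.items():
    print(name, is_syntenic(order, operon))
```

Treating the reversed order as syntenic is a design choice for this sketch; a stricter analysis might also track strand and contig boundaries.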

@aiswaryaprasad9783 4 years ago

@ivaveseli3867 Thanks for the links, Iva, this was exactly what I was looking for!

@MerenLab 4 years ago

[KZbin question paraphrased from Anuradha]: What algorithm is usually used to create gene cluster networks? Can you use WGCNA or CoNet?

@MerenLab 4 years ago

[KZbin answer from Emily]: I don't know about WGCNA or CoNet, but MCL is the most commonly used algorithm to resolve the network.

@MerenLab 4 years ago

[Zoom question from Indu]: Do droplet culture techniques allow for growth of rare, slow-growing organisms?

@MerenLab 4 years ago

[Zoom answer from Iva]: Yes, those are precisely the organisms that could now possibly be cultured using droplet culture techniques.

@MerenLab 4 years ago

[Zoom question paraphrased from GAMA]: What does it mean when the relationship between genomes inferred by gene cluster presence/absence mirrors that inferred by phylogenomics? Does that mean that the microbial functions in that environment drive the evolution of a clade?

@MerenLab 4 years ago

[Zoom answer paraphrased from Daan]: It is important to keep correlation and causation separate. A set of genes correlating with an environment doesn’t necessarily imply that they drive adaptation, but it is a good starting point for hypothesis generation :)

@MerenLab 4 years ago

[Zoom answer from Hugo]: I would say it is usually the case. I think the first to notice this was www.nature.com/articles/ng0199_108. You might like this recent paper :) www.nature.com/articles/s41467-019-13429-2

@MerenLab 4 years ago

[KZbin question from Ali]: Can metagenomic reconstruction and assembly help identify novel lineages that cannot be captured by reference-based methods?

@MerenLab 4 years ago

[KZbin answer from Emily]: Yes! You can bin genomes from metagenomes without the use of references at all. Meren recently developed a strategy that uses single-copy core genes to assign taxonomy to these genomes. Check out this blog post if you're interested! merenlab.org/2019/10/08/anvio-scg-taxonomy/

@MerenLab 4 years ago

[Zoom question paraphrased from Biji]: Is there an archaeal gene database to find core or accessory archaeal genes?

@MerenLab 4 years ago

[Zoom answer from Iva]: I haven’t heard of people making databases to store pangenomic core/accessory genes for different groups of microbes, unless you are talking about single-copy core genes for archaea, for which there are databases. This is the one we use in anvi’o: www.nature.com/articles/nature12352

@MerenLab 4 years ago

[Zoom answer from Daan]: The Genome Taxonomy Database would be a good place to start, but, as Iva says, there are no precomputed DBs that I know of.

@MerenLab 4 years ago

[Zoom question paraphrased from Aiswarya and Tamara]: Where do the genomes used in a pangenome come from? Do the genomes need to already be known?

@MerenLab 4 years ago

[Zoom answer from Mike]: The input for a pangenomic analysis does need to be genomes, but these can be genomes recovered from metagenomes, or isolate genomes, or genomes from single-cell genomics.

@MerenLab 4 years ago

[Zoom answer from Daan]: The genomes used in a pangenome analysis can be derived from all possible sources. So, in principle you could mix genomes from a metagenomic sample recovered using the binning strategies discussed last week, single cell genomes and/or genomes of isolates.

@MerenLab 4 years ago

[Zoom question from Indu]: Is pangenomics limited to organisms in the same genus, or can it be expanded to all organisms present in an environmental sample?

@MerenLab 4 years ago

[Zoom answer from Iva]: It can be expanded if you are interested in the core/accessory groupings in all organisms in your sample, but it won’t be as resolved because you would be comparing vastly different groups of organisms :)

@MerenLab 4 years ago

[Zoom answer from Mike]: Great question! Pangenomics is only limited by the scope we are interested in. When we attempt to infer what was present in our last universal common ancestor (LUCA), this is a form of pangenomics across all domains.

@MerenLab 4 years ago

[Zoom question from Dylan]: What does "splitting paralogs" mean?

@ivaveseli3867 4 years ago

Just to clarify, this question was asked in relation to the question at 1:21:33 by Juan. Paralogs are homologous genes (genes with high amino acid sequence similarity that likely exhibit the same functions) that are present in the same genome - ie, multiple copies of the 'same' gene in one genome. (This is opposed to orthologs, which are homologous genes coming from different genomes - ie, a copy of a gene in genome A and a similar copy of a gene in genome B). Since they are homologous, by definition paralogs will be put into the same gene cluster. Therefore in pangenomics, 'splitting paralogs' refers to taking out the paralogs from your gene clusters so that your gene clusters only contain orthologs (ie gene clusters with at most one gene per genome).
Juan's question was - how do we know whether to split paralogs or not? And the answer was - in phylogenomics, you have to take them out, but in pangenomics, we don't know and we (the Meren lab) have never encountered a pangenomics situation in which we needed to split paralogs from our gene clusters. But it seems to be a dataset-specific question, and other analyses may benefit from it. :)
Finally, here is more discussion from the Zoom chat about 'splitting paralogs' in phylogenomics analysis:
[Zoom response from Daan Speth]: @Dylan, homologous genes can be divided into orthologs and paralogs. The paralogs arise from duplication within a genome and will confuse the clustering analysis Meren talked about in the seminar.
[Zoom response from Mike Lee]: @Dylan, i’d add that one of the core principles of phylogenetics in general is that the genes being considered are under similar evolutionary pressures, and this gets less likely to be a safe assumption when there are multiple copies of the same gene.
[Zoom follow-up from Dylan Baker]: @Mike lee is this because one gene has more room for evolutionary change since there will always be a “working copy” of the other gene?
[Zoom follow-up from Mike Lee]: @Dylan, yeah that’s the thinking. When paralogs are present (from gene duplication like Daan mentioned), then there is more freedom for divergence.
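
One simple way to picture 'splitting paralogs' is to keep only the gene clusters that have at most one gene per genome. The sketch below does just that on made-up data; the data layout (cluster ID mapped to a list of genome/gene pairs) and all names are assumptions for this example, and more careful approaches would try to pick one ortholog per genome instead of dropping the whole cluster.

```python
from collections import Counter


def genes_per_genome(cluster):
    """Count how many genes each genome contributes to one gene cluster."""
    return Counter(genome for genome, _gene in cluster)


def has_paralogs(cluster):
    """True if any genome contributes more than one gene to the cluster."""
    return any(count > 1 for count in genes_per_genome(cluster).values())


def clusters_without_paralogs(gene_clusters):
    """Keep only clusters with at most one gene per genome."""
    return {cid: genes for cid, genes in gene_clusters.items()
            if not has_paralogs(genes)}


# Hypothetical gene clusters: cluster ID -> list of (genome, gene ID).
gene_clusters = {
    "GC_1": [("genome_A", "a_001"), ("genome_B", "b_014"), ("genome_C", "c_230")],
    "GC_2": [("genome_A", "a_050"), ("genome_A", "a_051"), ("genome_B", "b_077")],
    "GC_3": [("genome_B", "b_100"), ("genome_C", "c_008")],
}

kept = clusters_without_paralogs(gene_clusters)
print(sorted(kept))
```

Here GC_2 is dropped because genome_A contributes two genes to it, which is the situation a phylogenomic analysis would want to avoid.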

@MerenLab 4 years ago

[Zoom question from Aiswarya]: Does somebody know for approximately what percentage of known species we can make a reasonable guess about their set of core genes?

@MerenLab 4 years ago

[Zoom answer from chequita]: I could be totally wrong, so take this with a grain of salt, but where I have seen pan genomes mentioned in the literature they seem to be exclusively focused on model organisms with human-disease relevance. A quick Google Scholar search also suggested that this is the case. Knowing that the microbes associated with humans are a very small percentage, I would say a very small percentage of orgs have “known” pan genomes ;)

@MerenLab 4 years ago

[Zoom answer from Iva]: @chequita I think you are right that it is a small percentage because pangenomics is usually a specific look at a certain group (hence it is not high-throughput, and it hasn’t been around long enough for lots of organisms to be pan genome-ed), but I also want to note that pan genome studies are not exclusively focused on model organisms. There are several pan genome studies in ocean microbes, for instance :)

@MerenLab 4 years ago

[Zoom question from Gurpeet]: How is codon degeneracy taken into account when making gene clusters?

@MerenLab 4 years ago

[Zoom answer from Iva]: We are using sequence similarity at the amino acid level, so synonymous codons (different codons that encode the same amino acid) already look identical when gene clusters are computed.

@MerenLab 4 years ago

[Zoom question from Aiswarya]: Any recommendations from the community for tools that can be used for pangenomics? I am not a bioinformatician by training but I hope to do exactly what Meren said with regards to getting my hands dirty with the data, so a tool that is not a black box is better!

@MerenLab 4 years ago

[Zoom answer from Mike]: Yep :) anvi’o merenlab.org/2016/11/08/pangenomics-v2/

@MerenLab 4 years ago

[Zoom question from Indu]: Can we use gene cluster presence and absence to assign taxonomy/phylogenetic position for a given organism? If so, will this eventually replace traditional methods?

@MerenLab 4 years ago

[Zoom answer from Mike]: It is definitely another way of grouping organisms that is informative in its own way, but it is a different approach from doing it phylogenetically (which is the more agreed-upon and, I think, more sensible way to assign taxonomy).
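
As a small illustration of grouping genomes by gene cluster presence/absence, the sketch below computes pairwise Jaccard distances between genomes. The choice of Jaccard distance, the genome names, and the gene cluster IDs are assumptions for this example; it is one possible distance, not necessarily the one any particular pangenomics tool uses.

```python
from itertools import combinations


def jaccard_distance(clusters_a, clusters_b):
    """1 minus the fraction of gene clusters shared by two genomes."""
    union = clusters_a | clusters_b
    if not union:
        return 0.0  # two empty genomes are treated as identical
    return 1 - len(clusters_a & clusters_b) / len(union)


# Hypothetical presence/absence data: genome -> set of gene cluster IDs.
presence = {
    "genome_A": {"GC_1", "GC_2", "GC_3", "GC_4"},
    "genome_B": {"GC_1", "GC_2", "GC_3", "GC_5"},
    "genome_C": {"GC_1", "GC_6", "GC_7", "GC_8"},
}

for (name_a, set_a), (name_b, set_b) in combinations(presence.items(), 2):
    print(name_a, name_b, round(jaccard_distance(set_a, set_b), 3))
```

A distance matrix like this can be fed to any hierarchical clustering method to group genomes by gene content, which can then be compared with a phylogenomic tree.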

@MerenLab 4 years ago

[Zoom question paraphrased from mh]: What are the technical limits in computing and visualizing a pangenome? Could you make a pangenome with several thousand genomes?

@MerenLab 4 years ago

[Zoom answer from Iva]: Yes, this could work if you had enough computational power (memory, mostly, is the bottleneck I believe).

@MerenLab 4 years ago

[Zoom question from jon]: Can pangenomics help assign functional annotations to hypothetical proteins found in genomes?

@MerenLab 4 years ago

[Zoom answer from Iva]: That is a good idea! I would argue though that most proteins are annotated computationally by sequence similarity to proteins of known function, so it is unlikely that hypothetical proteins would end up in a gene cluster with other ‘known’ genes.

@MerenLab 4 years ago

[Zoom question from Kinga]: What is the difference between species and strain in microbiology?

@MerenLab 4 years ago

[Zoom answer from Iva]: Species is a larger taxonomic group than strain. Strain usually refers to a variant (sub-population) of a species (population); it is a more specific group.

@MerenLab 4 years ago

[Zoom question from Indu]: What happens to your pangenome when it includes some partial genomes?

@MerenLab 4 years ago

[Zoom answer from Iva]: In that case, some of your partial genomes may be missing genes, so you may get an inaccurate view of which genes are truly core or accessory.

@MerenLab 4 years ago

[Zoom answer from Mike]: We just have to accept it and be aware that the lack of seeing something doesn’t necessarily mean it is really absent.

@MerenLab 4 years ago

[Zoom question from choon]: How many genomes do you need to make a “believable” pangenome?

@ivaveseli3867 4 years ago

It depends on what 'believable' means :) I personally would interpret that in the context of doing a pangenome of a higher-resolution taxonomic group (like a species) as "how many genomes do you need to be sure that you have accurately represented the core and accessory genomes in your pangenome". If I am wrong in my interpretation, please let me know. But to give you an unsatisfactory answer to that interpretation, the more genomes you have of that species, the better, but I would arbitrarily say that a decent minimum is ~4-5 genomes.
Also, I would like to bring up the following excellent points from our other experts in the field:
[Zoom answers from Mike Lee]: @choon, depends what is meant by believable, I prefer to think of any given pangenome simply as a pangenome of the genomes that were included in the analysis. To expand on that a little, I mean if I do a pangenome of 50 Alteromonas genomes, I’d say it’s a pangenome of these 50 Alteromonas genomes. If I do one of all available Alteromonas genomes, I’d say it’s a pangenome of all available Alteromonas genomes :)
[Zoom answer from Daan Speth]: @choon, hard to say. If you add more and more, your core genome will shrink but approach the “true core”, whereas your pangenome will grow and eventually approach “all possible genes”. So as you can see, there is a question of which level of resolution you are most interested in here :)
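
Daan's point about the core shrinking and the pangenome growing can be sketched in a few lines of Python. The genomes and gene cluster IDs below are invented for illustration, and the order in which genomes are added is arbitrary (real accumulation curves usually average over many random orders).

```python
def core_and_pan_sizes(genomes):
    """Return (n_genomes, core_size, pan_size) after adding each genome.

    `genomes` is a list of sets of gene cluster IDs, one set per genome.
    """
    core, pan = None, set()
    sizes = []
    for n, clusters in enumerate(genomes, start=1):
        # Core: gene clusters seen in every genome added so far.
        core = set(clusters) if core is None else core & clusters
        # Pan: gene clusters seen in at least one genome added so far.
        pan |= clusters
        sizes.append((n, len(core), len(pan)))
    return sizes


# Hypothetical genomes described by the gene clusters they contain.
genomes = [
    {"GC_1", "GC_2", "GC_3", "GC_4"},
    {"GC_1", "GC_2", "GC_3", "GC_5"},
    {"GC_1", "GC_2", "GC_6"},
    {"GC_1", "GC_2", "GC_4", "GC_7"},
]

for n, core_size, pan_size in core_and_pan_sizes(genomes):
    print(f"{n} genomes: core={core_size}, pan={pan_size}")
```

With these toy genomes the core drops from 4 to 2 gene clusters while the pangenome grows from 4 to 7. An incomplete genome would shrink the core in the same way, which is why partial genomes can make truly core genes look accessory.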

@ivaveseli3867 4 years ago

I also want to record here the follow-up discussion to this question:
[Zoom follow-up from choon]: let’s say from my MAGs, I have 8 genomes. Should I just use these 8? Or perhaps add in more genomes from GTDB to calculate the core/non-core genes?
[Zoom response from Mike Lee]: @choon, it’s up to you and what you want to say. If you want to compare those MAGs, you could do just them. If you have 2 MAGs with a distribution across your samples that is of interest, you might want to do separate pangenomes of those MAGs along with other closely related references. It is very much guided by your interest.