Welcome to this deeper dive into genomic context methods. The goal of genomic context methods is to identify functional associations between genes, that is, which genes work together, based on nothing but a collection of genomes. These methods are included in the STRING database, which you can learn more about in this brief introduction to the STRING database.

The first type of genomic context method to be developed was based on gene neighbourhood. The idea is that in prokaryotes you have so-called operons, where multiple protein products are encoded by a single polycistronic transcript. They can be identified by looking for runs of genes in genomes, that is, a set of genes sitting next to each other and being transcribed in the same direction. The problem is that this can easily happen by random chance. Genes sit next to each other in the genome, and there is a 50-50 chance of two neighbouring genes pointing in the same direction.

One way to address this is to look at intergenic distances. If two genes sit far from each other, there might be a promoter in between, and they may not be an operon. Conversely, if two genes sit very close to each other, like in the lower example, there is no space for a promoter and they almost have to be an operon. The other option is to look for evolutionary conservation of the operons. That is, look across a number of different genomes and see if it is conserved over long evolutionary distances that these genes sit together in runs.

You can also look for bi-directional promoters. That is, instead of looking only for genes pointing in the same direction, you can look for divergent transcription of two genes that sit next to each other. In this case, you may have a single promoter in between that regulates both. Very often what you see is that you have a transcription factor on one side and target genes on the other.
And these form a feedback loop where the transcription factor regulates both its own expression and the other target genes. Interestingly, this kind of bi-directional promoter unit is also seen in eukaryotes, not just in prokaryotes.

A completely different type of genomic context method is the gene fusion method. The idea is that if you have two genes in one organism, encoded in completely different places in the genome, possibly even on different chromosomes, you can have a gene fusion event happen and end up with a single fusion gene like this. The fusion gene will encode a fusion protein, which will typically be a multi-domain protein. The problem is again that fusions can happen by random chance, even though it is much less common than for gene neighbourhood. But more importantly, they can stem from annotation errors. Especially in eukaryotes, where you have introns, you can easily fuse two neighbouring genes by accident and think that it is a fusion gene when biologically it is in fact not.

One way to address that is to look at multiple species and see if you can find the same fusion protein in not just one genome, but multiple genomes. Even better, you can look at this in combination with a species tree and see whether it is just one group of genomes that has this fusion protein, or whether there have in fact been multiple independent fusion events between the same two genes. The problem is that fusions are relatively rare events, and observing multiple independent fusions of the same two genes is therefore even rarer.

The last method I want to cover is phylogenetic profiling. The idea is that you can look at the presence-absence patterns of genes like this, where you have the red, the yellow and the green, and each row corresponds to a different genome. If, like here, they have identical or similar presence-absence profiles, you can infer that they likely work together in a functional unit.
So what you do is calculate profile similarity, which could for example be done using the Jaccard index. You can gain even more power by making use of sequence similarity: instead of a simple presence-absence profile, you calculate a best-hit profile, in which you quantify how similar the best hit for this gene is in each other genome in your database. You then calculate the Pearson correlation coefficient instead of the Jaccard index for these profiles.

In either case, you have the problem of genome redundancy. That is, you can have many very similar genomes, for example different strains of the same species being sequenced. These are of course evolutionarily dependent, meaning that they are not independent observations. And you can easily get very similar profiles just because genes are present in one species and not anywhere else.

One way to address this is by using tree-based methods. What you do is again to take a species tree and put it next to the profiles. Then, instead of just counting how many genomes have each gene and how good the overlap is, you can quantify the number of independent evolutionary events, in terms of gene losses or gene gains, that are needed to explain the similarity. The problem is that this does not address what I call lifestyle similarity. It does address evolutionary similarity, but it does not address that, for example, intracellular parasites will tend to have very similar profiles just because they lose the same set of genes that they no longer need, since they steal metabolites from the host.

For this reason, STRING instead uses the SVD-phy algorithm. We first calculate a best-hit matrix quantifying how good the best hit is for each gene in the genome of interest in every other genome in STRING. We then use singular value decomposition to calculate a reduced latent space, in which we calculate Pearson correlation coefficients.
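
The two kinds of profile comparison described here can be sketched in a few lines of Python with NumPy. This is only a minimal illustration under stated assumptions, not the implementation used in STRING: the function names, the toy matrix, the choice of `k`, and the use of unscaled left singular vectors as latent coordinates are all hypothetical choices made for the example.

```python
import numpy as np


def jaccard(a, b):
    """Jaccard index of two binary presence-absence profiles."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        # Both genes absent everywhere: define the similarity as 0.
        return 0.0
    return float(np.logical_and(a, b).sum() / union)


def latent_profile_correlation(best_hit, k):
    """Pearson correlation between gene profiles in a k-dimensional latent space.

    best_hit: genes x genomes matrix of best-hit scores (assumed to be
              normalized, e.g. to the range 0..1; that scaling is an assumption).
    k:        number of latent dimensions to keep (illustrative parameter).
    """
    m = np.asarray(best_hit, dtype=float)
    if k < 2:
        raise ValueError("Pearson correlation needs at least 2 latent dimensions")
    # Thin SVD: m = u @ diag(s) @ vt, singular values sorted in descending order.
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    # Keep the first k left singular vectors as latent gene coordinates.
    # Leaving them unscaled by s is an assumption made for this sketch.
    latent = u[:, :k]
    # np.corrcoef treats each row (gene) as one variable.
    return np.corrcoef(latent)


# Hypothetical toy best-hit matrix: 5 genes (rows) x 6 genomes (columns).
toy = np.array([
    [1.0, 0.9, 0.0, 0.1, 0.8, 0.0],
    [0.9, 1.0, 0.1, 0.0, 0.7, 0.0],
    [0.0, 0.1, 1.0, 0.9, 0.0, 0.8],
    [0.1, 0.0, 0.8, 1.0, 0.0, 0.9],
    [0.6, 0.2, 0.5, 0.3, 0.9, 0.4],
])

corr = latent_profile_correlation(toy, k=3)  # genes x genes correlation matrix
```

With a real best-hit matrix, row `i` of `corr` could be sorted to rank all other genes by how similar their latent profiles are to gene `i`. Because many near-identical genomes mostly share a few latent dimensions, they no longer count as many separate observations.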
That way we're effectively compressing the genomes that are very similar, regardless of whether it's because they are different strains of the same species or because they are, for example, intracellular parasites.

I want to end on a real example. Imagine you're interested in this particular gene in Streptomyces. You can search across all the different genomes and find the best hit, and you really only find a good hit in one specific other genome. You can go and do the same search for all the other genes in your genome of interest and rank their profiles by how similar they are to that of the original query gene. That way you get a ranked list of genes like this, where you see that the query gene was a putative secreted cellulase and most of the other genes are also putative secreted enzymes that break down sugar bonds, plus a couple of likely transcriptional regulators or DNA-binding proteins that likely regulate the expression of all the others on the list. It turns out that in this specific case there's not much putative about it. This is the cellulosome. It's a set of extracellular enzymes that are secreted by cells in order to break down cellulose outside the cell into smaller sugar units that can be imported into the cells and then metabolized further.

That's all I wanted to say about genomic context methods. If you want to learn more about how they are integrated with other types of evidence in the STRING database, go have a look at this presentation. Thanks for your attention.