All right, good afternoon. Let's go ahead and get started. We have Aidan Lakshman from the University of Pittsburgh, Department of Biomedical Informatics, a doctoral candidate, so please take it away. Okay, sorry, can you guys hear me okay? Everyone's here, wow, great. Great, thank you. If at any point you can't hear me or if I just start drifting, then yell at me or something. Yeah, so make sure the presentation is working right. Sorry about the bottom bar, but we'll just go with it. My name's Aidan Lakshman. I'm a doctoral candidate at the University of Pittsburgh in the Department of Biomedical Informatics. I work with Dr. Wright. He's right there; he's the tallest person at the conference, I hope. And yeah, I'm gonna be talking about how we can use the DECIPHER and SynExtend packages for comparative genomics. This is a workshop talk, so there are a ton of resources for people, and my clicker's not working, so that's pretty cool. See if we can restart it here. If not, I'll just click. Okay, no worries, I'll just click, it's fine. I did, but that's, oh, I wasn't on the page. Yeah, it's crazy how I'm doing a PhD in biology and computational genomics and I can't use a computer. So again, this is a workshop. There are lots of resources for this. I have, of course, prepared vignettes online. You should be able to access the website through Bioconductor if you can't find it for any reason. This is the link, I'll walk you through it. There's also a Docker container, I believe, with all the scripts and stuff available on Orchestra. The title is "Comparative Genomics with SynExtend and DECIPHER." It takes a minute to load, so if you're planning to follow along with the code, launch it now. If you want, you can use Orchestra to follow the code. You can also just kind of live code with me, or you can not code, so that's okay too. There's not a lot of coding, I promise.
We're gonna keep it mostly to talking about the package and what it does, and then some ways you can go through it yourself. Yeah. Well, it's working on here. Great, thanks. Okay, everyone good? If you have questions at any point, stop me. I'm happy to do this pretty informally. So again, we're back on the title slide: comparative genomics with DECIPHER and SynExtend. These are two packages we maintain on Bioconductor. They are both set up to solve big problems in computational genomics. So I have two problems that I wanna address today that we have solved with these packages. The first is pretty specific to me and the second is specific to you guys. So the first of these is about, man, just when I feel like it's working, it stops. I'm just gonna use my computer. So, sorry. The first of these is protein functional prediction within the modern era of big data. I'm sure everyone is hopefully tired of people saying that we have big data and there's so much sequencing data and sequencing is cheap. But if you didn't know, sequencing is now very cheap. There's lots of sequencing data out there, and the main cost is now in analysis. How do we find people to analyze this data and learn something from it? And so a huge problem, especially if you're not studying model organisms like we do, is that within the universe of biological knowledge, we don't actually know that much. We know a lot about some very specific things. We know a lot about humans. We know a lot about E. coli. We know a lot about that one yeast, which is great. But once you get outside the realm of the very commonly studied organisms, we don't have a lot of information. And so with all this sequencing data, we find proteins, and then we upload them to the internet, and we say this is probably a protein, and then we just kind of move on.
So there's a lot of undiscovered knowledge out there, and these kinds of uncharacterized proteins could be anything. They could be new antibiotics that we could be using. They could be new biosynthetic materials. They could be new sources of virulence in pathogens. They could be new ways to cure cancer, whatever you can imagine, just because there's so much that we haven't even looked at yet. And so the goal of my PhD work, and the reason that I contributed to Bioconductor within our SynExtend package, is that we are trying to functionally annotate this proteomic dark matter. All this uncharacterized stuff out in space here, we wanna figure out what it's doing. We wanna take this tiny speck of information we have and extend it out to the entire universe, and then find some way to reconcile it with things humans can understand. So actual labels, not just an opaque gene name. Real labels like antibiotic, material, et cetera. And there are a lot of ways to do that. The first is the way that you would probably think of naively, which is to just take a particular gene or protein of interest and just look at it. And you can say, here's a protein. It is very pretty and great. And that works really well, but it takes a lot of time and it's very labor intensive, so it doesn't scale very well. Here we're just looking at a single sequence or a single protein's properties. This is what AlphaFold does: it does protein folding really well, but it doesn't scale super well to figuring out new information. And so with the rise of comparative genomics, we've gone past this a little bit. So instead of looking at individual proteins, we start to look at what's called homology. We take proteins that look like ones that we've already studied. I'll use the laser pointer for people online. We use proteins that look like ones we've already categorized. So these ones all have similar structure.
They all derive from a similar ancestor in the past. And then we can say that if we know what one of these does, and the rest of them all look like that one, then in theory they all do the same thing. So to do this, instead of looking at a single sequence, we look at a lot of sequences that all derive from the same common ancestor, and we get some kind of evolutionary signal over time. And this is also great. We can take things we know and extend them to things that look like things we know. So now we've categorized everything that looks like everything we've already studied. But the big problem, as you've probably realized, I hope, is that we haven't studied everything, and we don't know what everything looks like. And so to get at things that we have just never studied before, my work looks at how proteins interact with each other. How do they work together to form function, and can we examine how they're working together and use that as guilt by association: they are always appearing together, and as a result, they're doing a common function. And I promise I'm getting to the code and the work, but the way we do this is by looking at how these proteins evolve together over time, looking at many genes at once and looking at common co-evolutionary patterns over time. I'll get into exactly what I mean by that later on, but this is the central idea. And so this gets us into large-scale comparative genomics. We need to get a lot of data from a lot of genes and compare them all against each other, and we need tools to be able to do that. So the way that we do that is we start with genomic data and then we make new inferences. And everyone here that is doing genomics knows that there's a lot more complexity to this. We don't just find the answers. We typically have to go through a very involved pipeline that takes us through a lot of different steps.
Each of them with their own programs and quirks and inputs and outputs to get to our end goal of inferring novel function. And while the end goal is specific to me, the broader pipeline and the broader problem are a lot more broadly applicable. A lot of the things that we're doing in this pipeline are common to most genomic analyses, like aligning sequences, like building trees, et cetera. So this gets us into our second big problem, the one that most people will care about more with DECIPHER and SynExtend, which is that we are solving the problem of availability, accessibility, and consistency of tools within comparative genomics. So as I mentioned before, we have this entire pipeline, right? We have to start here. We have to get our sequencing data. We have to find the genes within that. We have to find homology, build gene trees. There's a lot of steps. And each of these steps has its own tools. There are probably some familiar names to people out there. You don't need to know all of them, but some big hitters are IQ-TREE for tree building; a lot of people know of HMMER, or antiSMASH for finding biosynthetic gene clusters. Lots of tools. They're color coded by the language that you use them in, or how you access them or how you build them. There are a lot of colors up here. The one that is notably missing is red. There's one. And red is for R, which is your favorite because you're Bioconductor people. So how do we solve this problem? What we've done is we've gone through these methods and we've re-implemented them. The goal here is we're building software for users, for people in comparative genomics. People in comparative genomics are not always computer scientists. They are very often biologists or chemists, or they're coming into a new field and exploring using computer science. And it's a lot to get into when you have all these, especially, command line applications and a lot of file inputs and outputs.
And for some reason, they're never standard in what outputs and inputs they're giving and taking. And so what we've done is we've re-implemented these and published them within two big packages that I've mentioned a bunch of times. We'll say them again. Come on. Next slide. There it is. Sorry, my screen's acting weird. Sorry, everyone. These phases we've replaced with DECIPHER. All tools within DECIPHER, one package in R. No file input or output. Just RStudio, just lines of code, simple and accessible. The rest: SynExtend. So with two packages, in just one language, you can run an entire comparative genomics pipeline without saving off any files, without having to worry about what format your tree files are in or any other concerns. Just RStudio and DECIPHER and SynExtend. Simple and accessible for end users. So when I made this workshop, I really thought I would have time to cover the entire pipeline, every single piece of it. And I gave one practice run and I made it to step one in two hours. So the good news for you guys is the whole pipeline is available in the workshop materials; there are tutorials for every step of this on the website, and it'll stay available. For this workshop today, I'm gonna start by skipping this step, and then if there's time, we'll go back to it. But I wanna get to the fun stuff, which is predicting protein function with no prior information or anything, just predicting protein function. And that's this phase. So we run this whole pipeline, and you'll have to trust me that it works, and I can prove it to you later. We get some data that comes out, it comes into these last steps, and then we're gonna do a co-evolutionary analysis on it. And I will tell you what that is, I promise, just stick with me for a little longer. So something that's always important to understand is: what data are coming in and out? What data are we generating? What are we actually doing?
So starting there, remember we run this whole pipeline, so we start with some genomic data. It runs into our DECIPHER and SynExtend pipeline within R. That's why it's red. And from that we get out three main things that are all in the vignette. We get out clusters of orthologous genes. These are groups of genes that derive from the same common ancestor; in theory, they are presumed to have the same function. Secondly, we have annotations for as many proteins as we can annotate. So just writing down their function if we know what they are. If we don't, we'll say they're uncharacterized, which is most of them, but some you get. And lastly, we have gene trees for every single one of these clusters. So we've reconstructed their evolutionary history over time. And again, all this is done with DECIPHER and SynExtend, and there's a tutorial on how to do it all on the website. And so these three things are the inputs to our algorithm, which is new to SynExtend, and which we'll be covering in the workshop today. The other thing that's kind of important to know, if you're one of the six people that liked my tweets, you'll be wondering. Thank you to those people, special shout out. We have genomic data from a particular species. Sorry, genus of organism. This is a gram-positive bacterium. It is Micrococcus, that's the answer to the tweet poll. And this particular one is Micrococcus luteus, which is most well known as being the reason that your sweat smells. So take that information home. We chose it for this conference because they tend to have pretty small genomes, so we can do the analysis within a pretty short time. Each of them has a genome of around one to two megabases. So not that big. There are plenty of them available. We were able to pull 90-ish genomes and analyze them. And this whole pipeline scales much larger, but this is just scaled down a little for our purposes within this workshop. Okay, there's our data.
So now we can finally get to the code. I know people are dying to get to the code. I can see it on your faces. This is the website. On the website, under Workshops, there's a tab called Coevolutionary Networks. And that's where you're gonna go. And if you're not following along with the code, then just give us a minute. So if you're here, you just go to Workshops and you go to Coevolutionary Networks. And that brings you to the vignette. If you are using the Orchestra container, you can launch it. And I actually have all of this pre-written for you guys. There is a particular file you can open. It's under conference materials, and then the coevolutionary workshop R Markdown file. If people have any problems accessing any of this, let me know. Yeah, and I'm just gonna walk through how we actually do this code before we go back to some more slides. Let me get a drink. Talking really dries you out, I didn't know that. Shockingly enough. Right, okay, so we have data. As I mentioned, it comes out of our pipeline. And then we load it in to do our analysis. And so the first step of any analysis is to load in your data, and that's what we'll be doing. You don't have to read all the papers, but we'll be loading in the data in the first line here. And all this data should be pre-loaded on the workshop container. If you don't have it, you can download it here, or again, just follow along. But I'm just going to load all our libraries. Oh, it's chugging a little here. All right, load our two libraries and load in our data files. And these are all stored on the container. And they're all a little hefty, so give your computer a little bit of time to load them in. There's a lot of data there. I don't know what it's doing now. Sorry, folks, one second. I guess while I'm waiting for it to load, did people get in? Okay, does anyone have questions on getting to the code or the datasets, anything up to this point?
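As a sketch, those first few lines of the notebook look something like this. The file names here are placeholders (the container ships its own copies of the data), so treat this as an outline rather than a verbatim transcription:

```r
# Load the two packages (both from Bioconductor)
library(DECIPHER)
library(SynExtend)

# Load the three pre-computed data files produced by the pipeline.
# These file names are placeholders -- use whatever the workshop
# materials actually ship with.
load("COGs.RData")        # clusters of orthologous genes
load("Annotations.RData") # per-gene functional annotations
load("Trees.RData")       # one gene tree (dendrogram) per COG
```

The point is just that the whole setup is two `library()` calls and three `load()` calls, nothing more.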
Happy to, ah, we got them all. All right. Right. So we loaded the files. And again, there are three files because there were three data sources. We had the clusters of genes, which I'm going to call COGs, for clusters of orthologous genes. COGs, kind of fun. There's an annotations file and there's a trees file. And we can look at them. They're all lists, and they're all pretty long. Out of the 90-ish genomes, I think we got around 3,400 distinct COGs. And so we can look at them. If we look at the seventh entry, some random one, we get all these numbers. We can look at annotations. We can look at trees. And I want to make really sure that people understand the dataset. So this is a character vector. There are all these numbers here, and they're all sets of three numbers. Each of these uniquely identifies a particular gene. We had 91 genomes as our input, and we identify each gene with just a three-number code. If you don't follow it exactly, it's okay. The big takeaway is that the first number is which genome it comes from, and each set of three numbers uniquely identifies a gene by which genome, then which contig on that genome, and then which gene on that contig it actually is. So just to walk through that on here: 1 1 919 is from genome number one, contig number one, and it's gene number 919. That's our gene. And so here we have 91 distinct genes in this COG. We have annotations for most of these, so you can look at them if you want. It's a list, formatted the same way, in this annotations file. And I can print that out, and we'll see a collection of annotations that we generated through DECIPHER. And there's a lot here. Each particular gene in that COG has an annotation, so there are 91 annotations. I don't expect you to read 91 lines of text, but if we look at one particular one, it actually prints out a hierarchical classification of what that gene does.
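As a toy illustration of that three-number ID scheme (the exact separator and formatting in the real data may differ; this is just a sketch with made-up identifiers), you could pull the pieces apart in base R like this:

```r
# Hypothetical gene identifiers of the form "<genome>_<contig>_<gene>"
ids <- c("1_1_919", "12_1_54", "3_2_101")

# Split each identifier into its three components
parts <- strsplit(ids, "_", fixed = TRUE)

# First number: which genome the gene came from
genomes <- vapply(parts, function(x) as.integer(x[1]), integer(1))

# Third number: the index of the gene on that contig
genes <- vapply(parts, function(x) as.integer(x[3]), integer(1))

genomes
genes
```

So "1_1_919" decodes to genome 1, contig 1, gene 919, exactly as described in the talk.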
So we have that it is some kind of gene that does information processing. One step lower, it does translation. Can you guys read this okay? Sorry, I should have checked in the back. Are you guys good? Okay. It's a part of the ribosome. And then most specifically, it's part of one of the ribosomal subunits. And so all of these are annotated. You can look through them at your leisure. Some of them don't have annotations because we don't know what they do, and that's the focus of this workshop. And lastly, again, we have gene trees for each of them. So I can print this out too. This is the corresponding tree. It happens to be a dendrogram object, which is how we store trees in R. And we can plot it. And if we plot it, we get a nice little tree. Look at that. Nice. I took out the labels because they make it really cluttered; you can plot it with them if you like. But this is really just to show you guys what the data look like. No real action necessary from you guys at this point. So for the workshop, we're gonna work with a subset of the data, just so that it finishes quickly, and so that it's easy to showcase what it's doing and to double check that things are going well. So the next code block in here is just how we're subsetting our dataset. We're doing it based off of a couple different factors, which are specified in the vignette. Sorry, let me get down there. All this is in the vignette, so you can see the outputs. And we're subsetting it based on a couple of things. So a lot of text, right? The first of these is we want genes that are present in a bunch of genomes. A bunch, not singletons, not ones that are in just two. We want them observed at least a few times, so we have some data to check. And we also want them to not be housekeeping genes, because we want them to be at least a little interesting, evolving, under some kind of evolutionary pressure. It gives us more signal to look at.
We also want them to be annotated, just for the workshop, so that I can show you a cluster with annotations and not a big one that just says uncharacterized, uncharacterized, uncharacterized. So that's for you guys. Please re-run it without that filter later; it's really cool, in my opinion. And a couple other quick things. We remove genes that are born out of duplication events rather than speciation, which is a very common trimming step in comparative genomics. And we're also only taking genes that are actually coding genes. No non-coding RNAs; we want ones that encode a protein with a function in the cell. So that's what all these sapply statements do. There are a lot of them. Each of them is basically iterating over our dataset and producing a TRUE/FALSE value for which ones we're gonna keep and which ones we're not, according to each of the requirements at the beginning. If people have questions on any lines here, I'm happy to go through them. But just as an example, yeah? Oh yeah, sorry, what are the two numbers? These are our two cutoffs for making sure we're getting genes that are not housekeeping genes and also not singletons, with a little bit of error built in. So we take COGs that have at least five members, meaning they're from at least five genomes, and that are in fewer than 88 out of the 91. For housekeeping genes you would expect to find them in every genome, but there's a chance the pipeline missed a few, so we give them a little bit of leeway. Same on the lower end, just to make sure that we get enough samples and examples. Does that make sense? And I'll just walk through one line here, since they're all very similar to each other. For example, if we wanted to trim the COGs down to just the ones that have at least five different members in them, that is this one here. What it's doing is sapply-ing a function over our set of COGs, looking for ones where the number of unique elements meets our cutoff, which is five.
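To make that filtering step concrete, here's a minimal base-R sketch of the same kind of sapply filter on made-up data. The real vignette code and cutoffs live in the workshop materials, and the gene-ID format here is hypothetical:

```r
# Toy stand-in for the workshop data: each COG is a vector of gene IDs,
# where the number before the first "_" marks the source genome
COGs <- list(
  cog1 = c("1_1_10", "2_1_4", "3_1_7", "4_1_2", "5_1_9", "6_1_1"),
  cog2 = c("1_1_11", "2_1_5"),   # too few genomes: likely a singleton-ish COG
  cog3 = paste0(1:91, "_1_3")    # in every genome: housekeeping-like
)

# Count the number of distinct source genomes per COG
nGenomes <- sapply(COGs, function(cog) {
  length(unique(sub("_.*", "", cog)))
})

# Keep COGs seen in at least 5 but fewer than 88 of the 91 genomes
keep <- nGenomes >= 5 & nGenomes < 88
subsetCOGs <- COGs[keep]
```

Each sapply call in the vignette follows this same pattern: map each COG to a TRUE/FALSE, then index the lists by the combined logical vector.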
Does this make sense to people at this point? Any questions over here? Okay, sorry, you guys are all pros, so. At the end of this, we subset our data. So we get out the set of COGs subset for the workshop, the set of annotations subset for the workshop, and the set of trees subset for the workshop. Again, this will all run if you don't subset anything, but don't do that now, because we don't have enough time. So I'm gonna do that in the workshop notebook that I'm following along with you guys in. It should be very quick; it runs straight through. And if we print out, well, turns out live coding requires two hands, sorry everyone. If we look at the length of what's left, maybe just the COGs, then we can see there are 88 COGs left, which is a lot less than we started with. We went from 3,400 down to 88. Again, just for the workshop. Sorry, having hardware difficulties. Okay. The last thing we do, I know there's a lot of pre-processing, thank you for sticking with me. There's only one more thing I wanna do before we get to the actual analysis, and it's really short, so it's gonna be a lot of build-up to a very quick line of code. I mentioned before that these annotation files have annotations for every single gene in every single COG. So this one, for instance, had 91 elements, and so we have 91 annotations. That's great, but if we're going to be clustering the COGs, I would like to have a single label for each one, just for visualization. And so what I do down here in the next code block is to create a list of consensus annotations. For each COG, we look at the annotations and we take the one that's most common, just to get a consistent single label for each element. So we're going to loop over all of the annotations; for each one, we tabulate the annotations we have and take the most common one, and at the end, we get back a single character vector of consensus annotations.
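The consensus step being described can be sketched in a few lines of base R. The labels here are invented for illustration; the real loop in the vignette runs over the workshop annotations:

```r
# Toy annotations: one character vector of per-gene labels per COG
annots <- list(
  c("urease subunit alpha", "urease subunit alpha", "uncharacterized protein"),
  c("dihydrofolate synthase", "dihydrofolate synthase")
)

# For each COG, tabulate the labels and keep the most common one
consensus <- vapply(annots, function(a) {
  names(which.max(table(a)))
}, character(1))

consensus
```

One label per COG, picked by simple majority; ties go to whichever label `table()` lists first.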
And if I print it out for you guys, I can show you what I'm talking about. So consensus annotation one, oops, it's blurry, I can't spell, is an uncharacterized protein. So we don't have any information for that COG. I could look at the second one. It happens to be a dihydrofolate synthase. You guys need to come up with better names in biology. These words, right? So we have consensus annotations for each COG in our set, and that'll be helpful later. Okay, pre-processing is done; now we get into the analysis, and now I'm going to take a break from the code, just to keep you a little longer, to tell you what we're actually doing. We're looking at how genes co-evolve over time. What does that even mean, right? So to start, I just want to talk about why they co-evolve in the first place. I mentioned before that proteins interact with each other. This can be through a variety of mechanisms. They can touch each other to make complexes. They could act on the same substrate, a variety of different factors. But when they work together, when they're functionally associated, they have a shared selective pressure on them over time. They are evolving in concert because they are doing the same thing or contributing to the same overall function. And this leads to a co-evolutionary signal that we can actually find, which allows us to infer backwards what that functional association is without knowing what the protein is in the first place. So we can cluster them based just on the signal and figure out what they're doing in the cell or in the organism of interest. And what that actual signal looks like comes in a lot of different flavors. People are probably acquainted with some of them, maybe. They fit into three main categories, which I'm happy to talk about more if people are interested. But broadly, the first could be conservation in how the residues of the proteins evolve.
So for instance, if you have physical contact between proteins, the points where they actually contact each other have to maintain their properties at the connection. Otherwise the complex will just fall apart as it mutates. And so this is a cartoon of a multiple sequence alignment. You can see there's consistency in the colors, which are amino acids, across this alignment at the yellow spots and the blue spots. And that's because these particular cartoon proteins physically contact at the yellow spots and at the blue spots, respectively. Excuse me. The second way we see this a lot is in conservation of location and direction. This is really common in a lot of organisms. The genes will locate to the same region of the genome and conserve their relative location to each other, as well as their transcriptional direction. So here we have a cartoon of a couple of genomes, with genes marked as arrows; the direction is their direction of transcription, and the color is which gene it is. And you can see across these genomes in this cartoon, there's very high consistency in which genes are where relative to each other, what direction they come in, and so on. And this encodes a particular secondary metabolite. This is from a journal article; it's a humimycin. Lastly, we have conservation of presence-absence. This is huge in bacteria because they gain and lose genes frequently. We study a lot of prokaryotes, so it's very important for us. The idea here is that if you need two particular proteins to do a task and you can't do it with just one, then there's a pressure to have both. If you lose one, you'll probably lose the other. If you gain one, it's really beneficial to gain the other one. And here we have a particular phylogenetic tree where we have presence-absence patterns for two particular genes. And all I want you to see here is that zero means they don't have that gene and one means they do. And there is a high degree of consistency.
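To give a feel for the presence-absence idea, here's a deliberately naive sketch in base R: score two genes by correlating their 0/1 presence profiles across genomes. The real methods in the package additionally correct for the phylogeny, since related genomes are not independent samples, so this is an intuition-builder, not the actual algorithm:

```r
# Two toy presence/absence profiles across 10 genomes (1 = gene present)
geneA <- c(1, 1, 0, 0, 1, 1, 0, 1, 0, 1)
geneB <- c(1, 1, 0, 0, 1, 0, 0, 1, 0, 1)

# Naive association score: Pearson correlation of the two profiles.
# Genes that are gained and lost together score close to 1.
score <- cor(geneA, geneB)
round(score, 2)
```

Here the two profiles disagree on only one genome, so they score highly, which matches the "either they have neither or they have both" pattern on the slide.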
Either they have neither or they have both. The key takeaway here is that there are a lot of ways that we see this signal. There are a lot of ways that proteins, and the genes that encode them, will mutate together over time. And we can find that, and that signal gets stronger the more genes we have and the better the tools we have to reconstruct their evolutionary history. And so that's what we do. In short, the tool that's implemented in the newest version of SynExtend is called ProtWeaver, named after orb-weaver spiders, because it's making a web of proteins and I thought it was really fun. It does a lot of this analysis for you. It looks at all these signals and it will produce a prediction of functional association. And so now the moment is finally upon us to actually do that prediction. And it is only two lines. We initialize a ProtWeaver object with our list of trees. Boom. And then we call predict. Woo, and it is going to do some computation. It shouldn't take that long. This is the longest step. I'm gonna let it chug. I kind of like watching the progress bars; it proves to me that I've actually done work. Anyway, that's it. It's two lines of code. We generate the object, we predict based on it. And if you're satisfied at this point, you shouldn't be, because we haven't even checked whether we've actually done anything. So we have a result object. This predict function gives us back what's called a ProtWeb object, because spiders weave webs and we have a web of proteins, and I promise you it's really smart. So if you print it out, it'll print out that we have this ProtWeb object. It'll tell you how it predicted it. It'll tell you how many genes are in there and also how many predictions we've made. It's really just a wrapper to stop it from printing out a ginormous matrix, and there are future updates planned for that.
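In code, those two lines look roughly like this. This is a sketch reconstructed from the description in the talk; `trees` is a placeholder name for the subset list of gene dendrograms loaded earlier:

```r
library(SynExtend)

# Initialize a ProtWeaver object from the list of gene trees.
# `trees` stands in for whatever your list of dendrograms is called.
pw <- ProtWeaver(trees)

# Run the coevolutionary analyses and get back pairwise
# functional-association predictions (a ProtWeb object)
predictions <- predict(pw)
```

That's the entire prediction step: construct the object, call `predict`, watch the progress bars.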
But I've written an extra method called GetProtWebData, which will just give you back the matrix or a data frame. So we can do that. We can look at it. Question. Yeah, yeah, so the question is why are there, let me print out the object again, why are there 3,916 predictions for 88 genes? The answer is that by default, it makes predictions for every possible pair. And every possible pair of 88 things, counting a gene paired with itself, is 88 times 89 over 2, which is 3,916. I will hope the math is right. If you wanted to do a subset, you could, and it will do that for you. So you could do maybe just 88 predictions, and then it would say it has 88 predictions and a bunch of NAs. Does that make sense? Yeah. I'd like to actually prove to you that we've done something cool and not just made a ginormous matrix that takes up a lot of space. So here is the matrix. It's pretty nice, I guess. But what we're doing in the next block is to validate that. There's a reason that we made those consensus annotations earlier, and that's because we're going to cluster this and look at what clusters we get out of our predictions. So these are all pairwise predictions. For example, this 0.299 value is the strength of co-evolutionary signal between gene 116 and gene 292 from our original dataset. And so what we're doing in the next part is, can I make this any bigger? A little bit, I guess. I'm just going to use a very off-the-shelf clustering method from igraph. If you stay with us for the next year, I'm going to write my own, and it'll be better, so stay tuned. But for now, just an off-the-shelf method. I'm going to set the seed for reproducibility so we hopefully get the same results, get the data out of our object with this GetProtWebData method, which just returns a matrix, and then build a graph out of it with the igraph package. So this is strength of association, which is essentially analogous to an adjacency matrix of a graph.
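The graph-building and clustering steps look something like this. It's a sketch: a small random symmetric matrix stands in here for the real pairwise-association matrix you would pull out with GetProtWebData:

```r
library(igraph)

set.seed(123)  # for reproducibility, as in the talk

# Fake stand-in for the pairwise association matrix:
# symmetric, zero diagonal, scores in [0, 1]
mat <- matrix(runif(25), 5, 5)
mat <- (mat + t(mat)) / 2
diag(mat) <- 0
rownames(mat) <- colnames(mat) <- paste0("gene", 1:5)

# Interpret association strengths as edge weights in an undirected graph
g <- graph_from_adjacency_matrix(mat, mode = "undirected",
                                 weighted = TRUE, diag = FALSE)

# Greedy community detection, the off-the-shelf method from the talk
comm <- cluster_fast_greedy(g)
membership(comm)
```

`membership()` returns one community label per gene, which is what gets mapped to consensus annotations in the next step.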
And so we can build a graph using that, and then we can cluster with a quick off-the-shelf algorithm, cluster_fast_greedy, which is a greedy clustering method that detects communities, areas that are highly associated. So, who's excited for results? Yeah, we have all these consensus annotations, and we have some communities. What we're going to do is map the consensus annotations onto each COG. For each gene, we look up its consensus annotation and we replace it that way, so that instead of a community of gene one, gene two, gene three, we have a community of the annotation for gene one, the annotation for gene two, et cetera. Does that make sense? All right, so that's what this loop here does. We run it, or we don't, or we do. It should be pretty quick. And then we print out our labels for the cluster, and we get this, which is really cool to me, but I don't expect it to be cool to you yet. So, just to recap the data we have and what we've done here, because it's quick but important. We started at the very beginning, before this workshop started, with just genome data. No annotations, no gene calls, just raw sequences taken off the internet from NCBI GenBank. We ran it through our pipeline using just DECIPHER and SynExtend. We ran it through all these analyses, used this object, and out of it we get this cluster of genes here, these eight, whose annotations the algorithm was unaware of. And these are the ones we have: the three subunits of urease, three urease accessory proteins, a nickel-cobalt transporter, and a viral integrase. And that is pretty cool for a couple of reasons. One is that urease has three components, and we have all of them here. We were able to successfully identify all the subunits of the urease complex without knowing what it is. Two, urease accessory proteins work with urease, by merit of their name. And we were able to find that too, right? So that explains maybe entries three through eight, but there are two weird outliers.
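The label-mapping loop being described can be mimicked in base R like so (the COG names, community assignments, and labels here are invented for illustration):

```r
# Hypothetical community assignments (one per COG) and consensus labels
memb      <- c(cog1 = 1, cog2 = 2, cog3 = 1)
consAnnot <- c(cog1 = "urease subunit alpha",
               cog2 = "viral integrase",
               cog3 = "urease accessory protein")

# Swap each community member's ID for its consensus annotation,
# then group the labels by community
communities <- split(unname(consAnnot[names(memb)]), memb)
communities
```

Instead of a community of opaque COG IDs, you get a community of human-readable annotations, which is exactly what makes the urease cluster recognizable at a glance.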
There's a nickel/cobalt transporter and a viral integrase. I don't really know anything about ureases, so I thought that was kind of weird, but I looked it up: ureases are actually metalloenzymes, and the metal ion that catalyzes the reaction is nickel, so prior work has implicated nickel/cobalt transporters as being involved with ureases. Unfortunately I don't have as fun an explanation for the viral integrase, other than to say it's probably facilitating horizontal gene transfer of elements in this complex, if I had to guess. So it's pretty cool that we've identified a cluster of genes that work together without knowing anything about them. I have no knowledge of ureases, and I found this, and it motivated further study. We could find out other information, and in the event we didn't have annotations, we could still find clusters and then investigate them later. We could find these communities of proteins that work together, give them to our wet-lab friends, and they can investigate them further. That's extremely powerful for generating hypotheses about things to investigate, as well as for finding possible new functional pathways for previously characterized proteins. Yeah, question? Yeah, there are three or four clusters. The other ones are not as good; this one is nice because it showcases a very small, clean cluster. I think the other ones are much larger. If you run the code you should have them, and I can look at them if you'd like. One of the others is very small and has both of the arabinogalactan genes as well as some transcriptional regulators and a component of the ribosome, which I would assume is where it binds and is translated. I think the third one is very large; it has a lot of elements. And part of the reason for that is that this clustering algorithm is not designed for this. It's just a very fast, quick, off-the-shelf clustering algorithm.
So the pairwise functional associations we find are very good; that's been benchmarked. The next step is to implement a custom clustering algorithm that does this, but better, so that's why you should stay tuned to our package for this specific application. But yeah, does that answer your question? Yeah, we started off with 88 genes, everyone got assigned to a cluster, and it decided by itself that there were four clusters. In future work we're going to design algorithms specifically for clustering gene communities in networks like this, and they should hopefully perform better. Right now there's an incentive for the algorithm to generate larger communities, whereas for biological interpretation we would prefer smaller communities, because they're more interpretable and more tightly linked. So that will be the focus of future work, to answer your question. Okay, sorry, quick drink. Okay, is everyone clear so far, following along? Any other questions before I keep going? Okay. So I want to get back to the bigger problem of accessibility of tools and applications. This is really cool for me, I love this stuff, but I don't expect everyone in this room to run out and start doing coevolution analysis on proteins. What people are doing is aligning sequences. They're building trees. They're calling genes from sequencing data, RNA-seq data, whatever, right? And so one of the things I want to really emphasize is that we've put a lot of work into DECIPHER and SynExtend to produce functions that are easy to use and quick, and that you don't need to understand the inner workings of for them to work well. If you do, and you want to go further, we have that capability for you, but you don't need it. For example, this entire analysis, the predictions at least, only took running ProtWeaver with no arguments except the input and then running predict on it. The same is true of our tree-building method, and the same is true of our alignment software.
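To make that concrete, the prediction step described above is roughly two calls. A sketch, assuming `MyGenes` is a hypothetical variable holding the prepared gene data (see the workshop vignette for the exact input format expected by the constructor):

```r
library(SynExtend)

# Build a ProtWeaver object from prepared gene data; defaults are sensible
pw <- ProtWeaver(MyGenes)

# Run the ensemble of coevolution predictors on all pairs of genes
preds <- predict(pw)
```

The point of the design is exactly this: the defaults handle the details, and arguments exist only if you want to dig deeper.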
Same with our gene-finding method. And to that end we've incorporated, why is it animated?, a lot of different functions that all do different things at every step of this pipeline: visualizing sequencing data, I'll use the pointer, aligning sequences, phylogenetic reconstruction, finding genes as well as non-coding RNAs, annotating genes with function, and finding functional associations. And it's not enough to just say we've done this and it's quick, right? The key thing is that they're easy to use, and every step of this outperforms or meets the literature. So we're at least as good as other tools out there at every step of the way, right? This is a toolkit for people who do genomics, that they can use without needing to be computer scientists, that they can just drop into and say, I want to align sequences, I run AlignSeqs, I get out an alignment. And that is good for us as a field. To that end, that's what DECIPHER and SynExtend are made for; that's what we are actively building and maintaining tools to do. Something else I wanted to showcase as a future direction is a web app I'm working on that does this, so you don't even have to write R code. You can just load in your sequences, oh, it's frozen, and browse them. And if you want to align them, you can just click align and you'll get an alignment after waiting a little bit. That crashed. Well, this is really new, the curse of presenting. I can promise you it does work; it was working a minute ago. Very early stage, but the point is what we're focused on, and I also just disconnected the monitor, so presenting is really hard for the computer, it seems. I'm doing great, don't worry about me. The point is that this is what we're doing at a big scale. We have this whole pipeline, and all of these are things you would conceivably need to do as someone working with genomics data.
I guarantee you there's at least one thing in our toolkit that is helpful to you. All of it should be easy to use, all of it handles the details for you, and all of it is detailed in the long workshop I put together. If you have questions on it, feel free to ask me or ask Eric; he develops most of this stuff. We're happy to take questions and also to build new functions if there are things people would really like to have that they don't see. So if there's extra time after questions, I'm happy to go back through other parts of the analysis; I'm really excited about building trees. But before I get there, I just want to take one quick minute to thank my lab, as well as people from Bioconductor, especially Erica for putting up with all my emails, Sean for helping me with all my bugs, and also Jenny. So this is my lab; they're really cool, they do work on genomics. And I can take questions. I'm not sure how I'm doing on time, but we have about 30 minutes left, so if people like, I can also go through how we build gene trees. But first we'll take a question or two. Where's the bigger tutorial you were talking about? The whole tutorial, sorry, yeah, that'd be good to know, is on the website, in all these tabs. If you go to Overview, it has an overview of the whole process, and then it walks you through how to download all the data and where all the example scripts are. There's a Dockerfile, and you can download the data yourself. Every page is dedicated to one piece of the pipeline: we have a page on how to visualize genomic data, how to find genes in it, how to find clusters of orthologous genes, how to build trees, and then the one we just went through. Other questions? Do people want to build trees? Yeah. Thank you. Thanks for your presentation. I just have a small question about implementation.
You mentioned you re-implemented everything. Yes. Do you have wrappers around existing methods, or did you re-implement literally everything? There are things we have not re-implemented. Our one dependency, as far as I'm aware, is on Biostrings, because we really trust the guy behind that; he's a big deal, wherever he is. He's in the front row. Aside from that, we have no dependencies, so everything is written in base R. The one dependency of SynExtend is DECIPHER, and everything is rewritten from scratch. Thank you. Yeah. Oh, sorry, I'll wait for the mic. Hi, great talk. Is there a way to get this information at a per-residue basis? So you could, for instance, provide information about contacts between two proteins that are evolving together? Yeah, there is. I have two responses to that. So the first is, sorry, gathering my thoughts. Yes, the short answer is that there are methods to do that bundled within this program. It runs a lot of predictors and then combines them in a way that is smart. If you want any of those individual predictors, you can access them directly; that's in the last part of the tutorial. You just specify which method you want, and they should all be documented in the docs. I will say that residue-based methods tend to be slower, so I'm working on making them faster, and we're actively adding more predictors and different ways to do this analysis. Right now we do an analysis using mutual information of the residues and how they evolve over time, and you can get that out using one of the built-in methods. And in the next release cycle we'll have more. I have a lot of planned improvements for this, one of which is dedicated methods for specific use cases: if you only care about residue-based methods, we can do just that; if you only care about presence/absence, we can do that. So that will be in the next release cycle, I hope, in October.
So stay tuned, yeah, to answer your question. Yeah. Sure. The next question I have, too, is: have you thought about ways to incorporate information from structural aligners? Have I thought about that? Outside of the... Great question. The short answer is I have definitely considered it; I haven't gotten there yet. I think it's really important information. I would like to implement a way to look at conservation of protein domains; there's been a lot of work in the past that's looked at conservation of domains as an indication of shared function, so that would be the next step. We already have tools, if I remember correctly, to find structural patterns in at least the RNA transcripts for genes, so it should be fairly straightforward to implement, correct me if I'm wrong? Okay, I'm getting a nod, so. I mean, it's extra workload for me, but yes: they're not in there yet, but it's a great idea, there's definitely a lot of information there, and it will be added as soon as I get to it on my to-do list. Any other questions on that? Happy to take more. Yeah, you're really asking a lot of me here. You had an example there at the bottom, yeah. Yeah, there is one, residue-level MI or something like that. This one is relatively slow, but I can run a quick one if I subset the matrix. Let's do it on maybe just the first 15 genes. Yeah, I don't know which ones the ureases are, if I'm being honest. Let's see if we can do that. Here comes live-coding time. So let's see, making our notebooks. Okay, let's see if we can do that. I'm not going to save this. So just to make sure we're on the same page, the output of this clustering algorithm is called clusters. Here is clusters, and we can look at all the elements in it. It's an igraph object, but it has all of the individual elements. So this one is presumably the one for the ureases. Let's check that out.
Great, so there's a vector of elements that we'd want to grab to look at just the ureases, and we can make that our subset. I can subset out values just like we did before, sorry for scrolling so much. I'll just subset the trees object to just these, and then we'll make a new ProtWeaver object and predict with it, using an actual residue-based method, which just tends to be a little slower. Let's run that and see how it goes. It's going fast. So this is part of the issue, and why there are planned improvements: there are a lot of comparisons to do, especially when you use residue-based methods and you're looking at every position. We're doing eight genes here, and if we look at the object, out of eight genes we made 36 predictions. This scales quadratically, but even more than that, it scales not just with the number of pairs; it also scales with how many residues you have to look at. And so it's difficult to make this scale to big numbers. Our goal is to apply this to every sequence that exists and to find functional associations across the tree of life, so we can get better automatic annotation. It's therefore really important to have a fast way to compress the data and analyze it quickly, so that we're not scaling on the order of n-squared genomes times the number of bases per genome. Anyway, let's look at the output. We get fairly low correlation across the board for these. Is there a way to see the individual residues? Is there a way to see the individual residues, which positions are covarying? Not yet. That's a great idea; I'll add it in. Good, I'm taking notes as we go so I know what to add for the next release cycle. This is fairly early-stage: the ProtWeaver part is the newest addition to SynExtend, very recent stuff. The rest of it has mostly been around for a long time.
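The live-coded subset-and-predict step looks roughly like the sketch below. Names here are illustrative: `MyGenes` is the prepared gene data from earlier, `subset_idx` stands in for the vector of gene indices pulled out of the urease community, and the residue-level mutual-information predictor is selected by name via predict's Method argument (check the ProtWeaver documentation for the exact method names in your release):

```r
library(SynExtend)

# Hypothetical indices of the genes in the community of interest
subset_idx <- c(3, 17, 24, 31, 42, 56, 60, 71)

# Build a ProtWeaver object on just those genes; residue-level
# predictors are slower, so subsetting keeps the runtime manageable
pw_small <- ProtWeaver(MyGenes[subset_idx])

# Run only the residue-level mutual-information predictor
preds <- predict(pw_small, Method = "ResidueMI")
```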
So aside from the ProtWeaver portion, everything else is very mature and very stable. Not to say mine is not good; it does great, but it will do much better with every release cycle, so stay tuned during my doctorate for that. Aidan, can you share your screen again? Oh, absolutely. Is it not sharing? Sorry, I'm having some issues with technology today, because I lost the monitor. Sharing again. Yeah, happy to take, oh, I can even look at the questions in the chat. Would you like me to do that? That was the only one? Too bad. That means you are crushing it. Woo! Or you're really boring. Well, I would like to say it's the first of those, and don't correct me, please. So since we have extra time, and if people have questions at any point, stop me, I just want to showcase what I'm talking about with the other functions. I'm going to go to the page on building trees, just to showcase what a workflow looks like there. I just came from a workshop that was dedicated to learning this kind of pipeline, not ours, but the very commonly used one with IQ-TREE, Prodigal for finding genes, et cetera, and it took us an extremely long time to get all the code and download everything. People are aware of those tools, so we won't worry about the details. But all it takes to build a tree in R, using our packages, once you have your sequences read into R, which you can do using Biostrings, is two lines. You align them with AlignSeqs or AlignTranslation, one aligns the nucleotide sequences, the other aligns based on their amino acid translation, and they're both good. Then we run TreeLine, which is also a new method, very nice, and it makes a tree for you. That's all it takes: two lines of code, still in R. You get back a dendrogram object that you can plot, and it looks like this for these sequences. You can run that yourself.
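As described, the DECIPHER tree-building workflow is short. A sketch, assuming `seqs.fasta` is a hypothetical FASTA file of homologous coding sequences (for some tree methods TreeLine may also want a distance matrix; the defaults and options are in its help page):

```r
library(DECIPHER)

# Read coding sequences into R (readDNAStringSet comes from Biostrings)
seqs <- readDNAStringSet("seqs.fasta")

# Align in amino acid space, then map the alignment back to nucleotides
aln <- AlignTranslation(seqs)

# Build a phylogenetic tree; TreeLine chooses sensible settings by default
tree <- TreeLine(aln)

# The result is a dendrogram object that base R can plot directly
plot(tree)
```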
For my phylogenetics people, you're aware there are a ton of methods used to build phylogenetic trees: maximum parsimony, maximum likelihood, a bajillion models you can specify. The sky is the limit with TreeLine, because we implement all of them. You can use any construction method you want: maximum parsimony trees, neighbor-joining trees, UPGMA trees, maximum likelihood trees. For people who aren't in phylogenetics, these are all real things; I'm not just saying letters. You can also specify whatever sequence evolution models you want and fine-tune as much as you want, but when you just run it, it will automatically select the best method based on the data you have and generate the best tree it possibly can, without you having to figure all that out yourself. Just read in the sequences, run TreeLine, and it gives you back a really good tree. And for people who use other tools, Eric can correct me if I'm wrong, but we are better than IQ-TREE and RAxML at tree reconstruction, and we're on par with RAxML-NG for short alignments and better for long alignments. We build better trees than the other tools, and we're on par with the latest iteration of RAxML, but we use a completely different method. So look forward to a paper on that very soon; we're in the process of writing it. Questions on any aspect of anything I've talked about at any point in my life or today? There is a question in the chat: how large a dataset can you work with on a single machine? Great question, and on a single machine is an important stipulation. So there are a couple of answers to that. You can process as much as you want to process.
It might be kind of slow for very large datasets, because it runs all pairs between all the genes you give it. The theoretical upper limit is the maximum size of a matrix in R, which is on the order of 65-ish thousand per side; I've been able to get 40,000 genes to run on my machine. The way we do this at scale is a little different. All these algorithms parallelize, and so our actual implementation at scale is different. We have a complete pipeline, and the code will be available at some point once it's a little more standardized, where we do all of our calculations at scale. We've been able to deploy it successfully and it runs pretty quickly: total compute time, if everything is completely parallelized, is about an hour, and that's for all available genomes for Actinobacteria, which is on the order of 3,000 genomes, resulting in on the order of a couple hundred thousand genes. The goal is to scale that larger; we want to be able to do everything. The next task is how to reconcile across different genera or different datasets. But right now everything parallelizes. We can do gene annotation in parallel very quickly, our calculation of COGs in parallel very quickly, and our calculation of coevolution in parallel very quickly. For instance, the residue method I ran earlier that takes a while, I've run on all possible pairs in parallel and it took maybe 10 minutes max; it's only about a couple of seconds per machine. So our development, I keep stepping on the HDMI cord, sorry everyone, especially our virtual attendees, I apologize for the screen share. The paradigm we've been going for is fast calculation of individual pairs, so that we can split the work up across many machines, and so far we've done that for every step of the process. So if you want to set it up in parallel, it's very quick.
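The split-pairs-across-machines idea can be sketched in a few lines of base R. This is just an illustration of the scheduling arithmetic, not code from the package: enumerate all comparisons among n genes (including each gene with itself, matching the 36-predictions-for-8-genes count from earlier), then chunk them so each worker machine gets an equal share.

```r
# All unordered comparisons among n = 8 genes, including self-comparisons
n <- 8
idx <- which(upper.tri(matrix(0, n, n), diag = TRUE), arr.ind = TRUE)
n_pairs <- nrow(idx)  # 36 independent jobs, as reported for 8 genes

# Deal the jobs out round-robin to, say, 4 hypothetical worker machines
n_workers <- 4
chunk_id <- rep(seq_len(n_workers), length.out = n_pairs)
chunks <- split(seq_len(n_pairs), chunk_id)

lengths(chunks)  # each worker gets 9 pairwise comparisons
```

Because each pairwise comparison is independent, the workers never need to communicate; only the final merge of results is serial.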
If you want to run it on your own computer, it is quick, as quick as it can be for the amount of computing involved. Sorry, long-winded answer. I was able to do 300 genomes on my computer, which resulted in 60,000 genes, and got around 40,000 genes through before I ran out of memory. So my computer didn't enjoy it, but it worked; it took maybe an hour or two. Yeah, any other questions, things I can talk about, things people are interested in, questions about your own workflows, things you would like to see? Any online questions? No? Then I can maybe give you guys a little time back. I know this is the last workshop of the conference, so thanks for sticking around so long, everyone. Appreciate it. Yeah, thanks for sticking around, and thanks for dealing with the technical difficulties; that seems to be the standard nowadays. If you have more questions or you think of something, feel free to talk to me or Eric about anything I've talked about today. Also please check out the website; I have a lot of content on there. So enjoy, and yeah, that's it. Enjoy the rest of the conference. Thanks, everyone.