Hi, can you all hear me? Great. So my name is Erik Wright. I'm from the Department of Biomedical Informatics at the University of Pittsburgh. And sure, that makes life easier. And I'll be talking today about the evolution of the DECIPHER package for comparative genomics.

DECIPHER started with a very simple concept, and that is to empower users. We do that by taking biological sequences in the form of genomes, genes, amino acids, nucleotides, whatever it is, and providing powerful tools that convert those sequences into results for users. So the concept underlying everything that I'll talk about today is to empower users. And I think this is relevant to everyone here from a couple of different perspectives. One is that if you are interested in having tools, we're always open to collaborating with people, and we'd be happy to work with you to make more powerful tools to add into our package. The second is that I'll be talking today about the ingredients that I believe are behind the success of DECIPHER, what's required to empower users, and how those ingredients have evolved in DECIPHER over time.

So the first thing is that I'm a believer that code can be long lasting and that the Bioconductor project is a part of making that happen. So I asked three of my trainees to look at the different domains of bioinformatics that they're working in, find other programs that are alternative ways of doing the same task, and rank them based on whether that code is still available, whether it is still compilable, and whether it was usable. So did it scale to their problem size? Was it limited to a web tool, or did it have dependencies that made it a failure for distribution, or something of this sort? And was it runnable? And if we just take a quick look at this, you can see that each of these dots is a paper.
And if you go back in time only 10 years, just about every single bioinformatics code that was released is no longer functional. There are some exceptions, but for the most part they're either not runnable or not available. And you could say, well, that's because we've gotten better at making bioinformatics software, right? You could say that now everything works because we're solving this problem. But I would argue that bioinformatics software generally has a short half-life. And I think the Bioconductor project is a really big part of making that half-life longer.

If you look at DECIPHER on Bioconductor, unlike many of the packages that we've heard about today and yesterday, it's been around for a long time. DECIPHER has been in Bioconductor for 10 and a half years. I've been developing it for 12, and it still builds, as you'll notice, on all platforms. So I'm very proud of that. The other thing you might notice here is that it has zero support questions in the past month. The reason for that, I believe, is that it's possible to support users by having well-documented code. As an example of that, DECIPHER has 12 vignettes, and we have another one coming in the devel version, so we're going to have 13 vignettes soon. I would really encourage the Bioconductor folks to create a competition for the most vignettes, because I'm pretty sure we'd win. It's so many vignettes that I've actually gotten rather bored with the process of creating them, and I've started to theme them. So I encourage you to go take a look, for example, at "The Magic of Gene Finding" and see if you can find all 20 different idioms about magic in that vignette. It's been rather enjoyable creating these since I started doing this. The other ones maybe are too boring.

The other thing that I think empowers users is that I believe it's possible for code to be multifunctional. A lot of code in bioinformatics is single function.
It does one thing and it has one name. And it seems like every single name on earth has been taken for a piece of bioinformatics software that has then unfortunately gone obsolete or is no longer available. But DECIPHER is one thing that does many tasks to empower users. What I'd like to do is walk you through the evolution of DECIPHER over time, because that didn't start on day one. You don't build a thing to do a bazillion things all from the outset. So I'll walk you through how that occurred with DECIPHER.

The first thing that I did back when I started this package was to construct a code for microbiome analysis, a very specific part of microbiome analysis that I won't get into. It actually took off and a lot of people used it, and it was sort of beginner's luck. But really DECIPHER was originally intended for oligonucleotide design, and you're going to laugh, because I mean microarrays, and microarrays really have gone extinct. So this is an example of a package that was originally intended for something that Bioconductor used to be a lot about, that we all used to work on a lot, and that is now something we don't even talk about anymore. Unfortunately, I was getting into the microarray business around the time that everyone was getting out of the microarray business and going to sequencing. So this was a bad idea, and nobody uses DECIPHER for microarray design anymore. But a lot of those concepts of microarray design were applicable to other problems in oligonucleotide design in general, such as FISH probes or PCR primers. There's still a small but substantial community of users who use DECIPHER specifically to design primers and think of it as primer design software.

But we didn't stop there. I wanted to expand DECIPHER into a lot of different problems I was encountering in the genomic space. And so I went on to do multiple sequence alignment. Maybe some of you have used DECIPHER's multiple sequence alignment tools before.
It's rather popular for this. In my opinion, it's one of the best multiple sequence aligners out there, and it's completely within R. There are papers about both protein multiple sequence alignment and nucleotide, specifically non-coding RNA, multiple sequence alignment. And DECIPHER just kept growing.

One of the problems I encountered relatively early on as a user of my own code is that it's very difficult to deal with the tens of thousands of genomes that are available now. So I developed a whole foundational database underpinning and published this. I think this is one of the underutilized parts of DECIPHER that's going to continue to grow and gain traction over time, not to give too much of a spoiler alert.

After this, I developed a code that has become a large part of what my lab does now, called FindSynteny. It's about finding synteny between genomes. You give it two genomes, which can be whatever organisms you want, and it will find synteny between them very rapidly. If you're interested in where DECIPHER is headed, one of my grad students, Aidan Lakshman, is here to give a workshop tomorrow afternoon that I hope you'll be able to join. He's done quite amazing work with the outputs of FindSynteny. We've actually turned those outputs into an entire other package, which we call SynExtend. That's being developed by three trainees in my lab, and SynExtend is on Bioconductor. It's been there for two years now. It also builds, as you'll notice, and it basically extends upon the synteny objects that DECIPHER can create.

After this, I moved into a new area, which was classification of sequences. There are two different forms in which we can do this. One is for microbiome data: you can classify a sequence to an organism and say that it belongs to this phylum or this genus, and that has really gained a lot of traction. But after we built that, there was a recognition that this is a problem that's more generalizable.
We can actually do this for protein functional classification as well. So now, if you have a protein, you can shove it into DECIPHER, into an algorithm we call IDTAXA, and it will pop out what that protein does. It does a very, very good job of that, and we've compared it to other tools for doing the same thing. That was published last year.

The direction that we're headed is a variety of things that are currently unpublished because we're working on publishing them. These are the relatively new things in DECIPHER. One is a gene caller: given a genome, find the genes. One of the major problems that we had in this particular task is that there was no way of knowing which genes were right or wrong. So we actually had to stop, go backwards a little bit, and develop a way of telling what the right answer was, basically benchmarking gene calls. We did that in the form of a package called AssessORF, which has been on Bioconductor for over three years now. AssessORF uses proteomics data to tell you, as best it can, whether a gene call is correct or incorrect.

After this, we moved into another area to empower users, which is detecting tandem repeats. Tandem repeats underlie the evolution of a lot of genes, and so we have a tool for doing that. And most recently we have a tool called TreeLine, which is an algorithm for making phylogenetic trees. So if you need to make a maximum likelihood, maximum parsimony, or neighbor-joining type of phylogenetic tree, DECIPHER can now do that using TreeLine.

So as you can see, DECIPHER is highly multifunctional, and we just keep on growing and growing and growing it. There are a lot of different functions in there that I haven't talked about. One of the challenges is that if you're going to make lots of functions, then to justify doing that you have to be state of the art in each of those functions.
And I believe it is possible to push the envelope in every single area of bioinformatics, that there's still room for improvement. I think we've shown that, but what's cool is that not only have we shown that for our own tools, DECIPHER has been around long enough now that other people have shown that for us. This is an example of a benchmark that was developed by the folks who gave us Clustal W and Clustal Omega, which are multiple sequence alignment programs that probably everyone has worked with and some of the most popular bioinformatics codes in the world. They developed a new benchmark a few years ago, and they decided to include DECIPHER along with MUSCLE and MAFFT and all these other alignment programs. They had two different ways of scoring, and what you can see here is that DECIPHER appears in the upper right, which means it has a high score under both. So DECIPHER has been shown to be very good for multiple sequence alignment not only by us but also by others, and slowly but surely this is happening for other things within DECIPHER as well. So that's really great to see.

I talked about how we built the AssessORF benchmark. We did that before we constructed a gene caller. We then compared our gene caller's gene calls to other ways of getting gene calls, and we found that in the AssessORF benchmark we also do much better. So I believe it's possible to be state of the art and to push the envelope in every single area of bioinformatics. There's still room for improvement.

Another thing that I believe is that R code can be both scalable and fast. From the different presentations here over the last few days, I think everyone would agree that R code can be scalable, right? There's big data, we all talk about it. But generally R code is thought to be relatively slow compared to other languages.
And you need your code to be scalable and fast because the number of genomes has grown very consistently, exponentially, for the past decade or so. It's projected that we're going to have on the order of 100,000 complete genomes in GenBank only two years from now, which is really remarkable. To be able to do anything at that kind of scale, you don't have a choice but to be scalable and fast.

The main way DECIPHER does that is that it's mostly written in C. Just like R itself, a lot of DECIPHER's code is actually C code. Another thing that we do in the package is that about half of our functions, anything that can be, is multithreaded. I'd like to show you why we do that and make a pitch for you all doing it in your own codes as developers as well. We all know about Moore's law, that the number of transistors on a chip increases exponentially over time. That's the brown line here, and because this is in log space, that's exponential growth. Moore's law has held now for many decades. But if you look at the performance of any individual thread, which is the blue line, it has more or less topped out. So the performance of computers on an individual-thread basis hasn't really improved, and there are physical limitations related to that. But what are they doing with all those transistors, given that there are so many more on a chip? What's really happening is that they're putting them into more logical cores. So you have more and more cores available to you as a developer. And so I encourage you, through OpenMP and other things like it, to really invest in this, because this is the wave of the future in chip development.

Okay, the last thing I'll say is that I believe it's possible for an R package to grow and grow and grow.
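The multithreading pitch above can be illustrated with a minimal sketch in C using OpenMP. This is not DECIPHER's actual code: the function names `count_gc` and `count_gc_all` are hypothetical, and the task (counting G and C bases in each sequence) was chosen only because every sequence can be processed independently of the others.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count G and C bases (either case) in one
 * NUL-terminated sequence. */
static size_t count_gc(const char *seq)
{
    size_t n = 0;
    for (const char *p = seq; *p != '\0'; p++) {
        switch (*p) {
        case 'G': case 'C': case 'g': case 'c':
            n++;
            break;
        default:
            break;
        }
    }
    return n;
}

/* Hypothetical driver: fill out[i] with the G+C count of seqs[i].
 * Each iteration writes only its own out[i], so no locking is needed.
 * If the compiler is run without OpenMP support (e.g. no -fopenmp),
 * the pragma is ignored and the loop simply runs serially. */
void count_gc_all(const char **seqs, size_t n, size_t *out)
{
    assert(seqs != NULL && out != NULL);
    /* dynamic scheduling helps when sequence lengths vary widely */
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < (long)n; i++) {
        out[i] = count_gc(seqs[i]);
    }
}
```

Because the loop body has no shared mutable state, the same source gives identical results whether it is built serially or with OpenMP enabled, which makes this kind of per-sequence loop a low-risk first place to add threading.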
DECIPHER, which started out relatively small and with not many functions, has now gotten bigger both in terms of number of functions and in terms of number of users. In fact, it seems to be growing roughly exponentially, and I hope this trend continues. So if you'd like, you're welcome to visit us at DECIPHER.codes. We have a whole website and a lot of tutorials and vignettes, as I showed, that make it possible to use DECIPHER. And we've actually taken a lot of the functionality and put it online for people who don't know how to use R.

So with that, of course, this is a package that, although coded mostly by me, has had contributions from many different people. A lot of work has gone into this, including from, at the bottom here, the three trainees I have currently who are working on DECIPHER and SynExtend. And I'd like to thank the Bioconductor folks for inviting me to give a talk and NIAID for funding. Thank you.

Yeah, thank you. Great talk. Love the focus on robust bioinformatic software that's maintained. How do you decide at what point you're going to add on to DECIPHER versus creating a new package? Is it on a function-by-function basis, or is it some other measure?

So DECIPHER basically has this foundational concept that we just want to empower users in the genomic space. If we have a new challenge in the genomic space, we're going to try to add functionality for it. That's become a little bit difficult with having multiple developers and students work on DECIPHER. That's why we bifurcated and became SynExtend plus DECIPHER. Everybody who's working on SynExtend is doing things with multiple genomes and trying to use the synteny objects as a framework, and that has sort of become the future direction of our lab. But as far as I know, there's no guiding rule of thumb for how to do that. I personally have defaulted towards doing more within one package rather than having many different packages.
Because I think what happens from a lab standpoint over the course of decades is that it becomes intrinsically impossible to maintain these many, many, many different codes. Not only that, but they overlap. For example, everybody needs to load in FASTA files. When they overlap like that, you have a lot of redundant functionality across the different packages or codes or whatever you're producing as a group, and it's a little bit wasteful from a coding standpoint.

If they do the contest, what should the prize be?

Great question. I have no idea.

Okay, so Ryan Thompson online says: regarding DECIPHER being multifunctional, what do you think of the merits of splitting those multiple functions into separate packages versus keeping them all in one?

Yeah, the advantage of putting them in multiple packages is that users start to equate a package with a function. I think we've shown that it's possible to not have them make that equation. But people want to be able to call something by a specific name. So what we've done is give our algorithms names. We have IDTAXA, which is an algorithm within DECIPHER, and we have TreeLine, which is an algorithm within DECIPHER, and we refer to them by those algorithm names. That helps users to say, you know, "I created a phylogenetic tree with TreeLine" instead of "I created a phylogenetic tree with DECIPHER," because otherwise users sometimes get confused that DECIPHER is also able to make multiple sequence alignments. So I think it's a possible way to do it, and it has the advantage that it allows users to talk about the thing without getting confused as to what they're all speaking about.

Paul, if you can hear us, we're ready for you to start sharing your presentation.