Hey there, my name is Cab. I lead the software team at DayZero Diagnostics, where we're working to change the way bacterial infections are diagnosed and treated by leveraging whole genome sequencing, machine learning, and software like Kubernetes to run bioinformatics tools at scale. Today we're gonna talk specifically about a bioinformatics tool called Kraken, which was developed in an academic research environment. It was developed at Johns Hopkins to solve some really complex and specialized problems, but it wasn't necessarily designed for scale. Here's where we're headed today. First, we'll lay some basic groundwork around microbiology and sequencing data. Then we'll talk about Kraken: both the complex technical problem Kraken is trying to solve and why Kraken is tough to scale. And finally, we'll discuss leveraging Kubernetes and modern infrastructure to run Kraken at scale, which leads us into the primary takeaway I hope for this talk, and it's not a technical takeaway. It's the recognition that some of our most complex and important scientific and healthcare problems, which require specialized scientific expertise, are going to necessitate researchers and engineers working together to not only solve the problem, but find ways to solve these tough problems at scale. And I feel pretty confident saying that this is more important now than ever. On the engineering side, especially in this community here at KubeCon, which is focused on modern scalable infrastructure, I think we should reflect on whether we're holding up our end of that bargain. So let's get into it. For a long time, the process of diagnosing and prescribing antibiotics to treat bacterial infections has been to actually grow the pathogenic bacteria in culture and perform tests on that grown bacteria to determine which bacteria is causing an infection and which antibiotics the bacteria is susceptible or resistant to.
And one of the main issues with this is that growing bacteria, literally feeding them and having them proliferate, and then testing them is not a very fast process. This can easily take two, three, four days to complete. And during that time, prior to having diagnostic information, physicians are forced to treat patients with bacterial infections with broad spectrum antibiotics. This is sort of like carpet bombing: you just hope the bacteria is susceptible to one of the antibiotics you provide. Not only is this toxic for the patient, it contributes to faster proliferation of antibiotic-resistant bacteria. However, it is now technically possible and economically feasible to instead sequence the DNA of the pathogenic bacteria and use that sequencing data, along with bioinformatics software and machine learning, to try to identify the organism causing an infection and determine which antibiotics that bacteria is susceptible to. And for us at DayZero, one of the technologies we use to do that is an Oxford Nanopore MinION sequencer, which looks a little bit like this over here. It starts at $1,000 and can produce a pretty significant amount of sequencing data in a few hours. So that's our microbiology state of the union. Now let's talk a little bit about the data that comes off of this sequencer. What we get out of this MinION sequencer is, for each sample sequenced, a file that's somewhere between 500 megabytes and two gigabytes compressed. And it looks a little bit like this. It contains repeated groupings called reads that each contain an identifier, a sequence, which is just a long sequence of A's, T's, G's and C's for the DNA bases, and a couple of lines about the quality of those bases. A single sequencing data file will contain tens of thousands to hundreds of thousands of reads, generally totaling somewhere between 100 million and two billion bases, depending on how long a sample is sequenced.
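To make the read structure described above concrete, here's a tiny sketch of what one record in that kind of file looks like and how it might be parsed. The read identifier, bases, and quality string are entirely made up for illustration.

```python
# Toy illustration of the read format described above: each read is a
# group of lines -- an identifier, the base sequence, a separator, and
# per-base quality scores. The read ID and bases here are made up.
sample_fastq = """\
@read_0001 example-identifier
ATGCGTACGTTAGCATGCATGCATCGATCGA
+
IIIIIIHHHHGGGGFFFFEEEEDDDDCCCCB
"""

def parse_fastq(text):
    """Yield (read_id, sequence, quality) tuples from FASTQ-style text."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        read_id, seq, _sep, qual = lines[i:i + 4]
        yield read_id.lstrip("@"), seq, qual

reads = list(parse_fastq(sample_fastq))
```

A real file would contain tens of thousands of such records, so in practice a streaming parser matters far more than this minimal version suggests.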
And just for reference, that little MinION sequencer we looked at on the last slide can sequence over 90 samples concurrently on that one small device. Each read in the sequencing data represents a small section of the genome of the organism that was sequenced. The reads aren't necessarily in any order, they might be overlapping, they may be duplicates, and they generally vary in length. And if figuring out where these reads should be aligned to the genome of the organism you put into the sequencer sounds like a hard enough problem, it gets much, much harder when you don't know what you put into the sequencer. If you don't know the organism that was sequenced, you really have very little information to work off of to make any sense of these reads. It's sort of like having a 100,000-piece puzzle made up of only four colors, with no clue what the final assembled puzzle looks like. So there are dedicated specialists thinking about how to effectively and correctly put this massive puzzle together. These are the computational biologists and the bioinformaticians, and they've produced an entire ecosystem of tooling to try to take these reads, with very little information, and turn them into something meaningful. One of the tools in that ecosystem is called Kraken, which is a taxonomic sequence classifier. Kraken takes reads and tries to determine which species' genome a given read sequence belongs to. So given a sequencing data file, like this one in the bottom right, Kraken will say this read belongs to this weird species and that read belongs to this other weird species, all the way down the file. In a nutshell, for each sequence of bases, or read, that Kraken receives, it looks at all of the subsequences comprised of 31 bases and, from there, tries to determine all of the bacterial species whose genomes contain each specific subsequence.
After doing that for every subsequence in the larger read, it classifies the read as belonging to the species that was assigned the largest percentage of subsequences. So the entire read gets called as the species that got the most of these 31-base subsequences. The actual process and data structures are quite a bit more complicated than this, but even just this, you can probably guess, is a pretty computationally intensive process. On top of that, Kraken requires a reference database to compare each subsequence against. And that Kraken database can vary in size based on the number of organisms you wanna classify and how confident you want to be in your classification. Our database at DayZero contains primarily the bacteria we're most interested in and comes to 350 gigs in total size. That entire database is loaded into memory at runtime, which in general is achieved using a RAM-backed disk, which allows us to keep the database in memory over multiple executions of the Kraken command line tool. Once that 350-gig database is loaded into memory, Kraken might take anywhere from a few seconds to 10 or 20 minutes to classify all the reads in a sequencing file, based mostly on the amount of data given to Kraken and somewhat on the complexity of that data. So at this point, we've identified a complex and important problem and talked through the specialized scientific expertise, both in sequencing data and in tooling, required to begin to attack this problem. Now we can start to talk about how we might also attack this problem at scale, and how we might leverage Kubernetes to do that. At DayZero, we're currently a software team of four working alongside computational biologists and bioinformaticians to reduce the computational turnaround time for bacterial species identification and antibiotic resistance profiling to less than an hour.
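The classification idea described above can be sketched in a few lines. This is only a toy illustration of the majority-vote-over-31-base-subsequences concept; the real Kraken data structures and lookup machinery are, as noted, far more sophisticated, and the reference index here is a hypothetical plain dictionary.

```python
# Toy sketch of the idea: slide a 31-base window over each read, look
# every subsequence (k-mer) up in a reference index mapping
# k-mer -> species, and call the whole read as the species that
# claimed the most k-mers. The index structure here is hypothetical.
from collections import Counter

K = 31  # the 31-base subsequence length described above

def classify_read(sequence, kmer_index):
    """Assign a read to the species whose k-mers it shares the most of."""
    hits = Counter()
    for i in range(len(sequence) - K + 1):
        species = kmer_index.get(sequence[i:i + K])
        if species is not None:
            hits[species] += 1
    if not hits:
        return "unclassified"
    # The species with the largest share of k-mer hits wins the read.
    return hits.most_common(1)[0][0]
```

Even this naive version hints at why the process is so memory- and compute-hungry: a single read of length L produces L − 30 lookups, and the index must cover every 31-mer of every genome you want to classify against.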
And we wanna not only be able to produce meaningful data in less than an hour from input sequencing data, but we wanna be able to do this on lots of sequencing data in parallel. That's really a change from the early development at DayZero, which was research-oriented bioinformatics pipelines that didn't need to focus on scale and were focused mostly on solving this complex scientific problem, which is a hard enough problem in itself. Bioinformatics workloads in a research environment at DayZero would generally run with process-level parallelism on standalone, large virtual machines. But as we needed to scale, and considering our small engineering team, this was a great opportunity to leverage Kubernetes. For these bioinformatics pipelines, we could leverage Kubernetes Jobs, which give us dedicated, repeatable environments, decouple each pipeline from all the others, and give us improved reliability with pretty limited engineering investment. But on top of that, we knew that if we wanna use Kraken among the bioinformatics tools in these pipelines, we've clearly got a problem: each pod that we want to run Kraken in would require 350 gigs of RAM. And at DayZero, we're looking to run 50, 100, 200 bioinformatics workloads in parallel. It's just not feasible to deal with this large memory footprint at really any scale at all. So instead, we can centralize Kraken to control the parallelism of our Kraken service separately from the parallelism of our primary bioinformatics workloads. Alongside the Kubernetes Jobs running our bioinformatics pipelines and other bioinformatics tools, we can stand up a Kraken API: we put together a Kraken Deployment that runs our API and easily stand up a Kubernetes Service in front of it.
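A centralized Kraken Deployment with a Service in front of it, as described above, might look roughly like the following. This is a hypothetical sketch: the names, labels, image, node selector, and resource figures are all illustrative, not DayZero's actual manifests.

```yaml
# Hypothetical sketch: a Deployment running the Kraken API behind a
# Service, pinned to a high-memory preemptible node pool. All names,
# labels, images, and resource figures are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kraken-api
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kraken-api
  template:
    metadata:
      labels:
        app: kraken-api
    spec:
      nodeSelector:
        cloud.google.com/gke-preemptible: "true"   # assumed GKE-style label
      containers:
        - name: kraken-api
          image: registry.example.com/kraken-api:latest
          resources:
            requests:
              memory: 360Gi   # headroom for the ~350-gig database
---
apiVersion: v1
kind: Service
metadata:
  name: kraken
spec:
  selector:
    app: kraken-api
  ports:
    - port: 80
      targetPort: 8080
```

The key design point is that the pipeline Jobs never carry the memory footprint themselves; they just make HTTP requests to the `kraken` Service, so pipeline parallelism and Kraken parallelism scale independently.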
And from our primary bioinformatics pipelines, we just make requests directly to that Service. On top of that, we can leverage Kubernetes to support a cost-effective way to manage Kraken by running our Kraken API on a node pool made up of large preemptible nodes to support the large memory footprint Kraken requires. And this is great. It allows us to support our bioinformatics pipelines at some scale and run Kraken at some scale. Stepping up a level, Kubernetes still allows our small engineering team to support an even more complex system that provides a variety of Kraken workloads and use cases. We do things like utilize KEDA to add redundancy to our Kraken deployment during workdays, when most active sequencing is happening. So we generally have a preemptible node that provides a single replica of our Kraken API, but it can be preempted at any time and takes some time to come back up. During the day, we can spin up a non-preemptible second replica of our Kraken API, which gives us a little more reliability when sequencing is more likely to happen. Additionally, we can deploy a second Kraken service alongside our original deployment, with a database that contains fewer species, for cases where we wanna prioritize speed over accuracy. So this starts to become a pretty complex system that supports lots of use cases, some of them dynamic, but is still possible for a pretty small engineering team. Although I think once we get to this kind of system, we're probably at the point where having engineering expertise working with the bioinformatics group and the computational biologists becomes critical to supporting these more complex systems.
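The daytime-redundancy pattern described above maps naturally onto KEDA's cron scaler. A sketch of what that could look like, assuming a separate non-preemptible Deployment named `kraken-api-ondemand` (that name, the schedule, and the timezone are all illustrative):

```yaml
# Hypothetical sketch: use KEDA's cron scaler to bring a second,
# non-preemptible Kraken replica up during working hours, when
# sequencing is most likely, and back down at night.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kraken-daytime
spec:
  scaleTargetRef:
    name: kraken-api-ondemand   # assumed non-preemptible second Deployment
  minReplicaCount: 0
  maxReplicaCount: 1
  triggers:
    - type: cron
      metadata:
        timezone: America/New_York
        start: 0 8 * * 1-5      # scale up weekdays at 08:00
        end: 0 18 * * 1-5       # scale back down at 18:00
        desiredReplicas: "1"
```

This keeps the expensive on-demand capacity around only when it's likely to be needed, which is exactly the cost/reliability trade the talk is describing.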
And then on top of that, if there are use cases where we need repeatable, movable, larger systems, we can step up another level and leverage Helm, for example, to support non-HIPAA-compliant and HIPAA-compliant workspaces that include our bioinformatics pipelines, our Kraken deployments, and the tooling that comes alongside them, like KEDA and Prometheus. So with some engineering support working alongside the research side, we can get to where we support pretty complex systems in real-world use cases at some scale, and the value that Helm, Kubernetes, KEDA, and Prometheus provide to this scientific problem is pretty massive. That's not to say there aren't trade-offs: as we know, leveraging Kubernetes for these scalability benefits comes with complexity, which is where having engineering expertise contributing to this scientific application is critical. Outside of the standard Kubernetes complexities, Kraken itself exercises some edge-case Kubernetes features. We leverage an init container to load the 350-gig Kraken database into a memory-medium emptyDir volume before the API becomes available. So we have to deal with some features that may be a little less mature and a little non-standard, and sort out how we can effectively deal with these edge-case Kubernetes features, which is where the engineering side has to provide input and expertise to the bioinformatics and computational biology side. So I'll say it again: software like Kubernetes, Helm, KEDA, and Prometheus working together provides pretty massive value for solving these types of difficult problems in real-world scenarios. And Kraken is complex software, so our team is not going to reach into that code base for performance improvements.
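The init-container-plus-memory-medium-emptyDir pattern just described can be sketched like this. Paths, images, and names are hypothetical; the Kubernetes mechanics (`emptyDir` with `medium: Memory` is tmpfs, i.e. RAM-backed) are standard.

```yaml
# Hypothetical sketch of the pattern described above: an init container
# copies the ~350-gig Kraken database into a memory-medium emptyDir
# before the API container starts, so the database effectively sits in
# RAM across requests. Paths, images, and names are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: kraken-api
spec:
  volumes:
    - name: kraken-db
      emptyDir:
        medium: Memory          # tmpfs: volume is backed by RAM
  initContainers:
    - name: load-db
      image: registry.example.com/kraken-db-loader:latest
      command: ["cp", "-r", "/db/.", "/ramdisk/"]
      volumeMounts:
        - name: kraken-db
          mountPath: /ramdisk
  containers:
    - name: kraken-api
      image: registry.example.com/kraken-api:latest
      env:
        - name: KRAKEN_DB_PATH   # assumed app-level setting
          value: /ramdisk
      volumeMounts:
        - name: kraken-db
          mountPath: /ramdisk
```

One caveat worth noting: a memory-medium emptyDir counts against the pod's memory consumption, so the pod's memory sizing has to account for the database plus the API process itself.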
And instead, scalable infrastructure is a really powerful option for us to work together with our research side to solve some of these problems at scale. So that brings us back to the take-home once more: these really hard but really important problems are going to require academic researchers and engineers working together. And looking at the KubeCon 2020 schedule, I think we should reflect on whether we're making a large enough contribution on the engineering side. I could only find one talk that I thought fell into this category: what sounds like a super interesting talk about Kubernetes' contribution to research around the epidemic, by Chris Nova and Dr. Beta. I hope that at future KubeCons we have the opportunity to see more talks like this, talks that highlight this community contributing to scientific research and healthcare. So if you haven't already, please go see that talk, because I'm sure it's gonna be great. And I encourage you to support talks, projects, and initiatives that apply best-in-class infrastructure solutions to scientific research and healthcare. In general, I think we should continue to make a concerted effort to be as welcoming as possible to other communities with non-engineering expertise that would get huge value from leveraging software like Kubernetes to solve the problems they're working on. And if you're interested in Kraken, in the intersection of research software and software like Kubernetes, or just in some of the stuff we're working on at DayZero, please don't hesitate to reach out. I've also added two of our engineers from DayZero here, Zach and Tim, who are working daily on some of these complex and specialized problems and on how we can apply scalable engineering principles to our research.
So if you have questions about any of this, Zach and Tim are the folks to connect with. Otherwise, thanks so much.