Now, without further ado, we have our next two speakers. Corey Hudson is a computational biologist at Sandia National Laboratories. Corey leads teams in cybersecurity, machine learning, synbio and genomics. His main work is modeling and simulating cybersecurity risks in realistic, large-scale genomic systems and highly automated synbio facilities. His co-speaker, Charles Fracchia, is a bioengineer who has worked at the intersection of biology and computer science for the last decade. He is the founder and CEO of BioBright, a company dedicated to making biomedical workflows more data-centric and secure. And now, without further ado, please welcome our speakers.

Thanks so much. We're going to go back and forth a couple of times on this. Charles and I are going to be talking today about a particular buffer overflow in a common piece of genomic software that leads to some really unexpected and unintended consequences, and about how this is a more general case of issues that are really common in biomedical file formats.

So over the last decade, really over the last two decades, the costs of genomics have rapidly declined. It was expected that we'd be at about a penny a genome, but really we're at about a thousand dollars a genome. It stayed pretty steady for a long time, but once it hit a thousand dollars, doing genomics in general became much easier and simpler than doing standard genetic tests. You get a lot more out of it, and there's a lot more reason to move to genomics in general.

This is being driven by two things. One is that there are health care drivers in genomics. Genomics allows you to segment populations based on disease prevalence, and it allows you to take an individual with a genotype that is unknown beforehand and identify the correct precision treatment for them. That's what we're talking about when we talk about precision medicine. The second main driver has been industry, which has been growing considerably and has been motivating a lot of the work happening in genomics.

The issue is that you have two highly commercial fields with high data sensitivity, industry and medicine, coming out of genomics, but this whole environment came out of a very different trust model. The trust model for genomics over the last 20 years has really been about academia and academic collaboration. There was no perceived need to secure the system, or to make sure you weren't writing code that could be incredibly risky, in large part because the individuals who were interacting all knew each other in some way and could trust each other. This is very similar to what happened with the rapid growth of the internet and the standardization that had to happen in the 90s. And so what this has led to is coalescence around standardization and automation within this field.

Yeah. So if you look at the birth of modern biomedicine, especially experimental biomedicine in the early 1600s, everything was manual. We would select a subject manually, with our hands, or dissect a cadaver with our hands. We'd observe what was happening ourselves, and then we'd make an analysis in our heads, and obviously some of those analyses, and some of that understanding, look comical these days.
But today things have completely changed. We're really turning into a completely digitally driven workflow, and I would say this has happened in the last 10 or 15 years, where this entire environment has moved from a manual process to a digital and automated workflow. So now we have liquid handlers that automate the subject selection. We have automated microscopes that take pictures and images of what we're collecting. And then we even have automated analysis, which is an area of very rapid growth right now, especially in industry, where we're using novel algorithms and statistical learning to describe what is going on in the sample itself.

So if you were to map out what the biodigital stack looks like today, you're actually using software at every single one of those steps. You're using it at the design step, where you use computer-aided design tools to generate the materials you will then order. Then you get those built, often as a service: you can order DNA and get it back, so you literally send bytes and you get DNA back, and those steps are automated with machines. Then there's the testing part, where you run an assay; that's perhaps the most manual step today, but it's very rapidly being automated. And finally, once you get the results from that test, you're using bioinformatics software to analyze them.

So what does that look like today? Just a quick example. As I said, you can actually design a plasmid in software, something like Geneious or, let's see, I'm blanking on the name of the other one, but there are a number of tools, Vector NTI or things like that, where you're using software to design your plasmid. You then order it from a DNA synthesis company and you get a vial back. Then you build that into your biological system using liquid handling robots. Once you do that, you culture it up and you end up getting plates like this. You pick individual clones and those get sequenced. Again, every one of these steps has both digital inputs and digital outputs. And at the end, you analyze the digital results using bioinformatics pipelines and tools.

So what has really snuck up on us in this field is that all of the critical workflows in biology and biological research rely on digital tools. So what happens if those are denied? What happens if there's a denial of service or a crash or whatever else on these tools? To understand what the impact would be, you really need to understand how deeply these tools are connected and where they come from in the first place. The reality is that the vast majority of these tools come from an academic environment, at least at the very beginning. Things like Biopython, Bioconductor, GENtle, GROMACS, the Integrated Genome Browser, all of these are academic software at their roots. And unfortunately, one of the issues with academic software, which often comes out of a PhD thesis or the like, is that it doesn't get supported once the student graduates, or rarely does. This is one of the major issues that we have. So just look at this list of alignment software. This is just genomic alignment software that comes from academia, and it's only a small list.
I don't know if you can read it, probably not, but it's to show you how many of these are either reproducing the same functionality or making a small improvement, and they end up as stale code on GitHub at best, or at worst just as part of a paper. On the business side, that's not to say there aren't tools that come from industry, there are lots of them, but they tend to follow the same kind of pattern. They're monolithic software that's been around for decades in many cases. The release cycles are not modular, not fast, not lean in most cases. So you have some of these tools that have been there forever and haven't been audited, and yet the trust model has changed.

So putting that into context, today what we're going to be talking about is a vulnerability in BWA, which is one of these academic, alignment-focused pieces of software. Corey is going to talk about the vulnerability, and then we'll wrap up a little bit later.

So the way that genomics works in terms of actual data flow is that you have a machine, typically an Illumina sequencer, that's producing terabytes of images. Those terabytes of images get transferred through a lot of pieces of software until you ultimately get some genetic variants, and once you have the genetic variants, you're making clinical decisions based on them. The reason the data gets passed through so many kinds of software is in large part the technology: the way it developed was to split the genome into really small pieces and then put those pieces back together, or map those pieces onto a reference. The software also has to interact with an environment. It's interacting with the sequencer, it's interacting with a firewall, it's interacting with government databases and with software repos. It's going through a lot of tracks and paths before it gets where it's going.

And so my team had a particular question we wanted to look at: could you hack the raw data, using an in-the-wild vulnerability in this BWA software or any software, to lead to a potentially malicious clinical outcome? Could you change it so that, using only standard security assumptions, the entire raw dataset was corrupt, you'd have to resequence every individual to get the data back, and the outcome would be completely different?

To do this, we looked at the main software pipeline that's been developed. This is the one that most of your hospitals, most of the places doing precision medicine, are going to run to collect variants: the Broad Institute's GATK pipeline. And the first tool, the very first one that isn't just basic pre-processing, is one called BWA. What BWA does is take a human genome and map the reads coming off the sequencer back onto it. But we all know there's not just one human genome, there are all of our human genomes. We're all more or less the same, give or take about 20 megabytes. Me and the person in this room who is the furthest distance from me are about 20 megabytes apart, but those 20 megabytes can be spread all over the genome. And so the more sophisticated technique, the one BWA added in 2014, was not to map to a single fixed genome, but to map to what they called an alt-aware index: one that knows there are alternative potential alleles at various points in the genome. This was rushed through; it was an add-on to the main software that showed up much later.
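To give a feel for the data moving through that pipeline, here's a toy sketch. This is emphatically not how BWA works internally (it builds a BWT/FM-index of the reference and tolerates mismatches); the reference string and the FASTQ record below are made up, and the point is only to show the shape of a read and what "mapping it onto a reference" means.

```python
# A toy illustration of the data shapes in this pipeline, not how BWA works.
# The reference string and the FASTQ record are invented for the example.

reference = "ACGTTAGCTAGGCTTACGATCGATCGGATCCA"  # stand-in for a chromosome

# One FASTQ record is 4 lines: read ID, bases, separator, per-base qualities.
fastq_record = "@read_001\nTTACGATCGATC\n+\nIIIIIIIIIIII\n"

def parse_fastq_record(text):
    """Split one FASTQ record into (read_id, sequence, quality)."""
    read_id, seq, _, qual = text.strip().split("\n")
    return read_id.lstrip("@"), seq, qual

def naive_map(read, ref):
    """Return the 0-based position where the read matches exactly, or -1."""
    return ref.find(read)

read_id, seq, qual = parse_fastq_record(fastq_record)
print(f"{read_id} maps to reference position {naive_map(seq, reference)}")
# A real aligner emits this as a SAM record (read, position, CIGAR, ...),
# and an alt-aware index also carries alternate haplotypes for variable regions.
```

The SAM records that come out of this step are what the rest of the GATK pipeline consumes on the way to variant calls.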
But it doesn't make sense for any particular lab to hold this kind of population data itself. It makes sense to have it at a major repository where everyone is collecting the open data and building this index for all the genomes out there, and this is done through NCBI. So BWA provides a tool that just downloads these indices, which are actually pretty big, from NCBI, and then the user stores them and runs the software as they like.

In terms of how the software interacts with its environment: we have a tool, BWA, that's interacting with the read data, which comes as unencrypted plain-text files called FASTQ files. It's interacting with GitHub through repo updates, and it's interacting with NCBI for these index downloads. Ultimately it produces something called a SAM file, which is just an alignment: for every individual you run, you see where their reads fit relative to a reference genome.

So when we went looking for vulnerabilities, well, BWA is written in C, so you kind of know where you're going with that, for your first steps at least. In auditing the software, the alt-aware part I mentioned shows up in 2014 as a new capability, and one of the easy first steps is to look at the code comments, or at the issue tracker, and see where it's segfaulting. When people run this, this particular part of the code right here segfaults, and that had already been recognized in the issues. So then we asked the question: could we induce the segfault? Could we take control of the buffer by doing that, and take some command over the system? So we developed a buffer overflow, delivered it to the software, and unsurprisingly got a segfault.

So we know how the exploit can happen at this point, but how do we get to the actual software? Let's assume the data that's in the database, or that's coming off as FASTQ, can't be touched; we're not allowed to manipulate that directly. Well, what we have is the interaction with NCBI. NCBI assumed a particular threat model: since they were providing open data, they didn't need to encrypt their traffic, and they don't provide checksums. It's common to deliver data over unencrypted channels, but then you need to provide checksums, otherwise there's no guarantee you've got what you wanted. And they're delivering these indices, which are 3.3 gigabytes in size, over an unencrypted channel. So what do you do? Sit on the network, deliver an ARP poisoning attack, and access the data as it comes through. You know where to look for it, because it has to come from this particular FTP site; you interact with it in the middle, change the file, and send it out the back end.
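The missing piece here is small. A minimal sketch of the kind of integrity check we mean is below; the filename and the expected digest are placeholders, and of course this only helps if the publisher of the index distributes real digests over a channel you trust.

```python
# A minimal sketch of the integrity check missing from this workflow.
# The expected digest and the index filename are hypothetical placeholders.

import hashlib

EXPECTED_SHA256 = "0" * 64  # placeholder: the digest you obtained out of band

def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file (these indices are gigabytes) and return its hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_index(path):
    """Refuse to use a downloaded index whose digest doesn't match."""
    if sha256_of_file(path) != EXPECTED_SHA256:
        raise RuntimeError(f"{path}: digest mismatch, refusing to use this index")
    return path

# verify_index("GRCh38_full_analysis_set.fna.alt")  # hypothetical index file
```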
So in terms of crafting the exploit, we set some particular limitations. We wanted to turn a single A into a C. We didn't want to change all the A's in the indices, just one particular instance in the genome, and we wanted it to affect the raw data, so that if this exploit was run, the entire raw database would be suspect and corrupted.

To design the actual target, we borrowed a trick from PCR. In the polymerase chain reaction, when you want to target a particular nucleotide in a wet lab, you design things called primers, which are sequences on either side of the nucleotide you're looking for. We decided to craft the exploit the same way, and we targeted a particular A on the 12th chromosome. It was a random choice; it isn't particularly clinically meaningful, but this can be done with any site, this was just the one we hit on. And seven base pairs upstream and nine base pairs downstream is all you need to target that site uniquely.

So the actual exploit ends up looking like this. You have the first part, in blue, which triggers the overflow. You then create a shell and issue some sed commands that overwrite every FASTQ file, turning the sequence at the beginning into the sequence at the end. If you look closely, all that differs between those two, apart from the regular expressions that have been added in, is an A being turned into a C. Then you live off the land a little bit: you take a hidden file that you've stashed inside this 3.3 gigabyte download, overwrite the index with the correct index, and restart the program so it runs as it's supposed to. At that point you've changed all the .fq files in the directory, so all your raw data is potentially corrupted.
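To make concrete how little context that is, here's a rough sketch of the targeting idea in Python. The flanking sequences are made up; this is not the payload from the talk, just the trick. And notice that it only touches reads in one orientation, which will matter in a moment.

```python
# A rough sketch of the flanking-context trick, with invented sequences.
# Seven bases of upstream context and nine bases of downstream context
# single out one position, the same way PCR primers single out a site.

import re

UPSTREAM   = "GATTACA"    # 7 bp of context before the targeted A (hypothetical)
DOWNSTREAM = "CCGGTTAAC"  # 9 bp of context after it (hypothetical)

# Rewrite the single A between the two flanks into a C, leaving the flanks alone.
pattern = re.compile(f"({UPSTREAM})A({DOWNSTREAM})")

def rewrite_read(read):
    return pattern.sub(r"\1C\2", read)

forward_read = "TTTT" + UPSTREAM + "A" + DOWNSTREAM + "TTTT"
print(rewrite_read(forward_read))            # the flanked A becomes a C

# Reads from the opposite strand show up as the reverse complement,
# and this forward-orientation pattern simply doesn't match them.
def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

reverse_read = reverse_complement(forward_read)
print(rewrite_read(reverse_read) == reverse_read)   # True: left untouched
```

Roughly speaking, a specific 17-base pattern is expected to occur well under once by chance in a genome-sized string (4 to the 17th is about 17 billion), which is why such short flanks are enough to pinpoint one site.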
So the question we had is: if we run this, what's the outcome? This is the pipeline, and we want to look at the very end of it: are the VCF files, the variant call format files, different? If you ran this, did you end up with an adverse clinical outcome, or an adverse potential outcome?

We set up three experiments. I'll say right up front that we worked with the BWA developers to issue a patch for this and did responsible disclosure, so a patch existed before we ran this analysis. In the first version, we run the standard analysis with no patch but also no exploit. We then run the analysis unpatched with the payload, where all we're doing is changing the A to a C. And then we run it patched, with the payload.

It's important to understand what we're trying to do here. Really, we're trying to see what you would see at the very end. Most of the people analyzing this data are not doing the IT and are not doing the bioinformatics beforehand. All they're getting is a variant in front of them. They're getting a report that this individual has a C genotype here, or an AC heterozygous genotype, and as a result they're making clinical decisions. There are about 200 medicines that currently require a genetic test. Those are being done with standard genetic tests by and large, but as genomics grows, like I was talking about at the beginning, there's no reason why you wouldn't do this with a genomics technique; you'd do a lot more tests at once as a result.

So when we ran the first experiment, the unpatched version with no exploit, what we found was that the raw data, run all the way through to the final result, always led to the genotype AA. We're diploid organisms, so there's an A in both positions.

Now, because of how we ran the exploit, and because the pattern that looks at the raw data doesn't know the direction of the DNA, it only changes reads in one direction; it doesn't change them in the other. So what we get is not a CC but an AC. So now this individual has a different genotype, potentially a different phenotype, and potentially a different clinical outcome. And we find that this is highly statistically significant. In the first case we ran, the probability that this would happen by chance is less than two to the negative 200, very highly non-random. In all the other cases it was similarly extreme; in any statistical test a clinician is going to run, you're going to end up with significance there.

And when we ran it on the patched version, because we wanted to see how this looked when we deployed the same experiment against the patch, thankfully and unsurprisingly the software failed, and it failed the way you hope it fails in this sort of situation: it just tells you it couldn't run, and it aborts.

So that's the result of the experiment. This is not the world's most sophisticated piece of malware or the most sophisticated vulnerability, but it shows you a new area of where the vulnerability space is. This is attacking raw data. It's using existing software. It's making very few security assumptions that aren't standard. And as a result, you can end up with some really bad possible clinical outcomes.

Yeah. So I'm just curious, how many people in the room are biologists or have a genetics background? Awesome. So the TL;DR of that is basically that with a very simple exploit, you can change the clinical outcome for a patient. You can change how a patient is going to be treated based on how they come out of this diagnostic test. Not something that we want to be happening in hospitals, which is why we're trying to raise awareness about these kinds of problems.

And that's precisely the point: this is a larger problem in our field. If we look at the biodigital stack as I explained it earlier, and we look at it with an adversarial approach, what are the areas we can actually attack? Well, pretty much all of them. The design step is vulnerable, the build step is vulnerable, the test step is vulnerable, and the bioinformatics analysis pipeline, which is what we were talking about today, is vulnerable. The broader issue we have on our hands in our field is that these vulnerabilities are all over the map, and we need to start fixing them.

So one way to look at it is: what are the instruments and tools at our disposal? There are sequencers, there are mass spectrometers; this is an abridged list of the instruments in our workflows. In fact, if you were at the Biohacking Village four years ago, or three years ago, I forget when I gave my first talk here, I showed how we reverse engineered freezers and fridges and how you could attack those too. Obviously, if you attack those, you can destroy all of the reference material in the freezers, which often constitutes millions of dollars of work.
And that's just one of them. If we look at this instrument list, let's look at the ones that have significant digital inputs, the ones that take files or digital input in the normal course of operation. Sequencers obviously do; mass spectrometers, same thing. Mass spectrometers rely on a lot of similar approaches: it's not a reference genome, but it's the same kind of idea, there are tons of reference materials that get imported and that are necessary to understand the output you're getting. Things like bioreactors: bioreactors are what's used to produce the medicines we all consume, so what would happen if you could deny those? Filtration machines, same thing, and so on down the list.

Not only that, but what do they output? I tried to make a very simple color gradient here of how much data they output. Sequencers output quite a bit of data. High-content imagers, which we work with a lot, it's not uncommon for those to output hundreds of thousands of files per day. So think about the human-computer interaction problem here: there are hundreds of thousands of files being output by a machine, overseen by one human, often, or a small team of two, and you're trying to comb through those to see if there are any exploits. Not going to happen. We just need better tools.

So again, what does the biodigital stack look like, and what can you attack at each of these steps? At the design step, you're going to be attacking software and file formats; those get consumed by the software itself to generate instructions. At the build and test steps, you're going to be looking primarily at firmware and operating systems. And finally, as we explained today with one example, you're going to be attacking software and file formats at the analysis step.

So the vulnerability landscape looks like this. On the software side, you've got things like privilege escalation and remote code execution, right in the software. Again, this is software that has been around for decades in many cases and not really looked at from a security perspective. On file formats, we explained today how the alt file can lead to a buffer overflow. You can also get more of a denial of service through corruption: what if you could simply corrupt the files going over the wire and stop the whole pipeline? You could deny a medical decision just by seizing up the pipeline, the workflow. Firmware, same thing: you can cause a biological or workflow-level denial of service. You can also mount a financial attack: if you time your attack just right, you can send something like 80 or 90 percent of the cost of goods of producing a drug down the drain. That destroys a pharmaceutical company's ability to produce the drug and eventually put it on the shelf so we can consume it.

And on the operating system side, I wanted to make a special slide for that, because this is what we typically see, and unfortunately it's very, very common. The OS is often Windows XP, or if you're lucky, Windows 7. It often has to be networked, because it's generating hundreds of thousands of files and you're not going to be moving those with USB sticks, although some of our customers do. And often it's connected using SMBv1.
So it's often vulnerable to EternalBlue. Now, the kicker is that the scientists are not allowed to upgrade any of these machines, because it impacts the machine's ability to actually work. And that's something that's really unique in the biological space: the scientists don't even have the means. We're not talking about the knowledge; we're talking about, literally, "I would like to upgrade this machine because I'm uncomfortable having a Windows XP machine that's connected to the Internet," and they're just not allowed to.

So this is how we feel when we look at the situation: everything's fine. And that's what we want to impart here: it is high time that we work together to fix these problems. Corey and his team and my team are working towards this, but we need more folks to have eyeballs on it. As I said, cyberbiosecurity is kind of unique because there's this tension between the scientists and the IT department, and every time, the IT department is going to lose, because the conversation goes like this. The IT department says, "Hey, you really shouldn't have this thing on the network, it's really awful." And the scientist says, "Yep, but I can't do my job if I don't have it." That's one of the issues: the scientist is always going to win, unfortunately. Not only that, but IT departments at pharma and biotech companies are often seen as a cost center. They're not really seen as an actor that can help and contribute to some of these issues.

So here's what needs to happen, and I said starting now, but really yesterday. One, we need hardened parsers for common formats (there's a rough sketch below of what we mean). They're really, really important, and we're working on some of these things, nothing we can announce right now, but if you are interested in this area you can help contribute; please come and talk to us afterwards. Two, and this is pretty obvious, we need bug bounties specifically for key software in our field. Why is that different from a traditional bug bounty? Because a very low-effort, very simple cyber vulnerability can lead to an enormous biological vulnerability. A very simple thing can lead to the denial of production of an entire drug, and there are examples out in the wild of pharma companies being hit by ransomware and the like, where it leads to very clear problems in the manufacturing supply chain. And three, instrument manufacturers should really publish file format specs and parser code. We are only going to be able to fix this if we put more eyeballs on it; keeping it private and proprietary is not going to help. I think I speak for everybody in this field when I say we all want to help fix it, so let's get our heads together and do that. And there is really encouraging news there: some companies, like BD for example, explicitly have a vulnerability disclosure process, which is really unique in our field. We have to have more of that. The Biohacking Village is very clearly spearheading and pushing the community side of this; we've got to get together and push this kind of thing year round.
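To make that first ask a little more concrete, here's a rough sketch of what a hardened parser can look like for a simple record format, FASTQ in this case. The length limit and the allowed alphabet are illustrative choices, not a spec; the point is that every assumption about the input is checked before it's used.

```python
# A rough sketch of a defensive FASTQ parser: bounded reads, structural checks,
# loud failures. Limits and the allowed alphabet are illustrative, not a spec.

MAX_LINE = 10_000              # refuse absurdly long lines instead of buffering them
VALID_BASES = set("ACGTN")     # upper-case only in this sketch

class MalformedRecord(ValueError):
    pass

def parse_fastq(handle):
    """Yield (read_id, seq, qual), failing loudly on anything out of shape."""
    while True:
        header = handle.readline(MAX_LINE + 1)
        if not header:
            return                                   # clean end of file
        seq  = handle.readline(MAX_LINE + 1).rstrip("\n")
        plus = handle.readline(MAX_LINE + 1)
        qual = handle.readline(MAX_LINE + 1).rstrip("\n")
        if len(header) > MAX_LINE or len(seq) > MAX_LINE:
            raise MalformedRecord("line too long")
        if not header.startswith("@") or not plus.startswith("+"):
            raise MalformedRecord("record structure is wrong")
        if len(seq) != len(qual):
            raise MalformedRecord("sequence and quality lengths differ")
        if set(seq) - VALID_BASES:
            raise MalformedRecord("unexpected characters in sequence")
        yield header[1:].strip(), seq, qual
```

Nothing exotic, just bounded reads, structural checks, and loud failures before any of the data is trusted.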
And so this really gets me to our final asks here. If you want to fund bug bounties, please come talk to us; we have some ideas on this and are pushing it in the community. We really want to help make this happen for real. If you're an instrument vendor, please come talk to us; the likelihood is we have a ton of your files and file formats, and ideas on how we can improve them. Again, as I said, we've got to get together and fix this. And finally, send us sample files. If you have sample files from your workflows, from your environment, or if you're an instrument vendor, that address goes to a Dropbox-style drop share. Just send us files and we will start looking at them. So with that, I think we'll take any questions, and thank you very much.

Yeah, so the question is whether you can locally store the indices, hash them, and make sure, as you're running the software, that you're running the index you think you are. Yeah, and this is what a lot of people do, actually, for reasons of speed. The reason people have done this is that they wanted the index locally so they could do a local query. With BLAST, for instance, a common NCBI tool, frequently you won't BLAST against the Internet database, you'll BLAST against your own internal copy. In a sense that limits you in some particular ways, and so you create a bit of a moral hazard out of it. The hazard shows up because you want to be searching against the most recent database. If the most recent database is being supplied over the FTP server, at some point you're going to have to get it. You can go out and query offline and then transfer it in some other way, but round about, you're going to have to get that data from there to your machine. So you may not have to sit on the network, but you're going to have to get that file over. So there's risk mitigation there, but there's also a potential scientific risk of not having the most recent database. That's the reason people continuously download NR, the non-redundant nucleotide database, from NCBI.

That's right. And those dots were designed, nonetheless, so that the regular expression was singly indicative of that particular region; it was pinpointed even with the regular expressions. The reason not to make the regular expression exceedingly long, or exceedingly strict, is that if there are any errors in the raw data, you'll catch the errors and fail to change them. So you want to change as many instances as possible with as small, and as specific, a regular expression as you can.

So, I don't know. One of the things that is an issue here is that we haven't had very much trust and communication around this. This is handled internally, and there's no necessary disclosure that an IT department is seeing this. You might have some.

Yeah, I can add to that a little bit, given how wide open the attack surface is. So just to be very clear, we have not seen evidence in the wild that this is being exploited. But what we do see, wide open, is very worrisome. The technical proficiency required for some of these attacks is very, very low. In all scenarios, you have to think about what the advantage of doing something like this would be, and from there you can start thinking about what kind of adversary would want to do what kind of attack. So the bottom line is that the absence of evidence is more a reflection of the fact that nobody's looked than of it not happening.
Incidents around the world indicate that facilities like this are actually targets for attack, but there's no publicly available evidence of that.