 Hi, I'm Katie Doroschak. I am coming from the University of Washington, from the Paul G. Allen School of Computer Science and Engineering, with the Molecular Information Systems Lab, where we are really interested in doing interesting things with storing information in DNA and different types of sensing, which kind of appeals to the DIY audience a little bit. So the project that I'm going to talk about today is called Porcupine, which is doing rapid and robust tagging of physical objects using DNA with highly separable nanopore signatures, or in a little bit more accessible terms, we tag stuff with DNA and we use nanopore. That's really all we're talking about here. So when we're talking about molecular tags, we are using DNA to identify physical objects. This could be in the same world as QR codes and RFID tags, but essentially we want to be able to tag an object with DNA, ship it or store it somewhere, and then be able to read it back on the other side and say: is this the same object? What's the information in the tag, or is it completely wrong? Some applications of this include tracking and provenance, so you might have a high-value item that you want to make sure is the same at the beginning and end of the transaction, as well as secret exchange and counterfeit detection. So maybe you've got a set of pills and you only want to be able to sample a few of them at a time. You could do that with a system like this. And we ended up using the nanopore MinION device to detect these molecular tags. Some of our system requirements were that we wanted this to be DIY end-to-end by non-experts and not require a full biolab. I myself am a computer scientist and this is something that I can do with maybe a little bit of supervision, so it's not too challenging. And we wanted to be able to generate arbitrary tags on demand without having to do more DNA synthesis. This is one of the most expensive parts of encoding information when using DNA. 
And so if we can do that ahead of time and just copy it instead of having to generate new DNA for every piece of new information, that can save a lot of cost. We wanted to be able to decode quickly and accurately and also use minimal special equipment. So the nanopore really fits pretty well with this kind of application. One challenge for typical sequencers is that they're often inaccessible to pretty much everybody except for very, very well-funded labs. These cost anywhere from tens to hundreds of thousands of dollars, which doesn't necessarily mean that they're bad, but they're just not very useful for DIY. And they're pretty large. They sit on a big bench top. I'm sure many of you have seen them before, but it's really hard to compare them to this candy-bar-sized device that can just be plugged into your laptop. And here's a picture of it plugged into a laptop. All the things that you need to run this are very small, like pipettes and a small centrifuge. So it doesn't require the whole lab. And just to briefly run over how this works, I took this diagram from a Nature article. Basically, there is this thin membrane that has little nanoscale pores in it. And there's an ionic current that is being run across this thing. And the current is being measured over time. And then the DNA is prepared such that there is an enzyme that will unwind the DNA for you, and it will flow through this pore. As the DNA gets unwound and flat and is flying through, the current changes a little bit. And so you can go back and tell what DNA was in the pore based on what the current looks like. There's this little diagram in this corner here that shows basically different ionic current traces for each base. Now, typically these will all be concatenated together and will go back up to this open channel state. But it's really just a time series of ionic current measurements instead of actually directly reading the bases. 
So when we are creating a molecular tag within Porcupine, what we start out with is just like any other digital tag, any RFID tag or QR code or anything: it's basically just digital information, a bunch of bits, ones and zeros. And each different bit is assigned a different type of DNA. So what will happen is you've got physically a little vial for each one of this particular set of molecular bits, or molbits, as we call them. So when you have a bit that is one, you'll pipette that bit into the molecular tag mixture. And if it's zero, you completely leave it out. So we're really encoding information via presence or absence of these different types of DNA. Then we will apply this tag mixture to an object and dehydrate it, ship it or store it somewhere, and then rehydrate the molecular tag using a buffer solution. We'll then load it into the MinION and read it out using some software that we created. When we are actually creating these molecular tags, we're not just encoding the information directly. We've got a step in there that is adding some error correction. So we have a digital tag that's a little bit shorter and a codeword that's longer, which has some additional bits in order to add some error correction, which I'll talk about more later. But then we are going to the molecular bits and the molecular tag. A single molecular bit is made of a unique sequence and a specific length. So this is the part that lets us create more bits without having to synthesize more DNA. So we have our barcode sequence at the very beginning of the strand, then a spacer, and then another barcode sequence and a sequencing adapter. And typically when encoding information in DNA, the information is recorded throughout the entire strand. And so I often get asked why: why wouldn't you take advantage of the incredible density of DNA to make this happen? And really this comes down to creating more bits without having to synthesize more DNA. 
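As a rough illustration of the presence-or-absence encoding described above, here is a minimal Python sketch. It is not the Porcupine software: the function names and the 8-bit example tag are made up for illustration, and the set of indices just stands in for the vials that get pipetted into the mixture.

```python
from typing import List, Set


def tag_to_mixture(bits: List[int]) -> Set[int]:
    """Return the indices of the molecular bits to pipette into the tag.

    A 1 means that bit's DNA goes into the mixture; a 0 means it is left out.
    """
    return {i for i, b in enumerate(bits) if b == 1}


def mixture_to_tag(present: Set[int], n_bits: int) -> List[int]:
    """Rebuild the bit string from which molecular bits were observed."""
    return [1 if i in present else 0 for i in range(n_bits)]


# Hypothetical 8-bit tag, purely for illustration.
bits = [1, 0, 1, 1, 0, 0, 0, 1]
mixture = tag_to_mixture(bits)
recovered = mixture_to_tag(mixture, len(bits))
```

In this idealized version every present bit is observed and no absent bit is, so `recovered` equals `bits`; real reads are noisy, which is why error correction comes in later.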
We start out with our nice little well plate of 96 sequences. And if we just add two different lengths, we have 192 molbits without ever having to go back to ask IDT for more short strands for us. And then the second part of it is that having this unique identifiable region means that we can avoid basecalling. So when we turn this complicated nanopore signal into bases, it takes a very, very long time and it's very complex, and that is the largest source of error in the pipeline when working with nanopore data, whereas we actually have a much simpler problem. We're not trying to identify any DNA, just the DNA that we know is already there. So we can turn this into a classification problem instead of a decoding problem. That saves a lot of time and increases our accuracy pretty dramatically. And we design molecular bits to have distinct nanopore signals. We do this using a tool called Scrappie that is produced by Oxford Nanopore itself. So we can give it a sequence and it will produce a theoretical signal. And then this is the actual nanopore signal. And we didn't really cherry-pick one of these. They kind of all look similar, maybe a little bit stretched or narrower, but this allows us to really design these sequences to look different, which makes our problem a lot easier. When we are designing these sequences, we're using an evolutionary process. We're starting out with an initial batch of them. We throw our 96 in our virtual well plate, simulate what they look like, and then compute how different they are. And we will then start an evolutionary process where we shuffle them, mutate them, and then make sure that we're actually improving things. It's kind of like a guess-and-check method that will make these look visually different. And basically the lighter colors here are more similar. So they look like similar squiggles. And then you can see this guy looks very different from this one after this full process here. 
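The guess-and-check loop described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the per-base current levels are invented stand-ins for a real signal simulator, the Euclidean distance is just one possible way to score how different two squiggles are, and the batch size, sequence length, and step count are arbitrary.

```python
import random
from itertools import combinations

BASES = "ACGT"
# Invented current levels per base; a real design would simulate the signal instead.
LEVEL = {"A": 0.0, "C": 1.0, "G": 2.0, "T": 3.0}


def simulate(seq):
    """Toy stand-in for a signal simulator: one current level per base."""
    return [LEVEL[b] for b in seq]


def distance(a, b):
    """Euclidean distance between two simulated signals (an assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(simulate(a), simulate(b))) ** 0.5


def min_pairwise(seqs):
    """Score a batch by its two most similar sequences."""
    return min(distance(a, b) for a, b in combinations(seqs, 2))


def mutate(seq, rng):
    """Replace one randomly chosen base."""
    i = rng.randrange(len(seq))
    return seq[:i] + rng.choice(BASES) + seq[i + 1:]


def evolve(seqs, steps, rng):
    """Mutate one sequence at a time and keep the change only if the batch is no worse."""
    seqs = list(seqs)
    best = min_pairwise(seqs)
    for _ in range(steps):
        idx = rng.randrange(len(seqs))
        candidate = seqs[:]
        candidate[idx] = mutate(seqs[idx], rng)
        score = min_pairwise(candidate)
        if score >= best:
            seqs, best = candidate, score
    return seqs, best


rng = random.Random(0)
initial = ["".join(rng.choice(BASES) for _ in range(20)) for _ in range(8)]
start_score = min_pairwise(initial)
designed, final_score = evolve(initial, 500, rng)
```

Because a mutation is only accepted when the minimum pairwise distance does not drop, `final_score` can never be lower than `start_score`; that is the "make sure we're actually improving things" check in miniature.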
I don't want to pretend like this is the first time anybody has come up with anything that works with raw nanopore data. There's in particular a group that is working with demultiplexing. And for anybody unfamiliar, multiplexing is a tool that is used to add barcodes onto a sample so that any reads that you get back out on the other side can be associated with a particular sample. And then at the end, you can go separate things out and make it so that you are only working on one sample at a time. And what they've done in this case is taken the barcodes that Oxford Nanopore has produced for multiplexing. They found a subset of them and can identify those using the raw nanopore signal. One challenge that they have is that they didn't have the ability to design the barcodes specifically to be identified later, meaning that they are working with a much more challenging problem than we are. But it's a pretty cool tool and it's been very useful for folks. And then another one: currently you can't just go buy multiplexing barcodes for RNA. And so there's a group that developed four barcodes that they could then identify, which is really similar to what we're doing. We kind of developed it independently, but it is a similar process. The way that all of these work is by classifying them similarly: we're trying to just identify which little barcode is present in the individual reads. For our training data, we label the squiggles using sequencing data and spread all the bits across a bunch of different runs, then test on data with half of the bits. And we ended up using a five-layer CNN with a fully connected layer and softmax. And this is all stuff that I'm happy to answer more questions about in greater detail later if desired. Our classification accuracy is very high. And the only reason that I actually say this is to show you that identifying molbits is totally a non-issue. Our training accuracy is like 99.9-something, validation 97.7. 
It's kind of to the point where it's not really an issue for us to be able to identify the molbits. However, identifying what the tag is based on the molbits can be a little bit more challenging. Now that we have our sandbox, we've got our molecular bits and we have a way to read them. The question is, what should we encode and how should we encode it? We have our 96 bits and we have kind of our framework where we may want to expand our digital tag, but we have to think about how we might want to do that. The most naive encoding scheme, of course, is mapping one digital bit to one molecular bit. And this is, of course, not ideal because one bit error means that you get the entire tag wrong. And so we want to add some error correction, as alluded to previously. What we'll do is we'll reserve some bits for the tag and then use the rest of them to correct errors. And in our case, we picked a 32-bit message and then multiplied it by a 32-by-96 random generator matrix to produce a codeword, which then gets put in the molecular tag. And this allows us to get up to 18 bits wrong, which is an enormous amount, without ever worrying about whether we're going to get the tag wrong. This is something that can also be chosen depending on the application. It's just an example that we have for, basically, our paper that we wrote. But if you wanted to take these 96 bits and use them in another way, there's nothing that would stop anybody from doing that. So just to briefly walk through encoding an actual message here, because it's a little bit hard to connect the two. We'll start out with our digital tag here. We had to encode our Molecular Information Systems Lab, of course. And then we added our bits for the codeword. Then we'll go through the process of actually encoding and sequencing and doing all the wet lab stuff. And then what we'll get back out of the sequencing machine is a set of read counts. 
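To make the generator-matrix step above a bit more concrete, here is a small Python sketch of encoding with a random binary matrix and decoding by nearest codeword. It is a toy under stated assumptions, not the decoder from the paper: the sizes are shrunk to an 8-bit message and a 24-bit codeword so that brute-force search over all messages is feasible, and the seed, the example message, and the redraw loop are arbitrary choices.

```python
import random
from itertools import product


def encode(message, G):
    """Multiply a k-bit message by a k x n generator matrix over GF(2)."""
    n = len(G[0])
    return [sum(m * row[j] for m, row in zip(message, G)) % 2 for j in range(n)]


def hamming(a, b):
    """Number of positions where two bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(G):
    """Minimum weight over nonzero codewords, which is the code's minimum distance."""
    k = len(G)
    return min(sum(encode(m, G)) for m in product([0, 1], repeat=k) if any(m))


def decode(received, G):
    """Brute-force nearest-codeword decoding (only feasible for small k)."""
    k = len(G)
    best = min(product([0, 1], repeat=k),
               key=lambda m: hamming(encode(m, G), received))
    return list(best)


K, N = 8, 24  # toy sizes; the talk uses a 32-bit message and 96-bit codeword
rng = random.Random(1)
while True:
    # Redraw in the rare case the random matrix gives a very weak code.
    G = [[rng.randint(0, 1) for _ in range(N)] for _ in range(K)]
    d = min_distance(G)
    if d >= 3:
        break

message = [1, 0, 1, 1, 0, 0, 1, 0]
codeword = encode(message, G)

# Flip (d - 1) // 2 bits: nearest-codeword decoding is guaranteed unique up to here.
t = (d - 1) // 2
noisy = codeword[:]
for i in range(t):
    noisy[i] ^= 1
decoded = decode(noisy, G)
```

A real 32-bit message has far too many candidates for this exhaustive search, so treat the decoder here as a conceptual illustration only.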
So how many times did we observe that particular molecular bit? And then we have to decide at what point we want to set it to one and at what point we want to set it to zero. So we have a threshold here where anything that is above this line we'll set to one, and anything below it is zero. And then we get a few incorrect bits. Was this okay? We have fewer than our 18, so we'll still decode correctly. But we also realized that there's kind of a large variation in read counts here. And we found that this was reproducible. We still cannot figure out why. I'm more than happy to talk to people about why this might be happening because we've tried a million things. But it's reproducible. So we were able to rescale the counts. And essentially what happens is that we get no bit errors after doing that. And we recover our final decoded message here. So for our final results, we are talking about how long it takes to actually decode a message. Do you need to run this for hours, overnight? Seconds, minutes? And so what we have here is, on the x-axis, our sequencing runtime in seconds on a log scale. And then we've got our codeword distance over here. So you can imagine that in reality, you could possibly get all 96 bits wrong. However, because we have used error-correcting codes, the minimum distance between all of them is 18 bits. So if you get 19 bits wrong, in some cases, you'll actually get a totally different codeword. So in this context, with error correction, the maximum number of bits wrong can be anywhere from like 18 to 20-ish. What we have here is, as expected, as your sequencing runtime increases, the chance of you getting the message wrong goes down because you've observed enough reads to be really confident. Each x here is an incorrect decoding, and then our dashed line is guaranteed correct decoding with error correction. 
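Going back to the thresholding and rescaling of read counts described above, here is a minimal Python sketch of the idea. All numbers are invented for illustration and are not results from the talk, and the per-bit `expected` counts are a hypothetical calibration; the talk does not say how the rescaling factors were actually obtained.

```python
def threshold(counts, cutoff):
    """Call a bit 1 if its (possibly rescaled) read count is above the cutoff."""
    return [1 if c > cutoff else 0 for c in counts]


def rescale(counts, expected):
    """Divide each bit's count by the count typically seen when that bit is present."""
    return [c / e for c, e in zip(counts, expected)]


# Invented example: bit 2 is reproducibly under-read and bit 4 is over-read.
true_bits = [1, 0, 1, 1, 0, 1]
expected = [100, 100, 20, 100, 400, 100]  # hypothetical calibration counts
raw_counts = [95, 3, 18, 110, 40, 90]

raw_bits = threshold(raw_counts, cutoff=30)
scaled_bits = threshold(rescale(raw_counts, expected), cutoff=0.5)

raw_errors = sum(a != b for a, b in zip(raw_bits, true_bits))
scaled_errors = sum(a != b for a, b in zip(scaled_bits, true_bits))
```

With a single cutoff on raw counts, the under-read and over-read bits land on the wrong side of the line; after dividing by the per-bit expectation, the same cutoff separates them cleanly in this toy example.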
And we are able to decode with only about 10 seconds of data, which is really nice because with nanopore sequencing you can just stop at any point. You don't have to keep running this for hours and hours because the process doesn't require you to do that. It's just physical strands flowing through this membrane. And what this means is that we can do pretty close to real-time reading. Another cool thing is that we've made the molecular tags shelf-stable. This is really important for basically any kind of application you can think of where you want to tag something for more than a few minutes. You want to be able to ship or store your object, and then afterwards we can rehydrate it and read it. And that has been a crucial part of this. So we'll prepare this tag for sequencing immediately after assembling the tag. This is a step that takes about one to two hours. So we are front-loading all the lab work on the writing side, meaning when you go to read it, you hardly have to spend any time at all. It's like seconds. We also sent a tag in the mail, just through regular mail like USPS, to California, and we could recover everything after about four weeks. I'm not sure what the upper bound is on how long these things will last, but we basically got the same sample back that we sent. So I think there's a lot of work to be done in figuring out what the bounds are, what kinds of surfaces these things can be attached to, and how long they will last, but initial results are kind of promising for these being actually stable. And then because we're attaching the sequencing adapter ahead of time, the part that's actually doing the unwinding of the DNA, you can't get any contamination. It's just not possible for that to happen after that process is completed because you have to have that adapter in order for it to be read. 
And so putting it on a surface means that you could have some, like, environmental DNA or anything and it will still mostly read out fine. Another big question I get is, okay, cool, but what can we actually use this for? And we haven't really explored the extent of what kinds of surfaces or anything that this could be used on, but we can think of things that are traditionally difficult to tag with QR codes and RFID codes and such. This might include things like liquids, maybe food, with some safety testing (I'm making no claims about it actually being safe for food without testing), paper, and commodities. So maybe you have a supply chain where you want to tag everything and then only read back a few of them. You can really amortize the cost because reading is going to be the highest cost in this process. And there's some prior work that demonstrates some DNA encapsulation methods. These methods add some time and effort to the process, but they could be used to further extend the life past what this would normally live on its own for. So in summary, we have our molecular tagging system that uses DNA to tag physical objects. The design uses an evolutionary model for nanopore-orthogonal sequences, making them look visually as different as possible and essentially trying to make our classification problem easier. And then we classify using a CNN. We encode and decode using our random generator matrix, but again, the system is agnostic to any type of encoding that you want to use. And we can get a readout with less than 10 seconds of data. Future work might be using a generative model to design these instead of an evolutionary model that's guess-and-check, and also considering different kinds of encoding and decoding methods that might be able to take advantage of different parts of the known error in the system, which we haven't really done at all. 
Another thing to consider, especially because this is a security-minded conference, is what kind of security this actually provides. And one nice thing is that DNA is invisible. However, security by obfuscation is not always the right kind of security. It can be good in some circumstances, but you're never going to fool anybody who is really dedicated to getting access to this just by making it invisible. And so there are applications where this would be useful and some where maybe not. And I'm really curious if anybody has thoughts, or if maybe this sparked some kind of interesting application that we haven't thought of yet. I also just wanted to give a really quick plug for the other things that our lab is doing because I think it's really cool and I only work on a very small segment of this. Our lab works primarily on something called DNA data storage, where we're trying to store information in DNA and read it back later, not in terms of presence or absence, but actually encoding information in the bases of the DNA. And this has been pretty far developed, but the writing aspect is still very expensive. The cheapest I've ever seen DNA for is like seven cents per base, which is astronomical when you consider large volumes of DNA. And that's been a pretty large barrier, but it's part of what I think is a really cool idea. Then DNA security: this is a part of our lab called cyber-bio security. One paper that came out recently was talking about the GenBank system and some of the security implications. They also have microfluidic automation, so how do you abstract away some of the tedious parts of working in the lab, and then DNA circuits. And of course, the nanopore sensing is what I personally work on. And with that, I will wrap up, and I'm happy to take any questions in the live Q&A after the session. Thank you.