Hello, hello, hello. Okay, this is the marathon Swift use case session. Seriously. Man, we're doing tons of use cases. Thank you for joining us at the last session of the day. And for those of you who've been sticking with us, we've seen Ancestry, we've seen HPIT, we've seen DreamWorks, Fred Hutchinson Cancer Research Center, and now we've saved the best for last. Brandon. So much pressure. No pressure or anything. Brandon from HudsonAlpha, who's a systems engineer there. And what we're going to do is talk through how they're using object storage for next-generation sequencing. And I think we've said it before, we have a book on this; I'll tell you where to get that in a minute. My name is Joe Arnold. Brandon, do you want to do any other intro for yourself? Brandon Cruz. All right, that's all I've got. OK, so check this out. The cost of generating a genome sequence is decreasing faster than the rate of Moore's law. It's crazy. It is. Well, it's a blessing and it's a curse. So suddenly you have all these scientists running around able to generate just a ton more data. And they're super happy. Yeah, they're thrilled to get this stuff done. But the problem is with this guy. You have these IT folks in these research institutes. And you call them up and you say, hey, by the way, my Wi-Fi doesn't work. And oh, I need a petabyte of data storage. And that's kind of what they're faced with. And that's super hard. And as a result, how to go about storing this data in these research institutes is being rethought. How do you do it with a small number of people, at a low cost, while sustaining the ingest rates coming off these next-generation sequencers? Well, because beforehand, Joe, people really didn't care about the cost of storage, because you didn't really have that much storage, right? So going with some of the big players really made a lot of sense. But then as this curve just falls off the face of the earth, finance comes back and says, OK, I know you need to store five times as much data as last year, and probably five times as much the year after that. But we're going to give you a 10% increase in budget. Massage it around, figure out what you can do, right? Big problem. All right, so I'm from SwiftStack, and we work on OpenStack Swift. We build a product that makes it easy to deploy. New slide for everything here. We're a leading contributor. What we've done is we've made it really easy to deploy OpenStack Swift and operate it, made it easy to scale, and done a bunch of the integrations that folks need in order to get it up and running in production. So that's what we do as a business besides work on the open source project. And we have the book, which goes into the detail on, thank you, Brandon, for all the modeling there, which goes through the detail of this use case: what are the really specific storage characteristics that are needed to handle this workload. We also have another resource on OpenStack Swift that's published by O'Reilly. So go check those out, and we have some available at our booth as well. Yeah, so HudsonAlpha is a research institute based in Huntsville, Alabama, of all places, which is kind of strange. And in the middle of nowhere, in this big research park, we have this super awesome, modern building where we're bringing in researchers to basically cure cancer. And like most institutes, just like Fred Hutch and what Dirk was mentioning, the demand for researchers is huge, right?
We're trying to make all these big strides; genomic sequencing has become really cheap, or, you know, cheap-ish. And so all these researchers are sequencing tons of data and people are really excited. Like HudsonAlpha just recently, and we'll get to this a little bit, just recently got a pair of HiSeq X Tens from Illumina. These are basically really fancy machines that cost 10 million bucks. And they sequence human genomes really, really quickly, and they're the most cost-effective whole human genome sequencing to date. The problem with that, of course, is that it's spitting off a ton of data, right? And what happens after that is really up in the air right now, and a lot of institutes are dealing with this. And so HudsonAlpha, what they want to do is, you know, put together these really incredible programs, do all this neat work for the NIH, and the IT is kind of secondary, right? And that's the case in most institutes. But the problem is you're having to deal with the petabyte-level scale of doing this, which we've never had to deal with before. And you have, you know, mid-level kind of IT people that are trying to put together petabyte-level solutions. And that's not easy, it's not simple, it's not straightforward. And to top it off, we don't have any money, right? For IT. So that makes it even more difficult. So that's just kind of the 30,000-foot view of HudsonAlpha. Yeah, and so this is a pretty interesting slide, mapping exactly what we do to what the IT component is, right? So fresh off the sequencer, and we'll get to this workflow in a little bit, and I'll try to spice it up, but fresh off the sequencer, you go directly to storage. So there's this need for an on-premises first tier of storage, which needs to be fast and, of course, cheap. And I can't hire anyone else to run it, right? So all those things compacted into one. But then we have an HPC group that's manipulating this data, and just like Dirk said, they're used to NFS, and everybody is. It's like, we're used to just having a share. We can manipulate the data on the share, everybody's happy. But object storage really brings a fresh approach to that, not only from the archive standpoint, but from the actual implementation and use of these sequencers, which has been pretty interesting. And then ultimately we're trying to derive and get to something that's useful for the expert to be able to play with, because that was another huge problem: we were trying to teach biologists how to use object storage, and they weren't very interested, to say the least. So we kind of had to abstract that away a little bit, and it ended up being much more straightforward than we thought. So it's been an interesting journey so far, and we'll dive into what that looks like in terms of data. So, okay, go ahead, Joe. Can I do this one? Yeah, please, please. So when I heard this, I was just fascinated, right? And what's neat, as a technologist, is to be a part of something that's bigger, that's happening around you. And one of the things they were talking about to us, when we were going through some of these use cases, was just one program that they're doing. And I'm just going to pick one, there's a bunch of things that they're doing, but this one is about developmental delay in children. So learning disabilities. And one of the things that they needed to do was be able to sequence all the genes of all these children.
And in order to get enough sample size, you had to have 500 kids. But it's not enough just to have the kids, you have to have the parents as well. And so that's four years, 500 families, 1,500 genomes, and that's half a petabyte of data. And this project would just be unthinkable a few years ago. I mean, you couldn't even do it. And it's the equipment that's now available to do the genome sequencing that makes it even possible to sequence this many families' genomes. Now, this is just one research project. Now think about when we're entering into things like personalized medicine, when you go to the doctor and you have a tumor, you can get that tumor's genome sequenced, and then one year later, three years later, 10 years later, all that data is still stored so you can track the progress of that tumor as it's progressing or not progressing. It just unlocks new therapies, new clinical treatments for all sorts of patients. It's just a really neat tool to have at your disposal. But you look at the change in the equipment and how much data storage it takes. We're going from two terabytes a year to a petabyte and a half a year for a HiSeq X Ten cluster. It's pretty dramatic. So everything has to be upgraded. I mean, everything. When you're handling two terabytes a year, it's really simple and you can get away with a lot of really bad practices. But when you get to anything over a petabyte, really, it becomes much more difficult. All of a sudden your super awesome RAID system doesn't work so well, and all of a sudden your super awesome tape backup that takes 30 years to restore doesn't work so well. And then all of a sudden, really, a lot of the off-site and cloud options don't work too well, because they're too expensive. And I'll give you an example from HudsonAlpha. They said, we're just going to use S3. I'm like, cool, that sounds great. I love S3, right? I love object storage. This will be fantastic. We do a little calculator thing and we're all excited about it, and then it pops out this number of $3.3 million a year. Well, our budget's only like 500K. I'm like, well, that's not going to work. And then of course, that's just for this year, right? Obviously, we don't throw away all the data at the end of the year. It continues to grow. And even if it just continues to grow at this pace, it's a massive amount of data. And we're talking about 18 petabytes in just a few years that we're going to have to handle at this relatively small research facility, and especially with a relatively small IT team made up of just a couple of people. So it's created a really unique problem that Swift and SwiftStack have been able to come in and really help us solve. We really exhaustively looked through some of the other options, which we'll get to in just a second. They say small, but there are only a few folks that actually have this HiSeq X Ten. Sorry, I had to take a selfie in front of it. Yeah, and I will say that we are one of the only nonprofit groups that have these X Tens, but we're also different in that the bulk of our sequencing actually isn't just for HudsonAlpha internally. We actually have what's called the GSL, the Genomic Services Laboratory. That's what they do. They do sequencing as a service, right? So it's kind of SaaS, but it's a little bit different, right? Yeah, it's SaaS for genome sequencing. But check this out. So here, these are called flow cells, and they'll take samples and they process them somehow, which I don't know.
And they'll put them on effectively a flatbed scanner, if you want to think about it that way. It is. Then if you look inside here, so here's a picture of the sequencer, and right here is a chemical delivery mechanism, which will introduce chemicals in different phases of the process, and it will effectively scan on that scanner to get pictures. And what can you explain with that picture? Six billion. Well, so what it's doing is running chemicals over top of a particular coordinate. It's trying to say, in this coordinate, is it an A, G, T, or C, right? So it's just like binary with a couple more characters; your whole human genome is made up of those four. And so the unique thing about the HiSeq X Tens is not only the data they're throwing off, but in terms of the biology, it's the first time we've ever had this kind of speed and turnaround. So instead of taking seven days, it only takes three. And we can do a whole human genome instead of just parts of it. So now it unlocks a whole myriad of different possibilities of what we can research, even things that we don't understand now. Getting to the end of the cycle, which we'll also talk about, the variant calling, it's saying, okay, here's the reference genome and here's Billy's genome, right? And Billy's genome is probably boring in a lot of ways. It's probably very similar. But he has some variants. Some of the variants, we have no idea what they mean. Maybe in 10 years we'll know. Maybe in five years we'll know. Maybe never. They might not do anything. But there are some things that we've already been able to discover, and it's really incredible that this even happens: that someone can actually take a sample of your blood, run it through this crazy machine, and then on the other side say, oh, you have an 80% chance of getting breast cancer in the next 10 years. So seeing this progress, and seeing the cost of genomic sequencing go down to where it's readily available for every American, and really the world, is extremely exciting. The NIH has done some case studies on just the US, but just thinking about the data related to sequencing everyone's genome across the world is staggering. It's amazing. Yeah, so this machine will run roughly every three days to process something. And so every three days you have 13 terabytes, about 20 million objects, of those scanned flow cells. That will later get consolidated down to about eight terabytes, into a single file. Then from there, there'll be about two terabytes from aligning and compressing it, which is another file format. And then again, it's not useful if it's just sitting locally. It needs to be distributed out to the researchers. So you'll have about four terabytes which will need to be pushed out to the edge to distribute to the researchers every three days. That's also an interesting piece, right? We actually have to be a CDN too, so we actually have to distribute that data for collaborators and customers to be able to download and access instead of just having it locally, which was a big challenge. Yeah, so those are roughly the four steps. You have the sequencer that operates, that data gets fed into the storage system, then you have the consolidation of that, and that's called a FASTQ file. Assembly and alignment converts it into a BAM file, and then you'll do some variant calling and processing later on.
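To make that per-run flow concrete, here is a rough sketch of the volumes just described and how they add up over a year. The stage labels are illustrative; only the approximate sizes and the three-day cadence come from the talk.

```python
# Rough, illustrative sketch of one sequencer run moving through the pipeline,
# plus the arithmetic that connects a single run to the annual total.
RUN_INTERVAL_DAYS = 3
RUN_STAGES = [
    # (stage label, approx. terabytes per run)
    ("raw BCL/images off the sequencer (~20M small objects)", 13.0),
    ("consolidated FASTQ (one big file)",                      8.0),
    ("aligned, compressed BAM",                                2.0),
    ("pushed out to the edge for researchers",                 4.0),
]

runs_per_year = 365 / RUN_INTERVAL_DAYS
raw_per_year_tb = RUN_STAGES[0][1] * runs_per_year

for stage, size_tb in RUN_STAGES:
    print(f"{stage}: ~{size_tb:g} TB per run")
print(f"~{runs_per_year:.0f} runs/year -> ~{raw_per_year_tb / 1000:.1f} PB/year of raw data")
```

Run at that cadence, the 13 TB of raw output per run works out to roughly a petabyte and a half a year, which matches the growth figure quoted above.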
So those are roughly the steps as the NGS, next-generation sequencing, side interacts with the HPC cluster. That's good, I like that slide. You like it? That's a good one, thanks. That's nice. Oh, the growth curve. So brutal. You already mentioned deleting's not an option, right? So the must-haves are: you have to maintain current staffing levels, right? You can't grow additional people on the staff. Geographic distribution, because it has to get to the researchers. You have to have high availability, right? And commodity hardware, no vendor lock-in; you want to be as close to the metal as possible from a CapEx perspective. Anything else you want to add to this? No, I mean, in the next slide we'll talk about it a little bit more, but we had this huge set of requirements that we put on SwiftStack, like, you guys need to be all these things, and if you're not any of them, then we have to find something else. And these are the main bullet points. We'll dive into a couple of the other ones, but geographic redundancy and distribution was huge for us, because we have data that we have to retain, by law, let's say, for any kind of clinical trials. And then say we have something happen at the main site. It's much more likely that human error will happen than a drive failing. Say the data center that we have three copies in catches on fire. Well, it's really cute that you had three copies, right? But it doesn't matter, because it's all up in flames. So we need to be able to have geographic redundancy, but we also need to be able to have these CDNs. These researchers are pulling down 100-plus-gig files that used to actually be served out of HudsonAlpha's primary site. So we have this little office. I mean, it's a really nice office space, but we have our data center in the basement, and literally all the researchers, roughly 200, 300 people, couldn't get on the internet during the day because a customer was downloading a file, right? So we can't have that happen; that's not too good. So being able to put it much nearer to where the internet actually lives, which is not in Huntsville, surprisingly, it's actually closer to Atlanta, really gives us a strategic advantage, because we're able to provide bandwidth and throughput for a much cheaper cost than what it would cost in the last mile locally. So putting together all these architecture pieces was really instrumental in driving the cost down as low as possible for us, and in passing that on to our customers as well. I'm just going to touch on one thing on this, the data avalanche, if you will, and that's turnaround time. Turnaround time is really important. If you think about it, if you're a service provider, you're effectively providing lab results for somebody who's waiting on treatment, so turnaround time matters. So you want to shave hours off of this and turn them into minutes at any opportunity that you have, and you've found some opportunities to shave some of that time down with the solution that you've engineered. So that's been pretty important. Just one of the things around metadata: there are so many files and so many different ways to represent the data that being able to tag it with different bits of metadata, and then go back and later search on that, is also a really interesting thing. Just for all you Swift people, I really compare this to what people are doing with NoSQL databases.
NoSQL databases are super sexy. Everyone has one, like, oh yeah, I've got this Mongo database and it's awesome. And Dirk and I were talking about this earlier. And it's what I call a write-only database, where they're like, oh, we put all our genomes in there. We put all our X, Y, and Z data in there. And I'm like, oh, what kind of MapReduce do you guys have? How do you guys do sharding and clustering, all this kind of cool stuff, and replication? And they're like, ah, we just write data to it and then whenever we need it, we can get it back. And I'm like, oh, okay, that's cool. But it's really not very useful, right? It's not really what the tool is designed for. But we're using MongoDB, so therefore we're awesome. So people really don't use it for what it's designed for, and I see the same thing in object storage. A lot of people use object storage as just a file system when really it can be so much more. It can almost be a file system and a database. So one of the things we're working on is, directly from the sequencer, we have these BCL files, which are then pulled off the object storage, and then the BCL-to-FASTQ process is run; it just needs all the files and it dumps out one big file. And then it gets put back, or maybe it's TTL'd and it's deleted eventually. But then we're thinking, man, why do we have a whole separate cluster to do that? Why don't we just do it on the Swift nodes themselves? We already have this middleware, we already know what's going to happen. We already know the workload associated with it, and it's extremely linear and predictable. So why do we need to build a whole separate infrastructure to do the same thing? And now we're realizing, holy crap, right? Swift and SwiftStack can do some really incredible things, so not only are we coming up with something to replace a scale-out NAS, but we're coming up with something to replace the tools that we use every day to deliver products to customers, which is something that not a lot of people end up exploring, which is very exciting. And middleware is something that I'll touch on again. I've got a lot of things. Oh yeah, yeah. And you're going to be buying hardware incrementally over time, so being able to fold that into a single environment is really important. Very. Okay, cost. All right, so this is an interesting one. I put this quote up at the top, and it's actually from Peyton, the IT director at HudsonAlpha, and he thinks it's really funny. He says, we're an enterprise, right? We're like, we're legit, so we need enterprise drives. We don't know what they are, they're more expensive, we're not sure why, but we need them because we're enterprise. And that's kind of ridiculous. And here you can see it's $67 a terabyte. It's pretty expensive for what's available now. So we also looked at S3, obviously, as an option, as I'm sure a lot of you have as well, being in the object storage space. It's awesome, works really well, way too expensive. Just, God, way too expensive. The other thing is that we have collaboration, right? So this is what it costs for one petabyte per year with our nonprofit discount. And this assumes we'd only download it once a year, which is crazy. If we download it once a month, that cost more than doubles, to $1.2 million a year to store a petabyte, if I remember correctly; this might be a little bit off, but the whole point is Amazon shifts a lot of that cost onto the bandwidth. So you can't only look at the cost of storing with S3; the cost of actually using your data is expensive, right?
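To make that bandwidth point concrete, here is a back-of-the-envelope sketch. The per-gigabyte rates below are assumed round numbers for illustration, not HudsonAlpha's actual quote or discount.

```python
# Illustrative only: how egress can dominate the bill once data is actually used.
PB_IN_GB = 1_000_000        # 1 PB expressed in GB (decimal)

storage_rate = 0.03         # $/GB-month, assumed round number
egress_rate = 0.09          # $/GB transferred out, assumed round number

storage_per_year = PB_IN_GB * storage_rate * 12   # keep 1 PB for a year
one_download = PB_IN_GB * egress_rate             # pull the whole petabyte once
monthly_downloads = one_download * 12              # pull it every month

print(f"storage only:          ${storage_per_year:,.0f}/yr")
print(f"+ one full download:   ${storage_per_year + one_download:,.0f}/yr")
print(f"+ monthly downloads:   ${storage_per_year + monthly_downloads:,.0f}/yr")
```

With these assumed rates, downloading the petabyte monthly roughly quadruples the egress portion of the bill and more than doubles the total, which is the shape of the problem described above.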
If you use S3 or Glacier just to put your data there and never touch it again, and no one will ever need it and you can just forget that it's there, then it's awesome, right? But if you actually need to use the data, it's way too expensive. It just doesn't make sense. Especially for our workload, we see 18 petabytes coming. In terms of cost of storage, we'd quickly exceed the total revenue of the institute in like two years, right? So that wasn't an option. So he said, you know what? Why is there this stuff on Amazon that's like $37 a terabyte, right? Then I go to EMC and I'm like, I want a petabyte, and they say, okay, that's cool, no problem. Give me your firstborn son and a couple million bucks and then, you know, whatever else. And I said, okay, that's not an option. I said, how come there's this huge discrepancy with what I can get on Amazon as a consumer? And Dirk was even mentioning this with the Best Buy drives, right? Why can I go to Best Buy and get a six-terabyte drive for X number of dollars? It's way cheaper than what the IT department charges me. And so this was a huge issue for us. We really had to rack our brains and get back to the basics. Why are the drives on Amazon, or Seagate drives, so much cheaper than the enterprise storage tier, right? So we said, you know what? We're just going to buy four petabytes worth of drives and then see what happens. But if you look at the four-petabyte raw equivalency versus something like a hosted object storage, it's remarkable. Our cost, compared to EMC, is less than just the maintenance contract for that infrastructure, which is pretty amazing. So just to be clear, you didn't actually buy your drives from Amazon? No, but we looked at it. We went to add 504 to the cart and it just said it couldn't do it. Said limit eight per customer, so we would have had to make a whole bunch of fake accounts. No, we just went directly to Seagate and they hooked us up. And so this is something that I kind of want to touch on, now getting into the rack, right? So that four-petabyte rack gets dropped off and we're like, this is so cool. But now what do we do? You know what I mean? We're familiar with RAID, but we're not going to use that. And we're familiar with these other, you know, operating systems that offer storage, but we're not really quite sure of their reliability and scalability. So how can we take advantage of this, now that we've beaten down the cost of raw drives as low as we possibly can, so we're as close to commodity as we can possibly be? How do we just make an incremental change on top of that, and not have to hire anybody, and all these other things that I mentioned before? And the solution was ultimately SwiftStack, right? So one of the top things we talked about was durability. We have SLAs with customers. Losing data is just not a possibility for us. We cannot do it. It is very expensive for us to lose a human genome and have to re-sequence it. Or if you sequence Barack Obama's genome and the IT director comes in and you're like, sorry, you know what I mean? Like, I thought I set it up for three replicas but I actually set it up for one. So having SwiftStack in place to help us out with that was really instrumental. Availability was the other one. We have a lot of customer needs around being able to access the data at any given time after we've sequenced for them. So keeping it all on-site really wasn't an option.
And I didn't put it on the slide, but cost is something that I keep beating into everyone's heads, and it's so true. We needed to stay as close to commodity as possible. That meant not hiring any new people, and being able to manage petabytes worth of data with, just like Dirk was saying, right around a 0.25 full-time equivalent, basically a part-time person spending 10 hours a week to manage petabytes worth of data, which is crazy, right? That didn't even exist five years ago; it would have been impossible. So it's amazing to see the storage architecture and the storage technology mimicking what's happening in the genomic space, right? Because the cost of genomic sequencing is falling off a cliff, but drives really aren't getting much cheaper, you know? And everyone talks about, oh, well, there are going to be some big things coming, someone's going to come up with a 50-terabyte drive and whatever. And we said, that's fine. SwiftStack's still relevant, right? We still need something to manage the infrastructure, to do our chargebacks, so we don't have to worry about it or hire anyone to manage petabytes worth of data. So regardless of how the medium changes, SwiftStack will still be relevant for what we're doing. So, yeah, durability, availability, speed, of course. Writing chunks to individual drives across thousands of drives, thousands of really, really slow drives, or what people call consumer drives, works really well, surprisingly, when you have an infrastructure that takes advantage of that and abstracts it away from the user. It's very, very quick. And Dirk had some really good examples of what that actually looks like in terms of throughput. And, you know, his throughput was maxed out by the network and not by the actual drives themselves. So it really works well for our type of workload. And we see this happening in a few other industries as well, like video processing. Okay, on to the next one. All right, so, why SwiftStack? I already mentioned a few of the reasons. But management was really important to us. We wanted to be able to call up Seagate and say, hey, come drop off a four-petabyte rack. And then we're just going to plug it in. We want it to just work. We want to do one shell command and get everything super automated and super sexy, right? Because that's what we're like. We don't want to have all this workload around scaling, because we know we're going to have to do a lot of it in the future. So SwiftStack really abstracted out, and I use that term a lot, but it's very true, abstracted out the difficulties associated with just provisioning raw servers and raw drives. And so all of a sudden, we're able to take this extremely cheap commodity cost of drives and make it something useful for all of our researchers. So nothing changes. We can add four petabytes, and to the user, everything is completely hidden. They know that whenever they go to save something and they get the 200 OK back, if they're using HTTP, everything is cool. So that was really important to us in terms of being able to get everyone else to buy into the idea of object storage, which wasn't the easiest at first, but it's really come along well. All right, architecture. So this is kind of a cool piece. This one's related to the sequencers themselves.
So the sequencers spit off a ton of files, as we just mentioned, but the way that Illumina does it now is they have this thing called the Run Copy Service, which just says, okay, set up a Windows mount, and then we're just going to write to that Windows mount. And oh, by the way, the directory structure we put it in has to be exactly perfect, or you could lose tens of thousands of dollars worth of reagents, those little flow cells that you saw in the picture, right? So that was kind of important when it got pushed onto IT to do. So we needed to come up with a solution. We actually ended up using SwiftStack's filesystem gateway, which works really well. I'm hoping that eventually we'll be able to go to native object storage, just because having quorum and all that kind of stuff on the writes is really nice. But for now, it's a very simple drop-in-place solution, where we didn't really have to disturb any piece of what was going on in order to upgrade a very small storage infrastructure to petabytes. So really, without doing anything but telling the departments what we were doing, we were able to put that in place, which was a really important feature, because if there hadn't been the filesystem gateway, just native Swift, we wouldn't have been able to do it, because the origin of the data doesn't speak Swift. You know what I mean? There would have to be something else running on the Illumina sequencers, and that's a big no-no. We can't put third-party software on there. We especially can't distribute third-party software on those machines. So it was really important to have that in place. The other piece is that once the data is actually in Swift, then everything becomes really cool, right? Then we can distribute to customers, then we can do the replication, we can do erasure coding. It all gets super sexy. But getting it there was kind of a challenge, and that's where we eventually said, if we're just getting and receiving all this stuff from Swift, why don't we just start doing this stuff on the nodes? So in terms of our overall architecture, that's what we're trying to move to, and I think you'll see a lot more object-related applications that start using the object store much more as a database, with metadata, searchable. One of the things we even talked about was being able to say, okay, what if we could take all these BCL files, or what if we could even take FASTQs, the raw reads in FASTQs that each have an A, G, T, or C and a quality score and all that kind of stuff. What if we could just make a meta-object, right? Is that what it's called? Does that sound right? Sounds right, go for it. It sounds good. If it's not, that's what it should be called. So what if we just had a meta-object, which is just an object with a list of a whole bunch of little tiny objects in it. So whenever you go to get that object, Swift on the back end goes and says, okay, we're going to compile all these objects and serve them up as one big thing. And for us that works really well, because we have tens of millions of small files every three days that we need to be able to ultimately present as one big file for the whole run. Which is really important. It's called a manifest. Manifest, it's not as cool. But we'll roll with it, it's still pretty cool, yeah.
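That manifest mechanism exists in Swift today as large-object support. Here is a minimal sketch using a dynamic large object, where an empty manifest object points at a prefix full of small segment objects; the endpoint, credentials, container, and run names are made-up examples, and it assumes python-swiftclient against a v1 auth endpoint.

```python
# Minimal sketch of the "meta-object" idea with a Swift dynamic large object
# manifest: many tiny segment objects under one prefix, plus a zero-byte
# manifest object whose X-Object-Manifest header points at that prefix.
# GETting the manifest streams every segment back as one concatenated file.
from swiftclient.client import Connection

conn = Connection(authurl="https://swift.example.org/auth/v1.0",
                  user="gsl:uploader", key="secret")

# In practice this would be millions of per-tile BCL objects per run, not three.
for i in range(3):
    conn.put_object("ngs-segments", f"run-042/{i:08d}.bcl",
                    contents=b"...BCL bytes for one tile...")

# The manifest: downloading ngs-runs/run-042.bcl concatenates everything
# under ngs-segments/run-042/ in name order.
conn.put_object("ngs-runs", "run-042.bcl", contents=b"",
                headers={"X-Object-Manifest": "ngs-segments/run-042/"})
```

Static large objects work the same way conceptually, except the manifest body is an explicit JSON list of segments rather than a name prefix.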
The next one? Yes. Okay, so these are some of the practical applications we've done, which I think are interesting, and some of the things we're moving toward. And I've already touched on a couple of these, but the first one I'll just skip down to is the off-site replica. Being able to say, I need this data accessible by customers. We've looked at some of this stuff with Spotify, for example, where they're saying, okay, we need all this archiving, but then we need to have this customer-facing CDN with terrible but really cheap bandwidth, to get this massive amount of throughput. And in the research world, we have this thing called Internet2, which is a government-subsidized internet backbone. It's a 100-gig backbone. Is that right, 100 gig? I think it's 100 gig. 100 gig. So it's a meta-object and 100 gig. If I'm wrong, don't say anything. Or a manifest object. So we have this big object, and we have it stored in the data center, and we need to be able to push it out to the customer. Being able to have access to that really cheap bandwidth not only fulfilled our requirements for having this insane amount of durability and availability, by having multiple data centers off-site away from us, but it also lets the customers hit it from there. Our relative cost of bandwidth there is 43 cents a meg, compared to probably $2 a meg at the home base. And the home base bandwidth is kind of important, because people are actually using it for services. So offloading that was really cost effective. And another thing that I was mentioning, if you were here for the last talk, I'm just kind of regurgitating on this one too, but chargebacks are huge at institutes. And the reason is that IT is kind of just this overhead cost allocation for a lot of the institutes, where they just say, it just costs this much to store. That's just what it costs. And you have a lot of people that are working on a lot of projects, some equally, some not. Some use a lot of storage and some don't use any, but they're all charged similarly, prorated to the number of heads in the building. So this has been a huge benefit for us, which we didn't have on any other storage tier, to be able to say, okay, now we're going to set up a caching layer, a long-term layer, a CDN layer, and we're going to be able to charge those back to departments. And it's also very good for the departments, because they're actually able to see how profitable they really are. Whereas before they kind of say, okay, here's our chunk of the IT budget that we just get assessed, and so our profitability is sort of relative. So it's been interesting to see. The temp URL is another kind of middleware that Swift has that's very, very interesting, where we can essentially distribute a URL to a customer, and then the URL can go away, and it can be access controlled. Beforehand we had this FTP thing where we did htaccess, I'm so sorry, I'm so sorry, we did htaccess for usernames and passwords. Someone would go in and access it. That person has a password to the server that's not known by anyone else, and all this kind of terrible IT infrastructure in place. So now we have the ability to essentially control access very seamlessly as well, for all the objects, for all our different customers. So it's very interesting. Yeah, the temp URL is kind of like when you're sharing a file with a Dropbox link, for example; it's very analogous to that, and you can set a time threshold for when it's going to expire. So it's pretty cool. Very cool. Great access.
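For reference, a minimal sketch of how a Swift temp URL gets generated. It assumes the tempurl middleware is enabled and a temp URL key has been set on the account; the host, account, path, and key below are made-up examples.

```python
# Minimal sketch of generating a time-limited Swift temp URL, the mechanism
# described above: an HMAC-SHA1 signature over "METHOD\nexpires\npath".
import hmac
from hashlib import sha1
from time import time

key = b"account-temp-url-key"            # set via X-Account-Meta-Temp-URL-Key
method = "GET"
expires = int(time()) + 7 * 24 * 3600    # link stops working after a week
path = "/v1/AUTH_gsl/deliveries/customer42/run-042.bam"

sig = hmac.new(key, f"{method}\n{expires}\n{path}".encode(), sha1).hexdigest()
url = (f"https://swift-edge.example.org{path}"
       f"?temp_url_sig={sig}&temp_url_expires={expires}")
print(url)  # hand this to the customer; no account password ever leaves the site
```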
Okay, so erasure coding. Joe mentioned this last talk, maybe not, but it's coming up, it's going to be awesome. It's going to reduce our overhead in terms of drives, but we feel somewhat similar to Dirk in that the cost of drives, now that we've beaten this into our heads and seen that we don't need to be enterprise, really isn't that bad, and the cost of powering these drives really isn't that expensive either. Bandwidth was one of the more expensive things, a hurdle that was tough to get over. We were able to do that, like I mentioned, with the off-site backup stuff, but with erasure coding, ideally we'd be able to have erasure coding locally and then just the one object replica off-site for our CDN stuff, which would be pretty neat. I mentioned using Swift middleware; that's going to be much more common as you see object storage evolve. People are going to be using middleware as a way to search things, a way to be much more intelligent about the application that's being written. There was even a question about video last talk, and that's a really good application for middleware, middleware that can be intuitive about what it's doing. So an example we have at HudsonAlpha is that BAM is kind of the perfect format to store a human genome in, because from the BAM file you can go forward to the variant calling piece, or you can go back to the FASTQ piece, which is just one step past what comes right off the sequencer. So being able to store that and then translate it in middleware actually helps a lot, right? Because we used to have to store FASTQs and store BAMs, where now we just store BAMs, which lowered our footprint tremendously, because a FASTQ is about three times as big. But we can still serve up FASTQ in real time. Just think of it like compression. I mean, SwiftStack already does that, I think natively, or maybe it's still middleware, but being able to store something compressed and then, as soon as you retrieve it, have it uncompress on the fly, on the proxy. Is it on the proxy? You do it on the proxy. On the proxy. That's right. All right. Easy authentication and departmental billing, blah, blah, blah. Yes, so perfect. Those are some pretty interesting things. And I mentioned also the medium changing. There's a lot of talk now, right, with a lot of different drive manufacturers, I say a lot of different, two or three, that are doing some really, really neat things that are going to completely revolutionize storage. But when that happens, there's still a huge issue of utilizing the drive and abstracting that out to the user, right? There's still this big middle piece that's missing. So if Seagate comes to me and says, hey, Brandon, now we have 100-terabyte drives and this awesome thing, I'm like, yeah, except I still, you know, what's going to end up happening is, you saw the picture that Dirk had of the USB drives plugged into the NAS appliance. It will be just that, but with 100-terabyte drives, right? The same thing will happen if there's no structure in place. And so SwiftStack really allows us to continue on that path and, as mediums change, still remain very effective. Okay, yeah, so the off-site replica, I put in like a thousand slides, so I'm not going to talk about that anymore. So go ahead, Joe. No, yeah, you talked about this, and being able to do off-site backup really touches on the multi-region capability. So you can put data into the environment and then point users to that other location and grant access there, so they're not hitting back at the institute.
They're going to be going to some place that's better connected to the internet. Right, so for HudsonAlpha, it's really doing true infrastructure as a service, as mentioned on the slide. Basically we're able to deliver something very close to the Amazon product in terms of durability and availability, plus a whole bunch more customization and features, for just a fraction of the cost. And so for HudsonAlpha, that's really been a big benefit, but we also go to other institutes and say, man, you can do the same thing, because everybody's kind of fighting the same problem. And with genomics we have this need because of the drop in the price of sequencing, but there are a lot of industries out there now that have similar kinds of needs. Like we mentioned with video: with the cost of cameras being so cheap and the resolution of cameras now, people are saying, I want to shoot 10 different 5K or 4K angles instead of just having this one really expensive camera that was once 1080p, which is neat. So one of the other things we're using Swift for, which is kind of cool, is all the public notifications in the state of Alabama, which is something that I work on. And being able to actively, through port mirroring, dump all of the active voice calls into Swift for compliance was really, really nice. Because that kind of throughput, concurrency-wise, was very difficult to achieve even with some of the scale-out NAS that we deal with. You're still an IT shop and you still have to deal with other workloads. Yeah, exactly. Including NAS. Okay, cool. And you guys might be saying, is this a Titan missile silo? Yes, it is. And we may or may not, but definitely do, have a purchase option on a Titan missile silo. I don't know. I'm from Huntsville, Alabama, we like rockets. You know, we just kind of go with it. But I think it'd be a really awesome place to store data, because it can withstand a nuclear strike. I don't know. I thought that'd be pretty interesting. So, yes. All right, I think that's what we had. If there are any questions? Yeah, shoot. That's a great question. It's pretty much industry standard. I've seen it used in a lot of places, a lot of institutes. I think what we're trying to do here, and we've actually talked with them before about this, is do a lot of the actual storage of the big files in Swift or something similar. So you still have that kind of industry-standard management interface that a lot of people are used to, but you're able to use the flexibility of Swift and the off-site backups and CDN as your actual point of destination, or actual point of origin, which you can't do currently. So that's a really good question, and one that we're trying to solve by taking the actual back-end storage away from the management platforms. And I'm sure Dirk probably has a hundred different content management platforms inside of Fred Hutch, but we deal with the same thing, where a group of researchers will come in, we'll recruit them, and they kind of have their own special sauce. They have their own thing they like to use, their own write-only Mongo database that they like to use, or whatever. But then they ultimately have the need to, I mean, a good group of our researchers ultimately have the need to share the data and to collaborate. So from there, we want to funnel everything down at the end and say, okay, it ultimately has to end up in Swift. And then we can control the access layer, but we're also really dynamic with that, right?
We can use Active Directory or some other authentication system to control the access layer as well. The end goal is to be able to get off-site, to where bandwidth is really cheap. And in the UK, that's even more of an issue. Self-describing archives, that was the question. Yeah, we've looked at a lot of stuff. The biggest thing for our requirement was that our near-line immediate storage also had to be our customer-facing storage. It couldn't be different. So we're currently using the same old tape architecture and infrastructure that everybody else is using for long-term storage, but there are a whole bunch of questions even related to that, about what happens with tape degradation over a number of years. So now that we've been able to get our pricing down to well below what Glacier costs, the institute is currently very happy keeping all of that on spinning disk. There might be some stuff in the future about actually spinning down disks, which we've personally looked at doing, to be similar to Glacier: spin up when the data is needed, then spin back down. But power is just not very expensive with these drives. Especially where we're at in Huntsville, I think we pay like six or seven cents a kilowatt-hour. So it's relatively cheap. Any other questions? One more question, yeah, good. Good question. The question was, what changes did you have to make to the applications in the HPC environment? Good point. And just like Dirk was saying, it was really kind of a wrapper. So we abstracted it from the developer. We said, look, and this was something the IT department probably shouldn't have done, but we said, let's be a little bit DevOps here. Let's go into this, you know, billion-line bash script, which is literally how it happens. It's like they say all of finance runs on Excel, and it's so true; there are billions of dollars of funds that have Excel doing stuff in real time, it's crazy. But to get back to your question, we were the ones that architected that piece. The first step was the filesystem gateway. We said, for now just use it like normal, because you were already using NFS. But we don't think that's going to be good when they're using the local storage, or the remote storage, as their kind of scratch space, right? So what we're doing now is, and we were a little bit lucky here, beforehand the majority of the bash scripts that were doing the processing were actually pulling data down to local disk, manipulating it locally, and then pushing it back. So re-architecting that to use object storage is really simple, you know? But if someone's actually using that storage tier as their main tier, and they're not pulling it down locally, for whatever reason, constraints of local disk or whatever, it becomes a little bit more difficult. So I hope that answers it a little bit. All right, thank you everyone. Please come up and say hey.
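As a footnote to that last answer, here is a minimal sketch of the pull-down, process-locally, push-back pattern described above, assuming python-swiftclient; the container names, object names, and the alignment command are made-up examples, not HudsonAlpha's actual pipeline.

```python
# Minimal sketch of the download -> process locally -> upload-result pattern
# that the existing bash pipelines already followed, re-pointed at Swift.
import subprocess
from swiftclient.client import Connection

conn = Connection(authurl="https://swift.example.org/auth/v1.0",
                  user="hpc:pipeline", key="secret")

# Pull the input down to local scratch, just like the old copy-from-NFS step.
headers, body = conn.get_object("ngs-fastq", "run-042/sample-007.fastq.gz")
with open("/scratch/sample-007.fastq.gz", "wb") as f:
    f.write(body)

# Run the existing tool against local scratch, unchanged (command is made up).
subprocess.run(["align-and-sort", "/scratch/sample-007.fastq.gz",
                "-o", "/scratch/sample-007.bam"], check=True)

# Push the result back into Swift for downstream steps and distribution.
with open("/scratch/sample-007.bam", "rb") as f:
    conn.put_object("ngs-bam", "run-042/sample-007.bam", contents=f)
```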