I'm going to spend at least half an hour, maybe 40 minutes depending on how many questions people have, just talking about data management. The next-generation technologies, the SOLiD and Illumina platforms in particular, generate a lot of data. And if you haven't had a cluster before, or never had much need for IT in your lab, it can be a rude awakening. It can certainly be a shock as to how much is actually required. So for those of you who have machines already, hopefully this is what you already have; and if not, maybe this will give you some leverage for lobbying for more IT support.

Next-generation sequencers generate gigabytes of data. Generally, the SOLiD and Illumina machines out there are producing around 20 to 30 gigabytes of data per run. And when you're looking at those volumes of data, how you structure your compute resources really matters. One of the arguments I've had to make repeatedly to people who perhaps don't know as much about IT is: why is it that I can go to Future Shop or Best Buy and put a one-terabyte disk in my home PC for a couple of hundred dollars, and yet you're asking for something four or five times that price, a thousand dollars, for a terabyte of storage? There's often a disconnect there as to why consumer storage is cheap but the storage we need for next-generation sequencing has to be so much more expensive. The answer is actually quite simple: it comes down to scalability, reliability, and throughput. That one-terabyte disk you buy for your home PC is going to last you a few years, unless you take tons of photographs and download tons of movies; for most people it will last years. With next-generation sequencing, that same disk would last you, if you're lucky, a month. Scaling those one-terabyte consumer disks up into an array is not practical, getting reliability out of them is not practical, and certainly getting throughput out of them is not practical. So those are all things that come into play.

If you're the one choosing compute resources for your lab, then you're probably aware that designing the right system architecture is going to be important for getting the most out of your system. If you don't design the lab resources but primarily use them, understanding how those compute resources are arranged and organized will still help you get the most out of them. In more advanced research environments, where the IT resources are designed and architected at a higher level, you'll often find both slow storage and fast storage. If you're trying to run your next-generation sequencing analysis out of the slow storage, where there are gulfs of space, you might have lots of room available, but it's going to run really slowly. So understanding how things are structured, including your home space, is important.
In the time I have available you're not going to learn a lot about computing, but hopefully you'll understand the basics of computing systems, how they're designed, and the main features you need to think about, especially in the context of purchasing decisions. But the key thing is to talk to your local expert. And if you don't have a local expert, make a friend at a larger genome centre; that's the best advice I can give you. If you don't know anyone at a large genome centre, make friends with Francis, if you don't already know him. He's not in the room right now, but certainly OICR has enough expertise that they can help here.

The basics of any computing system are fourfold: CPU, disk space, and RAM are the three most people are aware of and talk about, and then the fourth is bandwidth, which is sometimes neglected, but you neglect bandwidth at your peril. CPU obviously determines how quickly your operations execute, how quickly your computer works. Most bioinformatics applications, alignment and the like (assembly does not, but alignment for the most part), fall into what's called the embarrassingly parallel category of applications, in that you can easily divide the work up into separate pieces, run those pieces on separate compute nodes, and then simply sum up the results, cat all the results together at the end into one big file, and that's it. There's no interdependency between those individual processes; for the most part, anyway, there are always exceptions. And that means it's very easy to divide up a big data set and run it in parallel, and in that environment, getting your application to run faster is simply a matter of adding more CPUs to the system. That's what's led to the great uptake of clusters in bioinformatics over the last 10 years or so. The advantage now is that we're moving to CPU architectures where you can get multiple cores within a single node, and that's making clusters smaller, which is good news for us.

Throughout my slides you'll see either crossed-out or new text; this is where I've updated my slides from last year. Last year what I said in this session was that you need around eight CPUs per sequencer to handle the data rates. I now say you actually need eight dual quad-core boxes, so that bumps my requirement up by a factor of eight. That's just a consequence of the extra data rates that we now have. But these sorts of boxes are relatively easy to come by; you can go and buy them from Dell or wherever, and they'll have these boxes available.
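To make that "embarrassingly parallel" pattern concrete, here is a minimal sketch in Python: it splits a read set into chunks, aligns each chunk on its own worker with no communication between them, and concatenates the outputs at the end. The aligner command (`my_aligner`), the reference, and the file names are hypothetical placeholders, not any particular tool's interface.

```python
# Minimal sketch of an "embarrassingly parallel" alignment run.
# The aligner command and file names are hypothetical placeholders.
import subprocess
from multiprocessing import Pool

CHUNKS = [f"reads.part{i:02d}.fastq" for i in range(8)]  # pre-split read files

def align_chunk(chunk):
    """Align one chunk independently; no communication between workers."""
    out = chunk.replace(".fastq", ".sam")
    subprocess.run(["my_aligner", "--ref", "genome.fa",
                    "--reads", chunk, "--out", out], check=True)
    return out

if __name__ == "__main__":
    with Pool(processes=8) as pool:        # one worker per core (or per node slot)
        outputs = pool.map(align_chunk, CHUNKS)
    # "cat" the per-chunk results back together into one big file
    with open("all_alignments.sam", "w") as merged:
        for path in outputs:
            with open(path) as part:
                merged.write(part.read())
```

In practice the same pattern is usually driven by a cluster scheduler rather than one multi-core box, but the shape of the job is the same: independent pieces, then a trivial merge.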
In terms of RAM, RAM is important because it allows the computer to store information in memory situated very close to the CPU, so instructions can operate on that data very quickly, with only a relatively small delay in getting the data out of that space. Having enough RAM for the application is important. Typical sizing is two gigabytes per CPU, so in the eight-way box I mentioned previously you probably want around 32 gigs of memory, and that should work fine with most aligners. Assemblers typically require a lot more RAM; for a human de novo assembly in particular you probably need 96 gigs of RAM or so. There has been some development on assemblers that use less RAM from the group in Vancouver: there's a de novo assembly application called ABySS that is distributed across the cluster, and the way it's architected it doesn't require a huge machine with lots of RAM.

If you run out of RAM for your process, the computer goes into a state called swapping, where instead of reading data from RAM it's reading it from disk repeatedly, and it's not able to cache any of that data from the disk in RAM. Disk being a relatively slow-access storage medium, the computer's speed just goes right down; it's limited by the disk access speed, and your CPU ends up spending most of its time sitting around idle. You want your process to be CPU limited, and that's why having enough RAM is important for your application.

In terms of disk space: unlike RAM, which is temporary storage whose contents are erased every time the machine is switched off, disk space is essentially permanent, and it's cheaper to create large storage on disk. The downside is that it's slower to access. There are ways to improve the access time: RAIDing the disks allows information to be read from multiple disks in parallel, and that improves performance. Something to be aware of when you're reading from large data sets is that if you're trying to read from multiple places in a file, or reading through a file to find a particular piece of information, you have to seek to a particular location, even if you know where that location is. I think BAM files were talked about yesterday, yes? BAM files are indexed, so the advantage is that you don't need to read through the whole file to find the read you're interested in. But even though you know where that read is located in the file, you still need to seek to that location: the actual disk head needs to move to where that read is, and that takes time. It's very short, on the order of 5 or 9 milliseconds or so, so it's relatively fast. But if you're seeking around trying to extract a million reads, and they're located all over a huge file, you're going to have a million seeks, and that's going to start to slow your program down. So when you're designing and thinking about how to write software, that's something you have to bear in mind. That's also why these files are often sorted by genomic location, so that once you find a particular position you can read straight through from that point, and all the downstream operations benefit from that.
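As a small illustration of the indexed-BAM point, here is a minimal sketch using pysam; the file name and region are placeholders, and the BAM is assumed to be coordinate-sorted and indexed (samtools sort and index). The index lets you jump to a region with a handful of seeks rather than streaming through the whole file, which is exactly the trade-off described above.

```python
# Minimal sketch: indexed access to a coordinate-sorted BAM with pysam.
# "sample.bam" and the region are hypothetical; the file must be sorted
# and indexed for fetch() to work.
import pysam

bam = pysam.AlignmentFile("sample.bam", "rb")

# The index tells us where on disk this region lives, so the reader can
# seek there directly instead of scanning the whole file.
for read in bam.fetch("chr1", 1_000_000, 1_001_000):
    print(read.query_name, read.reference_start)

bam.close()
```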
OK, so in terms of disk space, how much are we looking at? Sequencers right now generate around 2 gigabases of data a day, and that's just the sequence data. You have quality values as well, anywhere from one to four quality values per base sequenced, and then you also have the additional files, alignment files and so on. All in all, you can pretty much figure that each machine you have will generate around 35 terabytes of data a year. And this is where, as I mentioned earlier, the scaling of the storage becomes important, so that you have a system which can scale to allow for that type of data.

So really, this is probably the biggest problem being faced today. We're now looking at around 500 gigabytes of just sequence data and quality values from a single run. Last year the question was: should you store images? The answer to that question was no. Now we're at the point of asking: should you actually store intensity values? As of today, NCBI still requires you to submit intensity values to the public database, but there is a discussion taking place, led by the sequence databases NCBI, DDBJ, and EBI, around whether or not intensity values should be stored. So if you have the opportunity to talk to those groups, make sure you say the answer is no. It's a lot of information to store. Intensity values are useful if you plan to re-call the data: if at some point you anticipate there's going to be a new base caller and you want to run your data through it. Say, for example, you have a rare sample that you know you're never going to get access to again. You may want to keep the intensity values for that sequence because you'll never have the chance to sequence it again, and if a better base caller comes out in a year or two, you can potentially rerun it on that data and perhaps get an extra percent or two of added value.

Question from the audience: so you said we shouldn't store the images, but that we could store the intensity values while waiting for a better base caller that might come in the future. But for example with Illumina this time, they had a new version where the old images were clustered better and we were able to get more reads. So couldn't we also keep the images for the future, since the clustering gets better? Not to contradict that, but do we have something better we could use to compress these?

I think there are some groups working on compression schemes. It's a good point you raise, though. At some level it's often a very individual decision as to how much you can afford to keep. I know there are many genome centres that keep their data for maybe six months and then throw it away; there are other places that throw it away right away; and there are places that, I think, have it all on tape as well. So I think it's fine to do that. That's an argument that's been made, and at some level it's probably true for things that are commonly available, and there are definitely in-lab systems already where data is collected and thrown away automatically by the machine. It depends on how much you want to spend and how much you think you're going to get out of the data at the end of the day, and that really is a very individual decision. But I don't think I answered your original question; what was the original question again? I was wondering if you know of any better compression techniques out there. Better compression techniques. I've heard of one effort, I don't think it went anywhere, where a group was looking at trying to turn the data into a movie, so that you could then use the compression techniques that are used to compress video. Potentially you could use that, but there's always data loss.
And so then the question becomes: how much data are you willing to lose in order to compress the image? It's very hard to do. Lossy compression is commonly used in other fields; you can degrade the quality of a photograph and retain the basic elements quite easily, but it doesn't seem to work as well with our data. So I don't know of anything out there that's currently being used to compress images. I did a little bit of research on this last year, and there was nothing immediately applicable. I remember one group looked at it, and basically they couldn't get it to do any better than gzip. In their research there was nothing inherent in the images that they could find that would allow them to do much better than they were getting out of gzip, which to me says it's pretty much a lost cause. So images: you can store them if you have the space and the desire. Should you store intensities? I think right now there's probably some value in that, but as we go forward, as base quality values and other things become better and more stable, I don't think it's going to be as necessary. And I would argue it's probably not necessary today, because really what you have to ask yourself is: are you ever going to go back to that data and re-analyse it? Really, honestly. That's the question you have to ask. If you have a rare cancer sample, it might be worth doing, but I've just never seen it happen. I've seen papers published on re-analysis of collections of microarray experiments, but that was on CEL files; I haven't seen anything where anyone's done this with sequencing. So think about it. I mean, you can do it; it's your money. Someone suggested you could just have a central fridge that you send samples out from: you send people a sample rather than data, and that would be more compact. Anyway. OK. All right.

Bandwidth. Bandwidth is often neglected, but it's actually the most critical piece in designing a compute architecture that can handle this type of data. Bandwidth limits the maximum transfer rate between two points, and the two points you're primarily interested in are your CPU and your disk. That's going to be the rate-limiting step, and if you don't design your compute system well, it's going to be a problem. Thinking about the algorithm: the CPU is going to process data at a particular rate, and you always want to optimize your architecture so that you max out CPU utilization. In other words, your processing speed should be limited by your CPU speed and not by the underlying disk, because usually the CPU is the most expensive component of your system. In order to feed the CPU at the right rate, the bandwidth to the CPU must be at least as big as the rate at which the CPU can process the data. So just to walk through a little thought experiment: say you have an aligner that can process reads at x reads per second on a single CPU, which corresponds to a data rate of y bytes per second, and let's say you can process 200 million reads in 10 hours.
Each read is 50 bases at 10 bytes per base, which works out to a required bandwidth of about 2.7 megabytes per second. Let's say you have 100 terabytes of storage connected to your CPU resource, and you design the bandwidth of that connection to be 10 megabytes per second, which gives you plenty of spare bandwidth. Now, if you want to complete the job in an hour and you get permission to buy 10 more CPUs, you might think, great, that will let me finish the job much more quickly. But in fact what you're going to hit is the bandwidth limitation. For those 10 CPUs to run at maximum speed, they need to be supplied with data at a rate of 27 megabytes per second, and since the bandwidth of the connection is only 10, the job is limited by that factor. So no matter how many CPUs you add to the system, it's never going to finish in much under three hours. That's why bandwidth, although it's often overlooked, is critical, and that's why storage costs so much. In order to have storage that can serve data out to CPUs at high rates, you need high-end storage; the cheap terabyte drive from Future Shop or wherever is not going to be able to meet the needs of your CPU architecture.
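Here is that thought experiment worked through as a quick back-of-the-envelope script, using the numbers quoted above; once the 10-megabyte-per-second link saturates, the wall time comes out at just under three hours no matter how many CPUs you add.

```python
# Back-of-the-envelope version of the bandwidth thought experiment above.
reads        = 200e6           # reads in the job
bytes_per_rd = 50 * 10         # 50 bases, ~10 bytes per base (sequence, qualities, etc.)
total_bytes  = reads * bytes_per_rd              # ~100 GB to move off disk

single_cpu_hours = 10
single_cpu_rate  = total_bytes / (single_cpu_hours * 3600)   # ~2.7 MB/s per CPU

link_rate = 10e6               # storage link provisioned at 10 MB/s
n_cpus    = 10                 # after buying 10 CPUs

demand   = n_cpus * single_cpu_rate              # ~27 MB/s wanted
supplied = min(demand, link_rate)                # link caps delivery at 10 MB/s
runtime_hours = total_bytes / supplied / 3600    # ~2.8 hours, not 1 hour

print(f"per-CPU rate  : {single_cpu_rate/1e6:.1f} MB/s")
print(f"demand vs link: {demand/1e6:.0f} vs {link_rate/1e6:.0f} MB/s")
print(f"wall time     : {runtime_hours:.1f} h")
```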
So the best balance for your compute resource is going to be application specific. Aligners and assemblers require a different balance, but most jobs in next-generation sequencing right now are around alignment, and that's the major bottleneck, so if you have limited resources, it's best to design your system around the biggest bottleneck. Ideally you'd have different systems for different parts of the pipeline: if you're doing both assembly and alignment, you may want to design your alignment system separately from your assembly system. For your assembler you'll probably want a few CPUs attached to a large amount of memory; for your aligner you may want lots of CPUs attached to your disk space with high bandwidth. It's just a question of thinking about those sorts of aspects.

Oh yes, backing data up. How many people here actually back up their data? That's actually pretty good; that's better than last year. It's hard, everyone hates doing it, but it's one of those critical things. OK, how many of you are backing up data to disk? So I assume, presumably, everyone else is backing up data to tape, is that right? Put your hands up if you back up data to tape. OK, and if you back up to tape, how many of you actually go back and check those backups periodically? So here's the thing: it depends on where you're backing up to. If you're backing data up to tape for long-term archiving and never checking it, don't expect that data to be intact after five years; you may as well kiss that data goodbye. Magnetic tape has a lifespan, and keeping it around gathering dust on a shelf is not doing anything except making your bookshelf look like you have lots of magnetic tape on it. Backing up to active disk is probably the easiest way to back up, although it's fairly expensive as well, and you can't take your active disk off site very easily either. So really, I don't have a good solution for this.

Within my own company we're looking at mirroring between sites; we're lucky because we have multiple sites, so that's probably what we'll do for our data. But there's actually no great solution out there, and the issue with mirroring is that you need high bandwidth between your sites. So there's no great solution except the SEP field, which, if you've read Douglas Adams, you'll know is the "somebody else's problem" field; making it somebody else's problem is the best way to solve this one. Unfortunately, right now I'm responsible for the IT resources in my area of the company, so I don't have that luxury today.

Another way to think about these things, backing up data and even processing data, is with clouds, as Francis mentioned earlier. Probably the most familiar cloud to people is the Amazon EC2 system, and there's been some growing interest in applying clouds to bioinformatics. Sorry, I should start from the beginning: a compute cloud is a virtual system that you connect to over the internet. From your perspective it's a virtual resource that you can send data to, with CPUs and machines there that you can use and access, usually on a pay-as-you-go model. Let me just bring it up here. For example, you can go to Amazon's EC2 web page and sign up. Once you've signed up, you essentially get a pass key, and you can use that key to start up a machine on their virtual compute cluster. You can produce your own machine image: they have standard images out there that you can just use, standard Linux environments where you can just start up a machine, but you can also build your own, so if you have your own Linux environment, you can make an image of it and put it out there. There are companies like BioTeam, a consulting firm, that will give you a customized solution that works on EC2, and on the AB side we've also worked with Geospiza, looking at cloud solutions as well. There's all the information here on the types of machines they have. You can have as many machines as you want, basically; you can get 20 easily without doing anything special, and you can reserve more if you want, but you have to contact them to get more than that. They call these things instances. You can see you can get an extra-large instance with 15 gigabytes of memory and eight EC2 compute units; underneath, each compute unit is typically something like a 1.7 gigahertz core. These are all virtual machines, so they'll be running on some other hardware, but what they provide to you is that virtual machine. That's their largest memory size right now, which for some bioinformatics applications is not that high, but they're planning to bring out larger sizes in the fall as well.
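For a flavour of what "start up a machine on their virtual compute cluster" looks like programmatically, here is a minimal sketch using the boto3 Python library, which is today's client rather than what existed at the time of this talk; the AMI ID, key pair name, and instance type are placeholders, and credentials are assumed to be configured already.

```python
# Minimal sketch of launching an EC2 instance with boto3.
# The AMI ID, key pair name and instance type below are placeholders;
# AWS credentials are assumed to be configured (e.g. ~/.aws/credentials).
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-xxxxxxxx",      # your machine image, e.g. Linux with your pipeline installed
    InstanceType="m5.2xlarge",   # pick a size with enough cores/RAM for the job
    KeyName="my-keypair",        # SSH key pair registered with your account
    MinCount=1,
    MaxCount=1,
)

instance_id = response["Instances"][0]["InstanceId"]
print("Launched instance:", instance_id)

# Wait until it is running, then you can ssh in and start work.
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```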
It does cost money, of course; let me see if I can bring it up. OK, my mouse keeps disappearing here. So, storage: they also have this thing called S3, simple storage, and for data of our size it can be fairly expensive. Uploading costs about 10 cents a gigabyte, so it's about $100 to put a terabyte into the system. You can also FedEx disks to them, and if you have a public data set, I think they make it available, or store it, for free. There are also ongoing storage costs. These things cost money, but the advantage is that if you don't have anything available right now, you can get up and running without too much initial cost. For the longer term, though, I think people are still evaluating the cost-effectiveness of a cloud-type solution and figuring out what makes the most sense.

Question? Ah, yeah, that's a great question: how long does it take to get the data up there? A really long time is the simple answer. You need a really good internet connection to get it up in a reasonable time; otherwise it's going to take days to get a data set up there. That's the short answer, and that's the rate-limiting step. There are ways around that which we're looking at at AB. Sorry? Yes, you can always FedEx it; there are other ways as well, which I can discuss offline, for getting data up there. But it is a large bottleneck. I think the biggest hurdle for cloud right now is actually getting the data into the cloud. Longer term, where it's perhaps useful is as a storage medium for archiving, because they will keep it live. It will take you a while to get the data back, but presumably you're only putting things into the archive that you're not expecting to work on actively. It could also be an option for a small lab that doesn't want a large initial outlay on resources and wants to get comfortable with its next-gen box first before expanding.

LIMS systems. These are obviously a key component of any lab, especially if you have multiple machines and are generating lots of data. LIMS, and I think you asked me this question last year as well and I still didn't write it out, stands for laboratory information management system. A LIMS is a way of telling you where all your data is, what it means, what experiment it referred to, and why you generated it in the first place. There's nothing worse than going into a directory and trying to figure out what the files in there are, what they all mean, and what the experiment was. So a LIMS is at least some type of database; at the very least, a spreadsheet is important for keeping track of the data and managing where it is. But LIMS systems provide automated ways of keeping track of this information. Often they have barcode readers, so if you have lots of tubes and samples, you barcode them and the system keeps track of how they flow through the pipeline. The database stores the location of the files, and potentially also where the sample is in the fridge. If you're running multiple machines, you definitely want something like this.
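To make the "at the very least, some type of database" point concrete, here is a minimal sketch of the kind of tracking tables a bare-bones, home-grown LIMS might start from; the schema, barcodes, and paths are purely illustrative and not taken from any particular LIMS product.

```python
# Minimal sketch of a bare-bones sample/run tracking database (illustrative only).
import sqlite3

db = sqlite3.connect("lims.sqlite")
db.execute("""
    CREATE TABLE IF NOT EXISTS samples (
        barcode      TEXT PRIMARY KEY,   -- barcode on the tube
        description  TEXT,               -- what the sample is / why it was run
        freezer_loc  TEXT                -- where the physical sample lives
    )""")
db.execute("""
    CREATE TABLE IF NOT EXISTS runs (
        run_id       TEXT PRIMARY KEY,
        barcode      TEXT REFERENCES samples(barcode),
        machine      TEXT,               -- which sequencer produced it
        data_path    TEXT                -- where the output files are stored
    )""")

# Record a sample and the run that was performed on it.
db.execute("INSERT OR REPLACE INTO samples VALUES (?, ?, ?)",
           ("BC0001", "rare tumour biopsy, project X", "freezer 2, shelf 3"))
db.execute("INSERT OR REPLACE INTO runs VALUES (?, ?, ?, ?)",
           ("RUN_2009_042", "BC0001", "SOLiD-1", "/data/runs/RUN_2009_042"))
db.commit()

# Later: where are the data files (and the tube) for this barcode?
for row in db.execute("""SELECT s.freezer_loc, r.data_path
                         FROM samples s JOIN runs r USING (barcode)
                         WHERE s.barcode = ?""", ("BC0001",)):
    print(row)
```

A real LIMS adds automation, barcode scanners, and a web front end on top of exactly this kind of record keeping.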
In terms of other software, there are a lot of commercial systems, and there are also some free systems out there. Sometimes there's a reluctance in places to use existing solutions, commercial or otherwise, but it's really a question of where you want to spend your time. If you're a bioinformatician, maybe this is something you're actively interested in building, but there is a lot of free stuff out there, so if you don't want to spend your time developing it yourself, you can probably find something that does 90% of what you want. And for the other 10%, you may be able to start up a collaboration with that lab, or, if it's a company, persuade them to modify their product to meet your specific needs. So I'd encourage you, obviously, to look at off-the-shelf solutions. Excuse me a second.

So, standards. There are emerging standards in this space, which I think is really good news for the most part. On sequence standards, we were involved in a standard called SRF for sequences from next-gen data. I think eventually that will be replaced by another sequence standard; it was developed in the early days of sequencing, when we didn't really know what we were going to want to store longer term, and I think there are now slimmer, leaner, faster ways to do things now that we know more about the data. But there are standards around sequencing. For alignments, SAM and BAM were introduced yesterday, and those have been very strongly adopted in the community for storing alignments. Yet to come is the equivalent for assemblies, and for representing the relationship between the genome sequence of an individual and a consensus reference genome. Annotation standards are there, but how they apply to next-gen data is still being worked out. And then as we move into the clinical space, there are a lot of other things to think about as well.

The most critical point for this group is that standards are important to support, because what they allow us to do is just focus on the science. If everyone writes to a common standard and all the tools that are generated work off that common standard, you don't have to worry about writing glue code; you know that the output of one program is going to feed into the input of the other, and it just works. I mean, this is becoming an older example, hopefully no longer as much of a problem as it was when I was doing my PhD, but at one time pretty much every bioinformatician in the world had written a BLAST parser. It's less of a problem now with the emergence of BioPerl and things like that. But having a standard output format means that multiple aligners all write to the same standard. So go out there and support SAM and BAM; we're certainly doing that within AB. And if you build it, will they come? That's the other question with standards: you never really know whether adoption of a standard is going to take off or not, but you often have to just put these things out there and see if they work. So I'd encourage you to support standards if you're interested in them. They're never going to win you a Nobel Prize, but they are an important part of getting the work done. And that's where I'm going to leave it. So thanks a lot.