[Unintelligible] ...varieties of people. But our focus is really to help researchers get the best value out of the systems that we're building, to help them get research done. We're not interested in building big computer systems just for the sake of doing so. So this is pretty much the core of our business, DNA sequencing. This blue box in the middle is a DNA sequencer; it's about the size of a large office laser printer. [Unintelligible] ...three days at a price point, and that's about $8,000, and we see this trend continuing, so the $1,000 genome is probably going to be here within the next 12 months, 24 months maybe, and obviously this decrease in the price point increases the amount of data that we get. So no talk from a sequencing institute would be complete without the scary graph. This is our data output over the past 10 years or so, and you can see it's a very steeply growing curve. When we were doing the original human genome, our yearly sequence output was way off the bottom of the graph: we were doing 30 gigabases of sequence a year. We are now doing 7 to 10 terabases of sequence a week, a huge increase in data output, which we at the IT end have to deal with. This trend is going to continue, because there's constant improvement in DNA sequencing technology. So this was a press release from a company called Oxford Nanopore of what the sequencing machine that they're hoping to release sometime this year is going to look like.
So there's a little sample well where you pipette your DNA sample, and then it's USB attached and you go and plug it into your laptop. Those of you in IT services organisations will be dealing, like we are, with the headache of bring your own device, where people want to have access to all their data on their mobile devices, which is very complicated. Well, our lab people are now going to be dealing with bring your own sequencer. We have no idea how this is going to change things, apart from we know it is going to change things. And of course, large organisations are going to cluster these together into large racks of sequencers to give us the big throughput that we need. So what are we doing with all these genomes? We have a number of large scale sequencing projects. The UK 10,000 genomes project is one of our flagship projects. The idea is that we now have a reference human genome, but everyone's different. So if we sequence lots of genomes and then compare them, we can look at differences between genomes, perhaps correlate that with medical records, and find the genetic causes for a wide range of common diseases. That's useful from a predictive point of view, but where we're going to see real benefits very, very quickly is in personalised medicine, especially for cancer genetics and cancer treatments. A lot of the way that a cancer responds to chemotherapy is down to the genetics and the mutations that occur inside your particular cancer cell line. And for an oncologist it's a bit of a process of trial and error: if you have a patient, you have to try lots of different treatments to actually find one that works. So you can spend a lot of time giving people drugs that aren't particularly good for them and have lots of bad side effects for no real benefit. We hope that by doing genetic sequencing of a patient's tumour sample we can say these are the drug treatments which are likely to be successful. A lot of our other projects also exist in frameworks of international collaboration, so again, being able to share data and resources quickly on an international scale is crucial to our success. So our IT requirements basically have to grow to match our sequencing data outputs. Historically we have doubled our compute and disk capacity at the Sanger every 12 months, so we have about 17 petabytes of usable disk space at the Institute this year. The requirements of the science change very rapidly. There's this constant improvement in sequencing technology, but that improvement brings changes in the laboratory process and in data types, and so we have to be able to respond quickly to that. And we can't just keep on throwing more money at the problem, as much as hardware vendors would love us to do that. Our data growth is faster than Moore's Law: our data doubles every 12 months, but the disk and compute you can buy for a given cost only doubles every 18 months. So there's a gap that we have to bridge somehow. And for all the talk of the $1,000 genome, well, the $1,000 genome is not going to include the compute and informatics to make sense of the data afterwards, and that's something that the whole field is going to have to struggle with.
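To make that gap concrete, here is a small illustrative calculation, a sketch only: it assumes, as described above, that the data doubles every 12 months while the storage and compute you can buy for a fixed spend only doubles every 18 months; the starting figures are made up, not the Institute's.

```python
# Illustrative sketch of the gap between data growth (doubling every 12 months)
# and hardware price-performance (doubling every 18 months). Starting figures
# are hypothetical, not Sanger's actual numbers.

def grow(start: float, doubling_months: float, months: float) -> float:
    """Exponential growth from `start`, doubling every `doubling_months`."""
    return start * 2 ** (months / doubling_months)

data_tb = 1_000.0        # hypothetical data volume today (TB)
tb_per_pound = 0.01      # hypothetical storage you can buy per pound today

for year in range(6):
    months = 12 * year
    need = grow(data_tb, 12, months)          # TB that must be stored
    buy = grow(tb_per_pound, 18, months)      # TB per pound available
    print(f"year {year}: {need:8.0f} TB needed, ~{need / buy:10.0f} GBP to store it")
```

Even with Moore's-Law-style price improvements, the spend needed to keep up roughly triples over five years in this toy model, which is the bridging problem described above.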
So what are the data flows that we tend to deal with? At the top we have what we call the sequencing pipeline, where we have DNA sequencers at one end, which spit out data on a continual basis; that data then goes through various steps of quality control processing and data analysis, and is finally archived. And there are two other processes that go along with that. We move from very unstructured data, lots and lots of flat files, into a more structured set: a lot of data ends up in relational databases or other sorts of queryable data stores where researchers can come and pull data down. And hopefully we go through a data reduction phase as well, so we go from tens of terabytes per experiment coming off the sequencer down to the really important data that a medical researcher may be interested in, maybe a few megabytes, does this person have this gene or not, a very, very small amount of data. So one of the concepts that we try to implement from a systems point of view is to say, well, we need agile systems to match the agile development processes that our researchers and informaticians are going through. We want to build modular systems so we can scale and replicate them quickly to keep up with changing demands. So we've built out blocks of compute and networking and storage, and we always try to assume from day one that we're going to be changing it and adding more stuff further down the line. We're not going to do a standard three-year procurement where you buy a system, plug it in, run it for three years and then turn it off; we're going to have this constantly evolving, incremental approach. It means we can expand quickly by adding more blocks, and hopefully they're blocks that we've looked at before, so there's nothing new. We use lots of automation to simplify the management of our systems: as our systems double in size every 12 months, I don't double my team size every 12 months, so we have to do more with the same headcount. And I guess one of the key enablers is making storage visible everywhere. We have fat networks now; 10 gigabit Ethernet is a real enabling technology that allows us to have disk and compute reasonably close together. And if we want to access some data from a different compute cluster than before, with a nice fat network we can do that and not pay too much of a penalty for moving data around. So our compute modules are very standard. We use commodity Intel or AMD servers. We like blade form factors; they fit well in our data centre, and they're good from a data centre space and management point of view. Ours is a very embarrassingly parallel workload, so we're not running huge jobs that require very low latency networks like you might find in a traditional high performance compute centre; we have individual tasks that don't need to go and talk to one another. It means that if you add more compute, you generally do get good scaling: if we add a thousand more cores of compute we get a doubling of our throughput fairly easily. We try to keep things very generic. We really haven't looked at, or seen much uptake for, newer architectures such as GPUs; we really don't have the applications at the moment to take advantage of them. And we have a footprint of about 2,000 to 10,000 cores of compute in a cluster, with reasonably modest memory sizes.
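As a minimal sketch of what "embarrassingly parallel" means here, the pattern below farms independent per-chunk analyses out to worker processes. It uses Python's multiprocessing purely for illustration; the actual batch scheduler and analysis steps aren't specified in the talk, and analyse_chunk is a hypothetical stand-in for real work.

```python
# Minimal sketch of an embarrassingly parallel workload: independent tasks that
# never talk to one another, so throughput scales roughly linearly with cores.
from multiprocessing import Pool

def analyse_chunk(chunk_id: int) -> int:
    # Hypothetical stand-in for a real per-sample analysis step.
    return sum(i * i for i in range(100_000)) % (chunk_id + 1)

if __name__ == "__main__":
    chunks = list(range(1_000))        # independent units of work
    with Pool() as pool:               # one worker per available core by default
        results = pool.map(analyse_chunk, chunks)
    print(f"processed {len(results)} independent chunks")
```

Because no task waits on another, adding another block of cores simply lets more chunks run at once, which is why the throughput scaling described above comes fairly easily.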
So the storage is a little bit more heterogeneous. We've really ended up building two sizes of storage module. We have some very fast and expensive, high end storage arrays, which give us very, very high throughput and which we run Lustre, a high performance file system, on. It's an expensive solution; I have to say it's a good value solution, but the price per terabyte is not the most competitive one out there. So we also have blocks of more commodity storage, which is basically slower, but for the same outlay we can add more of it, and so if we can scale the workload out across that system then there are benefits there as well. One thing we've battled with is working out how big a module should be. Fewer, larger modules give you fewer points of management, and larger modules are probably more cost effective because there are always economies of scale; but if a module breaks and goes away, you've lost a big chunk of real estate. More, smaller modules mean you've got more things that can fail on you. So we vacillate, depending on how reliable hardware has been in the past and how brave we're feeling, and we end up with storage modules of between 100 and 500 terabytes. So we end up with an architecture that looks like this: various sizes of disk module and various silos of compute. Now, consolidation is obviously a key buzzword in the industry at the moment. When we started, we thought we should build basically one big compute cluster, because that gives us really good consolidation and we can balance workflows and get really good throughput and utilisation. However, we found that we had workflows which were fine if they were placed on a compute cluster on their own, but if we ran them both together we had all sorts of horrible interactions, where well behaved linear data flows would suddenly become randomised as two streams of IO fought against one another. And so we have found it valuable to step back from completely consolidating everything and give reasonable silos of compute dedicated to specific functions. We have some very identifiable, very large projects that benefit from being on their own; smaller projects we can get away with pushing onto compute clusters with other users. Again, the key thing has been the ability to move workflows and projects between clusters without having to pull the whole thing to bits and build it up again, so we're using software and workflow management tools to direct workload between clusters. If one project has a huge peak in compute demand, we can redirect workload without having to recable anything or take downtime hits. So some of the things we've learned: simple is much better than complicated. There's this great acronym, KISS, keep it simple, stupid. It's really easy, especially when you have enthusiastic technology people, to get carried away and say we're going to build a really technically good solution that's got failover and resilience. That looks fine on paper, but what we've found operationally is that the more layers of complicated software you put in the way, the more likely things are to break, and when they do break they're much harder to put back together again, especially if you're hunting around for skilled people, which again is another theme that Rob touched on. Now, that's not always possible. Our use of high performance file systems like Lustre is not simple; it's a pretty big violation of the keep it simple principle. But that's a real need for us, we can't do it any other way, so we have to bite the bullet and do that. But we choose our battles; we don't go for complicated solutions everywhere.
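To put rough numbers on the module-sizing trade-off mentioned above, here is a back-of-the-envelope sketch; the total capacity is a hypothetical figure, and only the 100 to 500 terabyte module range comes from the talk.

```python
# Back-of-the-envelope sketch of the storage-module sizing trade-off:
# fewer, bigger modules are easier to manage but each failure removes
# a bigger slice of capacity. Total capacity here is hypothetical.

total_capacity_tb = 5_000

for module_tb in (100, 250, 500):
    n_modules = total_capacity_tb // module_tb
    at_risk = module_tb / total_capacity_tb   # share of estate lost if one fails
    print(f"{module_tb:3d} TB modules: {n_modules:2d} to manage, "
          f"one failure takes out {at_risk:.0%} of capacity")
```

The sweet spot moves around depending on how reliable the hardware has been lately, which is the vacillation described above.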
The real key has also been communication. Our compute requirements are driven by the researchers, so we have to talk to them, and you have to talk to them a lot. It's no good putting in a system which meets the requirements that they had six months ago, because what they want to do will have changed. And we find that having this constant dialogue helps us to fine tune systems. Researchers may not know what their requirements are going to be; this is research computing, no one's ever done it before. So you build a system that's probably a little bit over spec, and they grow into it; then, once they know what the workflow is actually like in production, when we come to expand in six months' time we can say, well, we don't need a really expensive compute module there, we can get away with something cheaper because we know it'll do the job. Or researchers can make a decision and say, well, I don't have the budget to spend on some expensive hardware, but I can spend some developer time or some PhD student time, which as we know is free, and we can change our workflow a bit to take advantage of the systems that we can afford. Data triage is a really, really important thing. We cannot keep all the data we generate; we do not have the budget for it. So we constantly have to go through this cycle of deciding what we are going to throw away. And that changes: researchers, until they've had data in their hands for 12 months or so, don't know what was actually useful and what wasn't. And you can argue that the process should be to keep everything, because you never know what's going to be useful. We tend to be slightly more pragmatic, and every 12 months or so the sequencing pipeline teams will go and ask, well, what data is actually important? No one's ever going to go back and look at old data and re-analyse it in practice, because we've got too much new data, which dwarfs the old data. So if no one's looked at it, and the researchers come to a consensus, which is actually easier than you might think, to say, well, we actually don't need that bit of data, then we'll just stop collecting it and make do with what we actually have. That really helps to keep this big explosive growth in check. So this is how we ended up blocking out the compute modules onto the sequencing pipeline: different numbers of cores in different parts of the pipeline, and different sorts of storage. So far so good; sort of a solved problem. What we've unfortunately found is that life is not quite as simple as the nice linear diagram I show there. The way researchers tend to work is that data comes through the sequencing pipeline and then they say, well, I've got a really cool new idea for a research project, I want to start de-staging all of this data and analysing it alongside my new data. And so we see this big cycle of data coming back out of the data stores onto our compute farm, and round and round and round it goes. And it's massively problematic, because the top process is all automated, so it's all run by machines which are very good about not duplicating unnecessary data and not leaving old data lying around. This, though, is all done by people, and if you look at your laptops and see the large number of files scattered across your desktop, you know that we're all really bad at organising data, remembering where the important stuff is and keeping it all organised. So we found there's this huge explosion of data in this part in the middle of the organisation, and we really don't understand where it's coming from; in fact nobody understands where it's coming from. We have this big problem of unmanaged data.
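Both the triage reviews and the unmanaged-data problem come down to knowing what has actually been touched recently. A rough sketch of the kind of sweep that can feed those conversations is below; the path and the 12-month threshold are hypothetical, the real decisions are taken with the researchers rather than by a script, and at multi-petabyte scale a naive walk like this would itself be too slow.

```python
# Rough sketch of a data-triage sweep: list the largest files that nobody has
# read for roughly 12 months as candidates to discuss with researchers.
# The path and threshold are hypothetical examples.
import os
import time

SCRATCH = "/data/scratch"                  # hypothetical area to sweep
CUTOFF = time.time() - 365 * 24 * 3600     # roughly 12 months ago

candidates = []
for root, _dirs, files in os.walk(SCRATCH):
    for name in files:
        path = os.path.join(root, name)
        try:
            st = os.stat(path)
        except OSError:
            continue                        # file vanished mid-scan; skip it
        if st.st_atime < CUTOFF:            # last read older than the cutoff
            candidates.append((st.st_size, path))

candidates.sort(reverse=True)               # biggest potential savings first
for size, path in candidates[:20]:
    print(f"{size / 1e9:8.1f} GB  {path}")
```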
People take interesting data and do stuff with it. Doing stuff with data is hard, so it tends to mean that when you've created a big data set you don't want to move it around, because moving it is slow, so you leave it where it was created. So we get important data left in scratch areas that aren't backed up. We get large amounts of data duplication: no one's quite sure where the canonical data set actually is, so we'll just take another copy and store it somewhere. Capacity planning becomes impossible, because you can look at the disk real estate for a group and say they're using maybe 10 petabytes of data, but which bits of that data are actually taking up the disk real estate? Trying to do a normal Unix disk usage scan or find on a 4 petabyte file system is going to take a very, very long time. And it's not just an IT problem. We had huge problems for one research group: they had a 100 terabyte file system, which was what we were building at the time. They'd filled it up, which meant their production stopped, but they had 136 million files in that data store and no idea which of it they could delete so they could continue production. And you can see a real loss in productivity of research teams, because everyone's spending time just trying to keep track of their data. And we saw quite quickly that there was a real difference in IT usage between groups who had spent some time looking at data tracking and data management and those who hadn't: groups who were in theory doing the same research were using half the amount of compute and disk space, because they were only computing what they needed to compute and only keeping what they needed to keep. So this quickly became our problem, because we definitely wanted to get away from lots of individual groups reinventing data management infrastructures which couldn't talk to one another, so we took it upon ourselves to come up with a system that our researchers could live with. So we've implemented a data management framework using a product called iRODS, which has come out of the University of North Carolina at Chapel Hill. It's basically a Mark II version of a product called Storage Resource Broker, which has been around in the high energy physics community for quite a while; these are guys who have been dealing with big data probably longer than most of us and know what works and what doesn't in that space. What iRODS is, basically, is a smart archival system: it's a way of storing data on pretty much any bit of storage you care to imagine. It's storage agnostic; the underlying store can be a database, a file system, an Amazon S3 bucket. And then it has a metadata catalogue which allows you to store metadata about each file, and this is where the power comes from: you get metadata that researchers can query. The system itself is scalable, it's been designed to deal with big data, so it can deal with millions of objects. It knows about data replication, if you want to replicate data for disaster recovery purposes, which is very important if you're trying to build an archival system that's going to be around for a while. It also has a rules and policy engine that allows you to do things to certain types of data based on their content or their metadata.
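As a hedged illustration of what that metadata-driven storage looks like in practice, the sketch below shells out to the standard iRODS icommands (iput and imeta) from Python. The zone path, attribute names and values are invented for the example and are not the Institute's actual schema.

```python
# Sketch of registering a file in iRODS and tagging it with queryable metadata
# via the standard icommands. Paths, attribute names and values are invented
# for illustration; they are not the Sanger schema.
import subprocess

def icommand(*args: str) -> None:
    """Run an iRODS icommand and raise if it fails."""
    subprocess.run(args, check=True)

local_file = "lane_1234.bam"                          # hypothetical output file
archive_path = "/exampleZone/archive/lane_1234.bam"   # hypothetical iRODS path

# Put the file into the archive, verifying a checksum as it goes in.
icommand("iput", "-K", local_file, archive_path)

# Attach metadata describing where the data came from.
icommand("imeta", "add", "-d", archive_path, "study", "example_study")
icommand("imeta", "add", "-d", archive_path, "sample", "example_sample_01")

# A researcher can later find every object for that sample with:
#   imeta qu -d sample = example_sample_01
```

Policies such as replication or checksum checks would sit in the server-side rule engine rather than in client code like this.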
One of the features that really interested us is the fact that you can federate it: if we want to share datasets with other institutions, or merge datasets that sit in other institutions, we can join different instances together into a single namespace to give users a single point of access to numerous different data stores. This is what it looks like from one of the web interfaces. It's a fairly standard directory hierarchy listing of files, but the magic is that you can say, tell me the metadata for this file, and it will tell you what experiment actually generated this file, what the sample number is, how it links back into our laboratory tracking system, all things that researchers can use to find the data that they actually want. We've implemented one of these iRODS archives for our sequencing archive, so this is basically the final resting place of all data coming off our pipelines. We have 800 terabytes of usable space in the archive at the moment; it's actually replicated between two parts of our data centre, so there's actually twice as much. We have a very, very simple rule set: for every file that goes in, we checksum it, so we can check for silent data corruption over the period of storage, and we replicate it, so we actually do have two copies of the data. You can see the growth up here; it's fairly aggressive growth in the archive. Our researchers really like it; they're actually clamouring for us to build archives for them and put data in, because the benefits for productivity are very, very quick and easy wins. We really want to start pushing the boundaries of this system now. We have researchers who are keen to start looking at federation with other institutions that they're working with, and at doing really complex rule sets to govern their data flows as data gets pushed into the archive. So, again, an architectural diagram: two sets of storage again, various sizes of storage block. Different vendors come in at different price points as we incrementally add to the storage, and we can change storage vendors without having to pull the whole system apart; being storage agnostic is a really good feature that we've got a lot of value from. I'm going to wrap up now with just some thoughts about cloud and how that's impacting our operations. Despite talking a lot about agile systems, it still takes a long time to actually bring hardware into a data centre, cable it up and do stuff with it; it's easily our longest lead time when it comes to putting new systems in. Cloud is an obvious solution to this problem, and we're pretty cloud agnostic: we use cloud where it makes sense. We have some misgivings about cloud for big data, and the major one is that it just takes a long time to get data up there. Certainly at the moment we sit on Janet, like a lot of you, and if we want to upload data to Amazon there is not a particularly fast link between us and Amazon, so shovelling data up there is pretty inefficient. Having cloud providers who sit on Janet and are part of the UK education and research networks is obviously a very interesting thing that has certainly piqued our interest.
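To put rough numbers on why shovelling data up to a cloud provider is slow, here is a quick illustrative calculation; the data volumes and sustained link speeds are assumptions for the sake of the example, not measured figures.

```python
# Rough illustration of bulk-upload times to a cloud provider.
# Volumes and sustained link speeds are illustrative assumptions.

def days_to_upload(terabytes: float, gbit_per_s: float) -> float:
    bits = terabytes * 8e12                  # decimal TB -> bits
    return bits / (gbit_per_s * 1e9) / 86_400

for tb in (10, 100, 800):                    # e.g. one project up to a full archive
    for link in (1.0, 10.0):                 # sustained Gbit/s actually achieved
        print(f"{tb:4d} TB over {link:4.0f} Gbit/s: "
              f"~{days_to_upload(tb, link):6.1f} days")
```

At a sustained 1 Gbit/s, moving an 800 terabyte archive takes on the order of two and a half months, which is why a fast path onto the provider, or keeping production in-house, matters so much.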
There's a lot of fear, uncertainty and doubt thrown around about data security in the cloud. I don't think it's a particularly big issue: cloud isn't any different from any other bit of IT infrastructure, so if you have important data then you should have processes that allow you to manage the risks and access controls, whether that's in your own data centre or somebody else's data centre. It may be that if you want really secure data, cloud is the right place for it. Most research networks are fairly open, and ours is. If we want to bolt a secure area onto that, it's probably much easier to start on a greenfield site out in the cloud, where by default everything is locked down, rather than trying to carve off bits of our own infrastructure, where security was probably not the overriding concern. Ultimately it's an economic decision: for most of our big data operations today it's cheaper for us to do production in-house. However, this is a purely economic decision; if those prices change, or when we fill up our data centre and have to contemplate a big capital outlay to build another one, cloud may well be the right solution for us. I'm really going to wrap up there. The key points that I'd really like to get across are that if you're getting into big data, then being modular, so you can start small and build up over time, is really important; and you really have to have a handle on what data is important and what data is not. Being able to make the decision to throw data away is perfectly reasonable, but in order to do that successfully you really have to engage with your researchers on a day-to-day basis, to make sure that what you're doing and what they're doing all lines up. I'd just like to thank the team at the Sanger Institute: Phil Butcher, our director of IT, who basically gives me and my team enough rope to go and hang ourselves with, so we can go and look at these interesting problems, and all the informatics teams and the various scientific research groups who have really helped to make our IT infrastructure a success. Thank you. Thanks very much, Guy. Any questions for Guy? Down the front here. Thank you very much, Dr Coates. You said something very interesting about data triage, which was that you are quite happy to throw data away. I wonder how you make the decision between performing continuous analysis on new data that's coming in and actually saying, you know what, we've got enough data, I'm going to do some new analysis on data that we've already kept. So, nobody goes back. For our particular workflows no one ever goes back and looks at all the data and re-analyses it. They say they want to, but in practice the newer sequencing technologies tend to bring pretty good improvements in data quality, so quite often the preference is, well, the new data is better.
Now there are some cases where that may not be possible. If you have samples that had to be ethically consented, you may not be able to go back and re-sequence them again. Or there are really weird edge cases, like the Neanderthal genome project; that wasn't done by us, but the people who did it only had a very small amount of fossil DNA, which they only got to sequence once, so they kept everything. But for most of our stuff, using the new data and analysing that is probably better. And people do go back and look at the older data, but it's not the entire data set again; they've made decisions about it. Take quality scores: for every base that you get off a sequencing experiment there's a quality score that says how confident we are in it. Do you really need that to five significant digits, or do you only need good or bad? Perhaps we'll just get away with good or bad, and that's good enough. So that's the process we're going through, but there's about a 12 month lag time between new data appearing and people really understanding the error models, so that we're confident to say we really don't care about this subset of the data, we only care about this other part. Lisa has one quickly from Twitter. So the question was, with iRODS, how much of the metadata is created automatically and how much do you have to create by hand? Our data basically goes through that sequencing pipeline and the metadata gets put in automatically: the sequencing pipeline knows who created the experiment and what sample it is, so that's what gets put in, but you can go back and manually add things afterwards if you want to. Okay, thanks very much, Guy. Can we just say that