It's lovely to be here today to talk to you a bit about some of the work that we've been doing in the UK for the microbial community using OpenStack. I'm Tom Connor, a senior lecturer at Cardiff University, so I'm a member of academic faculty there, and I'm one of the academics involved in the project I'm going to be telling you about today. Before I get started, I just want to flag up the people that are involved. I'm the only person from CLIMB who's traveled out for this meeting, but there's a cast of many who have been involved in getting the project up and going and making it happen. I'm a bioinformatician, so with Nick Loman I designed and have led the implementation of the system, but we couldn't have done that without a significant number of people on the technical side, who we've worked very closely with to bring the project to fruition.

As a bit of an overview: I'm going to talk a little about microbial bioinformatics and why we might care about developing infrastructure for microbial work. I'm going to talk about this concept of biological silos; if you're not a biologist and don't know what the field's like, that will give you a sense of where the problems are. I'm going to introduce genome sequencing for those who aren't familiar with it. Then I'm going to talk about how we're using the cloud as a solution for some of the problems I'll outline, and finally a bit about CLIMB and some of its key features.

So first, what are bacteria? I'm pitching this just in case there are people in the room who don't know. They're tiny, microscopic organisms that make up over 50% of the mass of life on Earth. If you were to take all life on Earth and put it on a set of (presumably enormous) scales, the bacterial mass would considerably outmass everything else combined. So bacteria are the majority of life on Earth. They're fundamental to life; you find them everywhere, in every environment. There are more bacterial cells in your body than there are human cells. We rely on them in every possible way. They're enormously adaptable: they have very plastic genomes, they can swap genes in and out, and they can share genes between different bacterial species, which means they can adapt very rapidly to changing conditions. Their genomes range in size from a few hundred thousand base pairs up to around 10 to 12 million base pairs. There's a lot of variability within bacteria, and even within a species you have a variable genome. If you took the genomes of everybody in this room, we would all have the same set of genes; you might have different versions of each of those genes, but the set would be the same. If you were to take the E. coli out of the guts of every person in this room, they wouldn't have the same set of genes. There'd be a few thousand genes in the E. coli genome shared between all of them, but then there'd be several thousand more genes present in one or two people here, but not everybody. We can examine bacterial genomes by sequencing them, using technology I'll talk about in a moment. And because DNA mutates at a relatively constant rate, we can take multiple bacterial samples, sequence their genomes, and then compare them.
Because DNA mutates at this relatively constant rate, at the rate of a molecular clock, we can infer how related samples are to one another from the changes present between them. And because genes encode proteins, and those proteins confer some sort of function, we can look at the proteins encoded within a genome and infer something about what the bacteria are actually doing and how they might respond to things like antibiotic treatment. In terms of what I do, I work almost exclusively on disease-causing, pathogenic bacteria. The picture on the side here just emphasizes the scale of bacteria: it's a pathogen that I work on called Shigella flexneri, and in the background is a human embryonic stem cell. The Shigella is invading that stem cell, and that's how Shigella causes disease: it goes inside the cells that line the gut and causes inflammation, which then causes dysentery.

So why care? Diarrheal disease in numbers. You might be wondering how bacteria relate to the cloud, and why you might care about ensuring bacterial research is funded. Just a couple of numbers: 200 million people on planet Earth have gastrointestinal disease at any point in time. If you were to visualize that, they produce about 60 million liters of diarrhea a day, which is equivalent to all the water passing over Victoria Falls in one minute. So if you don't remember anything else in this talk, you will remember the waterfall of diarrhea. To take that a step further, if you were to stand next to Victoria Falls and watch it for six hours, you would have seen the volume of diarrhea produced by gastrointestinal disease in the human race in a year. So it's a lot of diarrhea.

The funny thing is that in developed countries we often laugh about things like diarrheal disease. When I was a student, you'd maybe be on your way home after a night out, there'd be a kebab van somewhere nearby, you'd have the kebab, and you'd be rather ill the next morning. And that would be an inconvenience, because we're well fed, we're healthy, and we have a pretty good healthcare system nearby. Those 200 million cases translate to about two billion cases of disease every year worldwide. And the serious bit is that 5% of all deaths in low- to middle-income countries are due to diarrheal diseases. So while we in the West can be a bit blasé about it, it's actually a killer that affects a large number of people, and it's mostly children that are affected. It's a serious issue. We do have gastrointestinal pathogens that cause tens of thousands of cases of disease in places like the US and the UK, and they cause many deaths as well. Gastrointestinal pathogens are also one of the key reservoirs for things like antimicrobial resistance. So these are big, big problems, and there are many other pathogens; this is just one example of why we should care about bacteria.
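To make the molecular clock idea above concrete, here is a toy sketch (mine, not from the talk; the sequences and isolate names are invented) that counts differences between aligned genome fragments and uses those counts as a crude measure of relatedness:

```python
# Toy illustration of molecular-clock-style comparison (invented data).
# Real analyses align whole genomes and build phylogenies; this simply
# counts single-nucleotide differences between pre-aligned fragments.

def snp_distance(seq_a: str, seq_b: str) -> int:
    """Count positions at which two equal-length aligned sequences differ."""
    assert len(seq_a) == len(seq_b), "sequences must be aligned"
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)

# Three invented aligned fragments from three hypothetical isolates.
isolates = {
    "isolate_1": "ACGTACGTACGT",
    "isolate_2": "ACGTACGAACGT",  # one difference from isolate_1
    "isolate_3": "ACCTACGAACTT",  # more diverged
}

# Pairwise distances: because mutations accumulate at a roughly constant
# rate, fewer differences implies a more recent common ancestor.
names = list(isolates)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(a, b, snp_distance(isolates[a], isolates[b]))
```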
So, as I mentioned, bacterial genomes can be sequenced, and the way that's worked over the last few years is this. In the bad old days we had a thing called the ABI Sanger sequencer; when the Human Genome Project was started in the late 1980s, this was the instrument used to sequence the human genome, and doing one single human genome took about 15 years on those old instruments. In the mid-2000s a set of new sequencing technologies came on tap, the Roche 454 and the Solexa/Illumina, and then we got newer-generation approaches like Ion Torrent, PacBio, and nanopore. What that's done is change the scale of what we can do. On an ABI sequencer, you'd do about 96 fragments of DNA, each up to about 1,000 base pairs, and you would run that over a few days; then you'd take those 96 reads and try to tie them back together to make a contig, or a set of contigs. With a modern Illumina instrument, that's now up to 500 million reads per run, so you can see the increase there. And Illumina instruments aren't that expensive; most universities have at least one. So since the mid-2000s we've seen an explosion in the size and scale of the sort of questions we can answer.

Fundamentally, what's happened is a decrease in the cost of sequencing. In 2003, sequencing a human genome cost 2.7 billion dollars; that's 2.7 billion dollars visualized there as bales of dollars on forklift pallets. In 2016, we can do it for $1,000. So we've gone from an amount of money you could only carry around in a shipping container to one that, if you're better off than I am, you could carry around in your wallet. And obviously, for me, humans are relatively boring: for the same money, we can sequence around 50 bacterial genomes. The sequencing is cheap, it's relatively rapid, and we can do all sorts of fun things with it.

So, biological silos. This is a lab, a picture from Wellcome Images of a laboratory. When you're a PI, a researcher in a university, especially in biology, the lab is everything. You come into a university as a newly minted academic and you are given a lab space, and into that lab space you buy equipment: a PCR machine, maybe equipment for flash chromatography, maybe a fancy microscope. That sits in a physical space that is yours and has your name on it. Each group is led by a PI, a single person who's in charge, and your name is on that door; your name is on the risk assessments; you are the line manager for the people who work within that group. My group members work for me as the PI. Each group generates its own grant income: when I put in a grant application to a grant-awarding body, I am generating income for the university, but that income comes effectively to me, because it comes onto a budget line that only I can spend. Each group has its own physical space and its own equipment, and it's recognized and understood by the university; universities understand groups as these sorts of organizational units. What that means, to my mind, is that an academic group is more like an SME. We're part of a large, multi-hundred-million-pound organization, but day to day our functioning is much more like an SME. My concerns as a PI are: where's my next grant coming from? How am I going to keep the people who work for me employed? How am I going to generate my research outputs? So it's much more like an SME than being part of a larger organization. And this view translates into bioinformatics. That's not my machine room, but it could be. All of my work is underpinned by computational resources.
We generate vast quantities of sequence data, and then we have to do something with it, and that requires quite hefty amounts of computational capacity. When I arrived at Cardiff, I came from a place called the Sanger Institute, just outside Cambridge, which sequenced one third of the human genome during the Human Genome Project. When I left, they had around 16,000 cores for a total staff of about 900 people. It was freely open to all, you had huge amounts of storage; it was basically limitless compute. I arrived at Cardiff, and what we had was a server in a cupboard for my department of 120 academics. So I suddenly had to work out how I was going to do my computation-intensive research in this new place. I had years' worth of scripts and software that I'd written on the Sanger system, which were now incompatible with my new environment, and I was completely on my own, because I was the only bioinformatician in a department which didn't really understand bioinformatics. So I did what most academics do, which is I built my own system. And that's not at all unusual: quite often when we put in grants, we'll put in money for a server or for some compute, and because of the way the budgeting works, that budget line is then mine. So instead of handing my money off to a central university IT resource, I buy my own server with it.

And this isn't an unusual story. Last autumn, Nick Loman, one of the collaborators on CLIMB, and I did a Twitter survey where we asked bioinformaticians where they did the majority of their work. The interesting thing for OpenStack people is that over half of the work was done either on a local resource, i.e. a server in a cupboard, or on a personal computer; the cloud is almost completely unutilized. There are a lot of reasons for that, and that's something we could delve into, but fundamentally, at the moment the cloud isn't used, and bioinformatics is done in a way that's not really reproducible and not easily shareable. So we come to this problem; I call it the sequencing iceberg, but it's really a big-data iceberg. We can now generate data very cheaply and very rapidly, but our main costs as researchers are in the bit that most PIs don't see: the informatics expertise, the reproducibility of our software, how easily we can share our data, and the compute and storage capacity we need for that, all underpinned by that expertise.

So we come to OpenStack. When we sat down to plan CLIMB about two years ago, we thought there was a better way of doing things, and that was to provide bioinformatics infrastructure on demand: providing systems that researchers could use for doing their research, pre-installed with the software and packages they want to use, and providing an environment where people could share images, share pipelines, and share data, all on the same infrastructure. We thought OpenStack was quite a good system for that. Knowing the way academics work, OpenStack enables us to add hardware pretty simply and cost-effectively, because we can buy capacity in larger chunks. And critically for us, it means we can meet multiple use cases in a single system.
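To give a flavor of what "infrastructure on demand" can look like in practice, here is a minimal sketch using the openstacksdk Python client. The cloud entry, image name, and flavor name are invented placeholders, not CLIMB's actual configuration:

```python
# Minimal sketch: spin up a pre-configured bioinformatics VM on an
# OpenStack cloud. The cloud entry ("climb"), image ("gvl-base"), and
# flavor ("climb.user") are hypothetical placeholders.
import openstack

# Reads credentials from clouds.yaml or OS_* environment variables.
conn = openstack.connect(cloud="climb")

server = conn.create_server(
    name="my-analysis-vm",
    image="gvl-base",      # an image pre-installed with analysis pipelines
    flavor="climb.user",   # a modest flavor for routine genome analyses
    wait=True,             # block until the instance is ACTIVE
)
print(server.name, server.status)
```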
If I'm a postdoc who's got a small number of bacterial genomes to analyze, I probably need quite a small virtual machine without huge amounts of RAM; we're talking maybe 64 gig of RAM and eight cores. Whereas if I've got a large metagenomics data set, I'm going to need terabytes' worth of RAM and hundreds of cores. So there's a wide range of use cases within microbiology. It also removes the need for closet IT systems: instead of these horrible servers in closets, with OpenStack we can provide somebody with a nice siloed virtual server that's theirs, that's got their lab group's name on it, and it removes the need for them to manage their own server locally. And for me, the nice thing is it really simplifies the process of installing and maintaining software. If we want to share software, share pipelines, or have a core set of analysis packages that we use, doing it via a cloud system is much, much simpler than having to reinstall a system every time we want to change something.

The other major advantage for us was that at the moment you have a situation in biology where data is very frequently shared but very infrequently reused, and it's the same with software. Go to most publications that mention a bioinformatics approach and you'll find a link to GitHub; you follow the link and find there's a load of dependencies that aren't really specified, there's no documentation, and you can spend a week trying to install the software and still not get it to work. That's a huge problem for software reuse, because if you've got a paper with a set of data you can't get access to and a set of software you can't install, then by definition you can't reproduce it. So we think having a single environment that brings together storage and compute through a relatively simple interface is a real way to get over this reproducibility issue.

It also simplifies training. The thing that a lot of people in IT don't understand is that biologists are often extremely poorly skilled in IT and have no interest and no desire to learn. If I've got a postdoc in the lab who spends 99% of their time sitting at a bench pipetting things and doing stuff like PCR and sequencing in a laboratory, then for that 1% of the time where they need to sit down and actually analyze the data, they're not going to remember the training course they did on how to access the command line 18 months ago, because they're not using it every day. The skill fade is just too great. In bioinformatics we expend a lot of time and effort trying to train people up, missing the fundamental truth that a lot of biologists don't have the time to refresh their training every time they need to use a system, because they use it so infrequently. The advantage we have here is, first of all, that we can simplify the training. One of the classic problems with training is that if I go off to the Sanger to do a bit of training on one of their great advanced courses, when I come back to Cardiff I find we've got a completely different system installed, or they won't let me install the operating system they used at Sanger, and I suddenly can't translate that training as easily; within a single environment we can really simplify that. But that alone still doesn't get past this local problem, and that's a critical issue.
The other thing that really underpinned CLIMB from the start was the fact that, as microbiologists, we often represent a much smaller portion of much larger departments that are concerned with things like human genetics. In the School of Biosciences I think there are about five microbiologists out of 120 academics, and so quite often, when a system gets built for an institute or a department, the microbiology needs will be an afterthought, a small portion of the overall whole. And that's the problem no matter where you are: there's a big issue for microbiologists, in that we often don't get a fair slice of the pie at an institutional level, simply because stuff's done locally. No local group is likely to be of sufficient size, in terms of income or importance, to justify having a single system of its own. But there are a lot of microbiologists across the UK, thousands of us spread across most universities, and they all need these resources to do their work.

So the idea behind CLIMB was to create a one-stop shop for UK microbial bioinformatics: a system which is what I call a public private cloud. It's public in the sense that if you're a UK academic or part of UK government, you can use it for free; it's private in the sense that if you're not in one of those groups, you can't really. We've already got a set of standardized cloud images that implement key pipelines. Providing a simplified set of pre-configured images that people can just spin up and use is really, really great for breaking through this training issue and through the siloization issue. And that's all underpinned by a storage repository, which gives us a way of allowing people to share images they've created, but also to share data within the system. I guess the vision that we've got ultimately is an eduroam for microbial genomics: a place where you don't have to worry about where it is you're actually signing in, but you can get access to your data no matter where you are.

Additionally, we're providing access into other databases from within the system. One of the problems, if you've used things like the European Nucleotide Archive, is that the ENA is great for storing data, but actually fetching hundreds or thousands of bacterial samples to put into your analysis is very, very difficult when you're downloading via FTP. So we're going to be mirroring some of these databases locally, the entire database, and then presenting that out so people can very simply move data in and out. And again, that's another thing we're doing using OpenStack.

So, the system itself: four sites, four universities, each university with a system connected over Janet. We're very lucky in the UK to have this academic network, which is extremely fast and enables us to move large amounts of data relatively easily. Different instance sizes are available, but these are different from standard OpenStack flavors in the sense that they've got a lot more memory than you would normally get. We're able to support over 1,000 virtual machines simultaneously; that's actually limited by the number of external IP addresses we can get from the universities rather than by any hardware constraint, and we're not doing oversubscription on this.
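To illustrate those memory-heavy flavors, here is a hedged sketch of how flavors of roughly the shapes mentioned in the talk might be defined with openstacksdk. The flavor names, disk sizes, and core counts are illustrative assumptions, not CLIMB's published flavor list:

```python
# Sketch: defining flavors of roughly the shapes described in the talk.
# Flavor names, disk sizes, and vCPU counts are hypothetical.
import openstack

conn = openstack.connect(cloud="climb")  # hypothetical clouds.yaml entry

# Small flavor for routine single-genome work (~64 GB RAM, 8 vCPUs).
conn.compute.create_flavor(
    name="climb.user", ram=64 * 1024, vcpus=8, disk=120,
)

# Fat flavor for metagenome assembly (multi-terabyte RAM, many cores).
conn.compute.create_flavor(
    name="climb.metagenome", ram=3 * 1024 * 1024, vcpus=64, disk=1000,
)
```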
We have about seven to eight petabytes of object storage across the four sites, which gives us two to three petabytes usable once you take erasure coding and replication into account. We also have 400 to 500 terabytes of local high-performance storage, GPFS, per site, and that's for scratch space, for mirroring databases, and, initially, for providing a place to spin up VMs. Our vision is a single system with a common login that uses federated credentials, so that anybody with a .ac.uk email address can log into our system. And I guess the special sauce, in terms of what we're doing, is between-site data replication, which means that if we lose one of our sites, you can still access your files no matter where you are. I've heard a few people say that losing a site is unlikely, but most of the universities in the group have had issues in the past few years with a data center being down for an extended period: we had three weeks of maintenance just before Christmas the year before last at Cardiff, when the whole data center was shut down to replace the power systems. These things do happen, and by having four sites we get the resilience of being able to just spin up an instance somewhere else. We designed the system so we can add extra nodes at other universities, so other universities can join us within CLIMB and share their data and capacity as well. And the critical thing for us is that it's academic-led and focused on the community, because we're all part of that community and we're quite well connected within it. We couldn't have done it without the collaboration we've had from our local HPC teams.

The system itself: we're running OpenStack Kilo at the moment. We procured the hardware in a three-stage process. The procurement itself was pretty difficult because we had a very short timescale: between all the contracts being agreed between the universities, we had six months to do the tender, buy all the hardware, get it installed, and have acceptance testing completed. That was made even harder because, by definition, we're designing a system for a set of unknown use cases: we don't know what people are necessarily going to want to use it for, or what's going to be the most popular thing. And even things like acceptance testing: how do you design acceptance testing for a cloud system when you only know about HPC? That was a particular challenge for us. The way it fell out, IBM and OCF, one of the integrators in the UK, provided the compute, and then we have Dell and Red Hat providing the storage. All our networking is provided by Brocade, with the exception of our Mellanox InfiniBand network. We didn't name a vendor, but from pretty early on it was clear that OpenStack was going to do what we wanted it to do, and so we were pretty sure it was going to be an OpenStack route that we were taking.

Per site we have two router/firewalls, commodity x86 hardware running Vyatta software, apparently capable of routing about 80 gigabits each; three OpenStack controllers; 21 64-core, 512-gigabyte-RAM compute nodes; and then three extremely fat nodes with three terabytes of RAM each, which allow us to do some analyses that simply wouldn't be possible using our current infrastructures.
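On the storage numbers quoted at the start of this section, a quick back-of-the-envelope check (my arithmetic, with generic overhead factors rather than CLIMB's actual pool settings) shows how 7-8 PB raw lands in the 2-3 PB usable range under replication:

```python
# Back-of-the-envelope: usable capacity under replication vs erasure coding.
# The raw figure comes from the talk; the overhead factors are generic
# defaults (3x replication; a 4+2 erasure-code profile), not CLIMB's settings.
raw_pb = 7.5                      # ~7-8 PB of raw object storage, four sites

usable_replicated = raw_pb / 3    # three copies -> one third usable
usable_ec_4_2 = raw_pb * 4 / 6    # 4 data + 2 coding chunks -> two thirds usable

print(f"3x replication:     {usable_replicated:.1f} PB usable")  # ~2.5 PB
print(f"4+2 erasure coding: {usable_ec_4_2:.1f} PB usable")      # ~5.0 PB
```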
We have around 500 terabytes of GPFS at each local site; GPFS runs over InfiniBand, and we have failover onto the 10 gig if there's an InfiniBand issue. Then we also have, per site, around two petabytes of Ceph: 27 nodes with 64 terabytes of raw storage each, the Dell R730xds, and that just runs across the 10-gig backbone.

So, everybody's favorite: network topology. We use both InfiniBand and Ethernet, and that's for a couple of reasons. We're co-located with the HPC systems on each site, and most of those HPC systems run on InfiniBand, so if we wanted to be able to provide capacity for the HPC systems, maybe using Ironic, we'd really need IB to do that. The other thing is that the guys at Warwick and at Birmingham, who are experts in GPFS, told us that InfiniBand was better, so we made sure we got it. OpenStack itself, however, pretty much just runs over the 10 gig. We have a Brocade VDX fabric, and the fabric is pretty good. We've had various promises from vendors over the last few years, a lot of which haven't really panned out, but the fabric is actually very, very good: the VDX has worked very nicely and does pretty much what it says on the tin. Unfortunately the Vyattas, which are also extremely expensive, weren't so good. We've had various excuses for why that is. I'm told, since I've been at Birmingham, that the newer version of Vyatta is actually working as expected, but that's taken 18 months from the point of purchase for it to be usable for us, which is a bit annoying. We're using Neutron throughout. We had some problems initially, but the stability over the last six to nine months has been much better. We have found that our network problems have been the hardest to fix: most of the other issues we've had have been relatively minor, but it's the network issues that have caused problems throughout. Many of these are reported in the bug-tracking sites, but the issue we find quite often is that when you've got a non-specific network bug, tracking down a bug report for that particular issue, for the particular version of OpenStack you're using, is non-trivial. So we've spent quite a lot of time hunting down bugs and solutions on the internet for some of these things.

So, our compute. The key thing for us was knowing our workload. We have some extremely large, complex data sets. When we're looking at bacterial populations, say from a fecal sample, where you have 10^10 bacterial cells per gram and thousands of bacterial species, if you sequence that you get hundreds of gigabytes' worth of data at the other end, and the complexity in there is such that you need a lot of RAM to process it in a reasonable way. On the side here I've got a screenshot of top with an assembler called Ray Meta running an assembly; I think it's using 1.8 terabytes of RAM quite happily, and that's one of the 3 TB RAM machines there. So we need to be able to scale up to large-RAM, single-core jobs. Bioinformaticians are generally not very good at writing software; we write software pretty badly as a rule, and so often the software is pretty inefficient. But if it's the only tool you've got to do the analysis you need, and you're in competition with groups all over the world to publish your data first, you use the tool that's available now; you don't spend six months re-implementing it.
We're currently set up using Regions, which is a source of some pain. When we started (and it's actually still the case) the Cells documentation was a little confusing: if you go to the documentation, it says don't use it in production, and if you're a suitably conservative university sysadmin who knows what it's like to have an academic shouting at you, you say we're going to do Regions and not Cells. So our issue when we started was that it looked like Regions were the way to go for multiple sites, and actually I think it would have been better if we'd gone with Cells. We will be moving to Cells in the next 12 months, so that's why it's interesting to be here and hear some of the discussions about Cells v2.
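One practical consequence of that Regions setup is that each site appears as a separate endpoint in the service catalog, with its own independent scheduler. Here is a hedged openstacksdk sketch of the client-side view; the cloud entry and region names are invented placeholders:

```python
# Sketch: under a Regions deployment, each site is a separate entry in the
# Keystone catalog and clients pick a site by region name. The cloud entry
# and region names here are hypothetical placeholders.
import openstack

for region in ("SiteA", "SiteB", "SiteC", "SiteD"):
    conn = openstack.connect(cloud="climb", region_name=region)
    # Each region schedules independently; there is no single scheduler
    # with a view across all four sites, unlike a Cells deployment.
    hypervisors = list(conn.compute.hypervisors())
    print(region, "hypervisors:", len(hypervisors))
```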
We still have a number of issues with the compute as well. We've had some performance hits on the large-memory machines because of NUMA-related issues. And in Cardiff we have a slightly different network setup: it's all copper cabling, so we use 10GBASE-T, and IBM supplied Broadcom cards with that, and those have caused no end of problems. We had an issue on the controllers where, as soon as we put the controllers under any sort of load, the cards would just fall over straight away, you'd get a kernel panic, and the controller would reboot. And the failover worked brilliantly, right? The first controller would go down and it would fail over to the next one; then that one would go under heavy load and crash, and it would fail over to the next one, and that would go under heavy load and crash as well; and then we'd have all three controllers failed and the MySQL database not coming back up, so we'd lose the system. We did that quite a few times before we finally worked out where the problem was. It's fixed now, but it's a concern, and it's something we think might be causing issues elsewhere on the system as well.

Our storage has two elements: local scratch, which is our GPFS, and replicated object and block storage. The split actually made our acceptance testing easier, because we could buy the GPFS with our compute and have a place to spin up VMs straight away, all put in by the same integrator, so we could get past the acceptance testing; then we spent a bit longer getting Ceph properly configured to run with our system. It also fits our future needs: what we would like is a reasonable amount of local scratch space for storing the sorts of large files we might not want to pull down over the internet (GPFS is really great for that, as soon as Manila's working properly), with Ceph replicating between sites and taking over our block storage. At the moment Cardiff is still running with GPFS for spinning up instances, but both Birmingham and Warwick are now using Ceph entirely for block and object storage, so we're gradually moving that over ahead of our formal launch.

We've had issues with hardware here as well. We had issues with some of the Dell hardware initially, around the RAID cards and how they interacted with Ceph, and we've also had some issues with GPFS and the IBM hardware. In addition to the Broadcom issues, we've had a few issues with our InfiniBand setup, and we also initially had a problem because we picked the wrong block size with GPFS: when we set GPFS up in Birmingham, we picked a block size that was way too large, so when somebody got onto the system and started generating huge numbers of small files, the performance tanked and we had to rebuild the file system.

So CLIMB is more than just a system; it's a key component of our research infrastructure. What we're doing is coupling together not just our compute and our storage but also sequencing, plus a training component, and it's actually OpenStack that really makes that possible, because bringing together the compute and the storage gives us the flexibility to work around it, and phenomenal flexibility for our quite varied workloads. And critically, we're eating our own dog food: Nick and I have both moved our research onto the CLIMB system, it's where we do all our computational research now, it works for us, and we've got a vested interest in making sure the service remains effective.

On coupling our sequencers and object storage: it simplifies the analysis pipeline for our users. This is a screenshot from a sequencing service run out of Birmingham called MicrobesNG. With this service, people send in bacteria and the MicrobesNG team sequences them; when the data comes off the sequencer it's picked up, pushed straight into Ceph following a small bit of pre-processing, and is then available as an object in Ceph to be imported directly into CLIMB. So we've got a complete pipeline within a single environment, where somebody can go from sending in some DNA to having some quite useful, already pre-processed results, all running on top of OpenStack.
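Because the object store speaks S3 (via the RadosGW gateways mentioned in the Q&A below), a pipeline step like MicrobesNG's "push reads into Ceph" can be as simple as the following sketch; the endpoint, bucket, and key names are invented for illustration:

```python
# Sketch: pushing sequencer output into a Ceph RadosGW (S3-compatible)
# object store so it can be imported into an analysis VM. The endpoint,
# bucket name, and object keys are hypothetical placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://objects.example-climb-site.ac.uk",  # RadosGW endpoint
    aws_access_key_id="EC2_ACCESS_KEY",       # per-user EC2-style credentials
    aws_secret_access_key="EC2_SECRET_KEY",
)

# Upload pre-processed reads as they come off the sequencer...
s3.upload_file("run42_sample7_R1.fastq.gz", "microbesng-runs",
               "run42/sample7_R1.fastq.gz")

# ...and pull them down later inside a CLIMB VM for analysis.
s3.download_file("microbesng-runs", "run42/sample7_R1.fastq.gz",
                 "/scratch/sample7_R1.fastq.gz")
```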
The other thing we're doing is shamelessly stealing from, or rather borrowing and working together with, the VLSCI in Melbourne, to move their Genomics Virtual Laboratory over onto CLIMB. The issue that you don't get past with OpenStack as it stands is that Horizon is a pretty horrible interface to hand a biologist and say: go and create yourself a virtual machine. The guys in Melbourne have overcome that with this system called the GVL. The GVL is pretty simple: you have a form where you paste in your EC2 credentials, you hit launch, that interacts with OpenStack, spins up an instance, and gives you a website; if you want to go and have a look at the IP address on there, the website's still up and running. You go to that website and what you've got is a personal research gateway: it's running things like Galaxy; it's running CloudMan, which gives you a scalable cluster on the cloud; it's got an Ubuntu desktop; it's got an IPython notebook with JupyterHub on there; it's got RStudio; and it's got SSH into a system pre-installed with a load of pipelines.

What this does, in a single step, is provide for about 95% of the biological users that might conceivably use CLIMB, and it's a great example of what OpenStack can do: the system sits here interacting with the APIs, the user never has to see anything complex, and it's a simple case of entering your credentials and it spins it up for you. The nice thing is that NeCTAR runs on OpenStack as well, so transitioning this over from NeCTAR to CLIMB took a couple of days, and that was it. So it's pretty good.

We have a set of continuing challenges. We've got federated access issues: we're trying to get our federated access working with Shibboleth. We have VM scheduling issues: the Birmingham cloud is now at 100% utilization, and we run into issues when people try to schedule instances, where either VMs don't get scheduled onto nodes that look available, or you get nonsensical error messages that nobody can understand. We had some issues with storage configuration (our block size was wrong), and we've had some issues with large volumes as well that we're currently working through. Our networking problems are now mostly not to do with OpenStack; they're actually to do with our Vyattas. And now that the system is starting to get hammered, we're starting to see some issues come up as it goes under heavy load. That's compounded slightly by the complexity of OpenStack and Ceph, but now that we've got about 18 months of operational experience fixing our system when it breaks, we're starting to feel a bit more confident, and the system's a bit more stable. I think the improvements that have happened in Kilo help as well. And then our final challenge is the user experience of Horizon, which is a real issue even for quite seasoned bioinformaticians, who come to it, are really very confused by the whole interface, and don't really want to use it. And that's compounded by a lack of useful error information.

Our performance is generally very good. This is a bit of benchmarking we did on the system: ten bioinformatics workloads. In the middle we have Cardiff University's Raven, which is our HPC system, normalized to one; if a result is above one it's faster, and if it's below one it's slower. Properly configured, we see very little difference between our VMs and the bare-metal performance; there's actually one exception, which is an interesting one. All in all, we're quite pleased with the overall performance of our system. I mentioned the block size issue: the benchmark down here is actually CLIMB running with the wrong block size, and you can see that some workloads aren't really affected, but in others there are real drop-offs in performance. That really brought home to us how critical it was to get the block size right.
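For clarity on how numbers like those are normalized (my reading of the description above, with invented workload names and timings): each workload's bare-metal runtime is divided by its runtime on the VM, so the HPC baseline sits at 1.0 by construction:

```python
# Sketch: normalizing benchmark results against an HPC baseline, in the
# style of the plot described above. Workload names and timings invented.
baseline = {"assembly": 310.0, "mapping": 145.0, "variant_calling": 80.0}  # Raven, seconds
cloud    = {"assembly": 305.0, "mapping": 150.0, "variant_calling": 82.0}  # CLIMB VM, seconds

for workload, t_hpc in baseline.items():
    relative = t_hpc / cloud[workload]  # >1: VM faster; <1: VM slower
    print(f"{workload}: {relative:.2f} relative to bare metal")
```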
So where are we now? Our computational hardware is in place, and we've got over a hundred users already. We've got two modes of access: registered users have access to the Horizon interface, and we're adding our launch system, the GVL access, in the summer of this year. And just in terms of the sort of outputs we're generating: this is a Nature paper that was led by Nick Loman. What they did was this: Josh Quick, the first author on the paper, went out to Africa with these nanopore sequencers, about the size of a chocolate bar, and they sequenced Ebola in real time. The way that works is that you take the Ebola samples, extract the DNA, and run it through a nanopore sequencer, and then you upload the data to a cloud system, because there's no compute available locally. You analyze the data and you can start inferring things in real time about transmission and about the spread of the epidemic. That's, I guess, an extreme example of where this sort of technology can be really useful: if you've got a core infrastructure that provides the capacity for analyzing sequence data in real time, it means you can go to any part of the world, whether it be Africa for Ebola or Brazil for Zika, take an ongoing outbreak, and actually track it in real time using sequencing.

So we've got a pretty complex system with a lot of challenges, but we've got a very clear vision underpinning it: we know what the community wants, we know what we're trying to give them, and that's probably the most important thing. We've got various complexities around the fact that we're not a single site; we're built across four universities, with capacity to expand to more. And the key to everything we've done has been the collaboration between our technical team and the academic team with the vision for the project. We think we have huge potential, from providing resources for responding to outbreaks, as in the case of Ebola, through to long-term support of research groups: spinning up long-term instances that run for an entire group, replacing that server in a cupboard with a VM running on the cloud that they can be reasonably sure is going to be stable and is going to be there tomorrow. We think this is pretty much only possible using OpenStack at the moment; the complexity is the price you pay, I guess, for having a broad range of features that you can configure to do what you want. Looking forward, on the OpenStack side, our challenge is to engage better with others in the community. A lot of what we've been doing has been very siloed; we've been working amongst ourselves and with a few people in the UK, and one of the reasons I wanted to be here in Austin was to connect with some of the people on the developer side who might be interested in what we're doing and might have some solutions or interesting ideas for some of the problems we face. I think some of our use cases will push OpenStack, and I think maybe we will have some useful feedback for the developers as a result. So with that, thank you for your attention, and I'll leave you with the acknowledgments.

Questions.

Q: You mentioned VM scheduling issues. Could you elaborate on that? I didn't quite understand.

A: We had a few issues where we set the scheduler up with the filters we think should be in place, and then the system comes to schedule a VM and Horizon gives us an error saying it can't find available space, even when there are nodes with huge amounts of space available on them.

Q: So do you believe it's a configuration error or a software error?

A: We're not sure.
We think it's probably a configuration error, but the scheduler documentation is not the easiest to work through to figure out exactly where that might be. It's working out exactly where the issue is that's proving to be the problem at the moment.

Q: Okay, thanks. Can you mention what manpower, in terms of FTEs, is being used, and does this include the HPC guys?

A: We have four full-time sysadmins employed now, but most of the system was set up with two FTEs of sysadmin, plus all the time from myself and Nick to actually get the system going, and Simon at Birmingham. In terms of the main contributions, Simon at Birmingham and Marius have really been central to getting the system up and running into the state it's in now, and going forward we've just recruited another two sysadmins to provide user support as well as further support to the system. It helps that Marius is extremely experienced with Ceph, so he's taken over the setting up of Ceph. And the way we did the procurement was such that Birmingham had a framework agreement for buying HPC systems, so we actually procured the Birmingham part of the system first, which gave us six months for Simon to set up OpenStack, test it, and get to a configuration we were happy with before the main kit went in. So it's probably been about two to three FTEs over the last 18 months.

Q: A question on the model: are your Ceph installs at the different sites separate Ceph installations, or is it just one that has replicas at all four sites?

A: It's four separate Ceph installs, which replicate between sites via the gateways.

Q: Okay, so you have RadosGW replication?

A: Yeah, we do.

Q: Okay. And you mentioned a disaster recovery scenario where one site is down: you basically have the data available over S3 at the other sites, but not the Cinder volumes?

A: Yeah.

Q: Okay. And you mentioned the alternative dashboard used by bioinformaticians, where they put in their credentials and spin up VMs. That works in a single-tenant scenario, where everybody has just one network and they all basically have their VMs in the same network?

A: Yeah, the GVL just takes your user credentials, and you need certain settings: you have to have an external IP network and you have to have an internal network in there. As long as you've got that in your tenant, it will spin up an instance on that network.

Q: Okay, so you can specify additional networking details?

A: Yeah, but that's through the configuration on the launcher itself.

Q: Okay. And are you using any Docker containers to package your applications, to make it easier than having Glance images to share?

A: It's actually all done via Ansible. The GVL has a core image, and then you build the file system on top of that using Ansible. So it's just an Ansible playbook, effectively.

Q: Yes, but Ansible is going to install everything, so you might still run into dependency issues and...

A: It starts with the GVL base image, which is quite small and has got everything that's required on it, and then there's an Ansible playbook which is kicked off.

Q: And it installs the software every time?

A: Yeah.

Q: Okay, because we are actually using Docker containers on top of VMs, and this basically allows us to have no dependencies on the OS of the...

A: Yeah, the GVL is a special case, because the launcher only runs the GVL, so there aren't any other images on there yet.
There are different versions of the GVL that you can run through it, but it's a single launcher for a single result, effectively.

Q: How do you find the performance of the Cinder volumes? Are you using erasure coding for the Cinder pool?

A: We're not using erasure coding, just standard replication.

Q: With SSDs for journals?

A: Yeah.

Q: Okay. And it's fine?

A: Yeah, it's fine, until it flushes. Yeah, obviously.

Q: Okay, thank you.

All right. Thank you very much.