I think we'll start. So my name is Paul Calleja, I'm the director of research computing at Cambridge University, and today we're going to talk about our attempts to start using OpenStack in research computing, and specifically a biomedical platform that we're currently building. I can't look at the audience, I'm getting blinded by these lights. The talk is going to be in two halves: it's myself and my lead engineer, Wojtek Turek. The first half is more of an overview, an overview of what we do at Cambridge, of HPC, and of the biomedical cloud, and then Wojtek will give a more engineering-focused talk with details of the implementation and some recent examples of connecting storage within a research cloud.

Just some history. I think it actually says in my contract that whenever I give a talk I have to talk about Cambridge's history; it's almost de facto, and my boss will be watching. We have a long history of computing development at Cambridge, right back to the late 40s with an interesting system called EDSAC, up to the modern day with OpenStack. EDSAC was a very interesting system. That's Maurice Wilkes there, he's quite famous, and look how engineers used to dress in the 40s; my engineers do not dress like this today. It was one of the first stored-program electronic computers. Its memory was mercury delay lines, and that's the vat of mercury there that they're both staring into. The fastest supercomputer in the UK was in Cambridge in the 60s, and this is our machine room today, a much more boring-looking commodity machine room. Even the Raspberry Pi came to Cambridge, of course, and you can do clustering with a Raspberry Pi. This one was built by a colleague of mine in Southampton, with his son, a Lego Raspberry Pi system. I was quite annoyed when he pipped me to the post with that; he got lots of press for doing it. I was actually quite pissed that he beat me to it, anyway.

Cambridge is an interesting location; it's really a global tech centre. The university itself has a turnover of around £1.2 billion. We have a research budget of £500 million, which is one of the largest research budgets in the UK. I think more interesting than the university is the tech cluster. Cambridge has a really active technology sector: there are 1,500 technology companies in Cambridge with an annual turnover of around £12 billion, employing 53,000 staff. That's why it's really difficult to get good engineers in Cambridge, because they're all working here earning much more money than I can pay them. We just have to make things interesting for our engineers; it's the only way we can keep them. We have a mandate from the university to provide research computing services to the university and to the technology cluster. The university really does like to foster and assist this technology cluster, it's an integral part of the city, and we now have a mandate to provide services to that cluster as well. If I look at the focus areas of my division, we primarily support research in the university; that's our core function. We have a strong industrial outreach function, where we project services out to UK industry, and quite an active solution development function. All of this is aimed at driving discovery, impact and innovation. That's the agenda of the university as a whole, and it's our agenda too.
The research computing team is structured across six divisions, with 28 FTEs in those six internal divisions. Just a quick slide on outputs and usage. We support 700 active users. I think we have a thousand registered users, but a lot of those have retired, or died, or realised they logged onto the wrong system; 700 people actually use us actively, across 42 departments. Our systems are around 80% utilised constantly. That's quite a high utilisation, especially when you have large parallel jobs and you're waiting for systems to drain to give access to them, so we're really pleased with that kind of average utilisation.

A very interesting demographic has appeared over the last five years: the emergence of what we call the long tail. It's kind of an HPC centre director's dirty secret, really, that in the past 95% of your usage was by 5% of your users. That's not a good statistic when you try to get money from either your university or central government; they don't like that stat, so we don't normally talk about it. Over the last five years things have changed, and we can talk about it now. Those 5% are still there, but they only consume around 50% of my resources. The other 50% is consumed by a much larger number of smaller users. This thing's got a hair trigger. We have around 300 users who are consuming at this kind of level. If I say core hours, my boss doesn't understand it; he doesn't know what a core hour is. I give him this number as workstation days, and he gets that. A workstation day he understands; 200,000 core hours and he's got no idea what I'm talking about. This is a boss slide: when I try to justify my existence, when I want a pay rise, I bring out this slide, and it kind of works. 200 workstation days in the last 12 months.

This is quite a nice figure: ever since we started nine years ago we've had this kind of compound annual growth rate of new users on the system. That's been constant growth at that rate every year for nine years, and I think when we move to more open ways of accessing the system, models we can employ with OpenStack, that growth rate will go up. I think this is the most important point on the slide when you're trying to get money out of the university: we're currently supporting £253 million worth of research projects. When I look at all those user groups, ask them the value of the grants we're supporting and add it all up, it comes to that number, which is 17% of our university's income. This is quite a nice number when I try to get the university to fund me: we support a lot more research than we cost. Over the last nine years we've supported 1,400 publications, and about 300 publications a year are now produced out of work that gets done on the central system. Again, that's a nice number to give to your boss's boss.

Just a quick word about infrastructure. I'm an HPC guy and we love infrastructure; I know I'm at a software conference, but HPC guys love infrastructure. We managed to convince the university to invest in a new data centre. This is the HPC hall: we have around 100 cabinets in the HPC hall and two megawatts of IT load. The cooling is quite interesting; you can't see it here, but we have water-cooled rear doors. The PUE of this data centre is 1.1, which means we only consume 10% of the energy that goes into the computing to cool it. Most non-optimised data centres, even today, run at a PUE of 2: it costs them as much energy again to cool the computing as the computing itself consumes.
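To put those PUE figures in concrete terms, here is the arithmetic implied by the numbers above (a quick worked example using the 2 MW IT load quoted for the HPC hall and the definition of PUE; the absolute numbers are only illustrative):

```latex
\mathrm{PUE} = \frac{P_{\mathrm{total\,facility}}}{P_{\mathrm{IT}}}
\quad\Longrightarrow\quad
P_{\mathrm{overhead}} = (\mathrm{PUE} - 1)\,P_{\mathrm{IT}}
```

At PUE = 1.1 with a 2 MW IT load, the cooling and distribution overhead is (1.1 - 1) x 2 MW = 0.2 MW, i.e. 10% of the IT load; at PUE = 2 the same hall would need a further 2 MW on top of the computing itself.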
1.1 is a really nice number, because electricity in the UK is expensive. You guys here are paying what, 7 or 8 cents a unit? In the UK we're paying 18 cents a unit; power is really expensive in Europe. Our current platforms: we have around 900 Dell servers across a few different systems, a large Intel x86 cluster with InfiniBand and a large GPU cluster. These are all quite large systems, reaching quite nice numbers in the Top500. The Top500 is kind of a league table of supercomputers, and HPC guys love to compete in that league table to get higher up the list. This machine was number 93 when it entered the list in 2012. The GPU cluster was particularly interesting because it was very energy efficient; there's another league table for energy-efficient machines, and it reached number 2 on that list.

This year is going to be an interesting year. There are a lot of new platforms: the biomedical cloud that I'm going to talk about, and a large upgrade to our central cluster, which will be a thousand nodes of Intel Broadwell. There's a lot of work going on in storage. We're really seeing an explosion of data science at the university, and this is echoed in every university; computation has really now moved on to data. Most of our ways of dealing with data were not keeping up with demand, so we've had to completely restructure the way we store data, moving to a kind of hierarchical system to keep it under control.

Let's get on to OpenStack. Why do OpenStack? Why am I here and not enjoying my time in Cambridge? From my perspective, from a central IT person's perspective, there are strong drivers. We wish to make computing, data, applications and workflows a lot more accessible, a lot more flexible, and more secure. I don't want to be relying on Linux permissions to keep my biomedical data in the data centre and not on some undergraduate's desk; we need better security models. Also, I don't want to keep having a range of different frameworks for all the IT I do, and I don't want to keep having to employ specialist engineers like this one. This one's okay, but it's difficult to find them, keep them and nurture them; I would much rather be drawing my engineers from a wider engineering base. So my driver is that OpenStack possibly allows me to have a single framework from which to offer a lot of different IT services, while drawing on a wider engineering base to do it.

My customers' drivers: what does this guy want? I don't know what this guy wants. This is a strange guy, but my customers are strange guys; that's the problem. I can really happily say that my customers are strange people. This is videoed as well, isn't it? My customers are great people, I love them. What does he want? He wants computing to be easier. He wants to be able to share his computational models, his methods and his data, and basically all he wants is to decrease the time to science, increase innovation and increase research outputs. It's quite easy actually; he doesn't want much. Easier computing, reduced time to science: that's really what it's about.

What are we doing in Cambridge in terms of OpenStack? There are two main focuses for us. We're looking at development and deployment of OpenStack across a broad range of research computing use cases, which I'll talk about in the next few slides, with a particular focus on biomedical computing. We're also involved in a very large astronomy project called the Square Kilometre Array.
This is one of the next very large scale big data, big science projects. All the computational design for that project is headed up by Professor Paul Alexander in the astrophysics department in Cambridge, and we've been contracted by Paul to look at certain aspects of the computer design. We're looking to use OpenStack for the base control and monitoring system within that radio telescope. Basically, radio telescopes do big Fourier transforms, but very big Fourier transforms. When the first phase of the SKA comes online in 2020, which is 10% of the experiment, we need a 300-petaflop Fourier transform machine, and we wish that machine to be running OpenStack. That's tens of thousands of nodes running OpenStack, and we're in the design phase of that.

We have a good partnership of companies that we're working with at the moment: Dell, Intel, Red Hat, Mellanox, Nexenta, and StackHPC, which is a specialist company in the UK looking at this convergence of OpenStack and research computing. Last but not least is this emerging community that is here in this room. I think the people in this room are going to make this fly or not. For OpenStack and scientific computing to flourish, the community has to get organised and pull together, and I see that happening here at this show; it's very encouraging.

Use cases. I thought I'd spend some time looking at use cases for OpenStack within the research computing domain. The first two are really when I put my service provider hat on: we want to provide infrastructure and platform as a service for researchers that have persistent research computing needs. This is not scientific workloads; it's for that web server or that persistent infrastructure in the department or research group that would normally go out and buy a beige box and keep it for 15 years. I do have customers with 15-year-old equipment in their machine room who then want to give it to me to put in my machine room. We say no, that should go to landfill. This stops that; it helps us get rid of all that old crap from my customers' machine rooms and get them running on a virtualised infrastructure. That's one and two. Three is really kind of Chameleon-like: this idea of having application development provided as a service from the centre is a really good idea, and I shamelessly copied it from Chameleon because it's a good one. Four is research computing as a service: VMs as a service for researchers, to attack that long tail and make VMs accessible for sharing workflows. I really like the Jetstream scenario; Jetstream does this really well. Five is HPC as a service; the last talk was about that, and the guys at Monash are doing this. This is where you run your HPC systems within a virtualised environment. The sixth use case is data. Data is, I think, one of the really interesting scenarios here: you might have a large public data set, and you don't want to be dragging that data all around the country to the different researchers. You want to bring the researchers to the data, but then they complain that the environment is not right. Virtualisation gives you the chance to bring the people to the data and give them their own environment when they get there. I think we really are moving to data-driven science, and OpenStack and virtualisation have a huge part to play in enabling data-driven science and letting people bring their own environments to the data. Let's talk about the biomedical cloud.
This is the new system that we're building now. Actually, we're not building it now, because the guy who's building it is sitting there. Actually, I've got an email waiting for you, because the customer has now noticed that you're not there and is asking where his system is. So as soon as we get back to England, Wojtek is building that system.

What does the biomedical cloud actually do? It's designed to be a single compute and data platform to link different research communities at the university together. It links academic researchers and their data to clinical researchers and their data, and it also brings in more sensitive data from the hospital: medical records and live telemetry feeds from the hospital. We want to drive research outputs in the clinical domain and within the academic domain, and hopefully translate some of these methodologies back into patient care in the hospital. This is what we call translational medicine, and I've got a slide on it. Really what we're trying to do is this: the biomedical cloud will enable a kind of virtuous circle where we take patient data, drive it into research programmes in the university, feed that into clinical trials and, under the correct kind of ethics, into patient care, and translate those outputs into new treatments. You can only undertake that translation if you have this virtuous circle, and cloud infrastructures really can help this process.

What does the biomedical cloud look like? This is my kind of noddy cartoon that I drew up; Wojtek, I think, is now laughing at me because it has no wires in it. This is what I think it is, from a director's point of view. We have storage that sits in the NHS, on the hospital side, on the NHS secure network. We then take data warehouse products out into our secure storage location on our research network. Under the right ethical and compliance regime, we can let research staff get their hands on that data and undertake research programmes. We have 10 petabytes of Lustre; from the HPC space, people are used to using large parallel file systems. There are two petabytes of NexentaStor, which is a more enterprise-type storage system, NexentaEdge, which gives us our block storage, and a large tape library. This allows us to have quite well-enforced policies to stop these file systems clogging up: we have a hierarchical storage system and an HSM to move data continually on to the tape. It's not all OpenStack at the moment; the only OpenStack component within the system is 2,000 cores. There's a traditional 2,000 cores of HPC and a Hadoop cluster as well. We do plan, as we get more experience and have tested more ways of working, to put all of this within an OpenStack framework.

Why are we doing this? What are our drivers? We have three stakeholder projects. The funding for this was just over $3 million, and that came from three lead projects. The first one is a genomics project. We have a contracted relationship with a company called Genomics England. This is a public company, fully owned by our National Health Service, tasked with sequencing 100,000 patients with rare diseases. My team is writing the software stack for this, which will do the gene variant analysis component. This software really offers breakthrough functionality and performance, because traditional methods just don't scale to that size.
We will deploy this software on our own infrastructure, not to feed Genomics England but to feed a similar project we're helping with in the university called the BRIDGE project, which has to sequence 10,000 rare disease patients. That will feed into Genomics England, and we're already working quite closely with that group on their pipelines, helping them run within the biomedical platform. This is a very similar use case to the one just discussed by Monash, and we will shamelessly be taking some support from our Monash colleagues, who are ahead of the curve.

We have large imaging, microscopy and structural determination groups in the university. There's medical imaging, brain scans and various other scans, and then there's microscopy. This is apparently a microscope. I used to be a microbiologist and I never saw a microscope like that, but this is how microscopes look today; this is really funky stuff. These microscopes produce an awful amount of data, loads of images at really high throughput. Then there's structural determination. A lot of computational biology relies on structures of molecules that are traditionally produced by X-ray crystallography, where you crystallise your sample. Many biological samples of interest do not crystallise, so you're kind of stuck. This method doesn't need crystals: it freezes the samples and puts them in an electron microscope, and you can get very good resolution. This whole field has gone through a real revolution just recently, and again it produces a lot of data at the end point and needs a lot of computation in the middle. All these techniques are transforming biology and medicine, and they can't work the way they used to. They need high performance computing resources, and our industry does not serve them well with traditional methods; we need new ways of providing them with computing and data resources. The chaps at Monash have been doing this very well, so why reinvent the wheel?

The last area that we're going to look at is predictive medical informatics. This one I find personally really interesting, and there's a lot of low-hanging fruit here. We can work with the hospital: you take medical records, patient records, you take live telemetry feeds from the hospital, people's blood pressure and various other stats, you run quite simple statistical models on that data, and you can come up with some really interesting predictive statistics that can genuinely improve patient health. Here's a really good example. This chap here, John Cromwell, is from Iowa; I met him at Dell World last year. He has a statistical model that takes live feeds from the operating theatre, combines them with medical records, and predicts during the operation whether the patient is likely to get a post-operative infection. They change the procedure accordingly, and by doing that they can cut the post-operative infection rate by 58%. That's kind of cool, and there are loads of things you can do like that if you set up the infrastructure and you have the statistical models.

I think that's my part of the talk done. I'll now hand you over to this chap. That's actually Wojtek; it's quite a good likeness.

I try my best. Afternoon. My name is Wojtek Turek. I work for Paul, and I lead the research computing platforms team. In practice that means I have to design, build and then make work whatever Paul dreams up. We are driven to deliver cutting-edge, high-quality technology to our researchers, to drive research at the University of Cambridge.
I'd like to mention my colleagues who contributed to the talk and to some of the data in the slides. Stig Telfer, I think everyone knows Stig; he works with the university, he's our OpenStack consultant and he's the driving force behind our OpenStack efforts. And Matt, our HPC specialist, who has done a lot of work around benchmarking and building the storage platform which we're planning to use with OpenStack. I would also like to mention our partners: we teamed up with very good vendors to deliver a high-quality, production OpenStack system.

The key characteristics of our OpenStack biomedical cloud: at the university we actually have a Red Hat site licence, so it was natural for us to take advantage of that, and our production OpenStack platform is based on the Red Hat OpenStack Platform. That gives us very good automation and management, and also some control over updates. These things get better with each OpenStack release: every new release has a better way of updating and upgrading its components, which is very important for a production system, because obviously we don't want three or four days of downtime when we're upgrading our OpenStack system. Also, because it's a production system we would like to have high availability, so we use Pacemaker, HAProxy and other techniques like Galera for database replication. We're also looking at different types of storage: bioinformatics requires decent storage to work on the data. It's a lot of data, so delivering high-performance storage to the compute nodes is a key component of the platform. We look at different storage systems, so our Cinder will have multiple backends for different functions. And to deliver that storage we obviously need a high-performance network. We use a Mellanox network, the same as we do for HPC; our HPC runs InfiniBand, and here we use Mellanox Ethernet. This network delivers capabilities such as RDMA, which can accelerate your storage.

You've probably seen this slide many times today. This is the reference architecture for our OpenStack. It's Red Hat based; it has the undercloud component, which manages and deploys the overcloud components; we've got compute nodes, multiple controller nodes to provide HA, and then multiple Cinder backends. As Paul mentioned, we would like to deliver a stable and secure platform, so we isolate all the networks, and for performance we isolate the storage network from the other networks as well.

This is based on the London Tube map, if you've seen it; it's actually showing the University of Cambridge network. Why is this important? This little circle is our data centre, where the main part of the platform will be located, and all these little points are different departments and colleges. The University of Cambridge is spread across the whole town, and we actually own the network: we run our own fibres, so we can connect different parts of the university directly with dark fibres and enjoy very high performance connections. At this end here we have the hospital, which is one of the biggest research hospitals in the UK.
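As a concrete illustration of the multiple-Cinder-backends point above, here is a minimal sketch of what a multi-backend cinder.conf can look like, written out with Python's configparser. The backend names, volume groups and the choice of the LVM driver are illustrative assumptions, not the actual Cambridge configuration; the iSER setting reflects the RDMA acceleration discussed later in the talk.

```python
# Minimal sketch of a multi-backend Cinder configuration
# (illustrative values, not the actual Cambridge deployment).
import configparser

cfg = configparser.ConfigParser()

# cinder-volume loads one section per backend named here.
cfg["DEFAULT"] = {"enabled_backends": "ssd-iser,bulk-lvm"}

# An RDMA-accelerated (iSER) LVM backend for a fast SSD tier.
cfg["ssd-iser"] = {
    "volume_backend_name": "ssd-iser",
    "volume_driver": "cinder.volume.drivers.lvm.LVMVolumeDriver",
    "volume_group": "cinder-ssd",
    # In releases of this era the option was iscsi_protocol;
    # newer releases call it target_protocol.
    "iscsi_protocol": "iser",
}

# A second, capacity-oriented backend using plain iSCSI.
cfg["bulk-lvm"] = {
    "volume_backend_name": "bulk",
    "volume_driver": "cinder.volume.drivers.lvm.LVMVolumeDriver",
    "volume_group": "cinder-bulk",
}

with open("cinder.conf.sample", "w") as f:
    cfg.write(f)
```

Each backend is then exposed to users through a volume type whose volume_backend_name extra spec matches one of the sections above, so a tenant can simply request an "ssd-iser" or a "bulk" volume.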
So you can see we also have diverse connections from the hospital to the data centre, so we can provide a highly available network connection to our customers. The hospital has devices like PET scanners and gene sequencers; they produce the data and have to ship it to the data centre, so we have that infrastructure in place, and it's very good. That's thanks to our network team, who run it; they're doing a very good job.

This is a high-level physical view of our OpenStack system and how it connects to the other elements in the data centre. That green square represents a rack, and that element here is an availability zone in OpenStack; we've got three similar blocks like that. You probably can't read it because the font is very small, but there is a one-gig network for management and IPMI inside the racks, we've got 50-gig connections to our compute and storage, and then we also have 10-gig connections to the compute, which we will utilise for external networks, for provider networks and access to departments. Two of these blocks are physically located in our main data centre. Our main data centre is a world-class data centre built last year, with a capacity of 200 racks and three megawatts, with N+1 redundancy, so it's a very good data centre. It would take a big disaster to take that data centre down, but we're actually not very far from the airport and anything may happen, so for that reason we deployed part of the OpenStack in a second location, which is a smaller server room; I wouldn't call it a data centre. In the main data centre we also have our HPC cluster, HPC storage and a tape library for archiving. The two locations are connected via the dark fibres shown on the earlier slide, with a 2 x 40-gig connection; Mellanox have long-haul 40-gig transceivers for their switches, so we have pretty decent bandwidth across the sites, and we divide that link between different functions: one part for OpenStack, another for the tape library, and other components. We also designed it with the thought that in the future we might want to take one of these blocks down and do something different with it, maybe run Ironic and bare metal, or change the hardware. In a way they are the same, so we can migrate VMs off one of them, disconnect and change it, and then bring it back, maybe even with a different version of OpenStack; it's quite a flexible design in those terms. Our storage is also distributed across multiple sites: we've got block storage in the main data centre and another in the second location, and we also have the tape system in the second location for the archive.

A little bit about the hardware. As you can see, it's not that bad: I'm not there, but things are actually happening, and people are putting cables into servers and switches, so don't worry, Paul. Our compute nodes are Dell C6220s, which are dense, four servers per 2U; you can see them here. There's lots of cabling in there, but it's quite neat cabling. We try to utilise space as much as we can, because space is precious and we're running out of it; we've got lots of systems. You can see the one-gig network here, and the 50 and 10-gig network here, and this is one part of the control plane: we've got a controller node here, and then some of the object and block storage. Essentially that's one of the green blocks I showed you on the earlier slide.
We also use Nexenta storage; we connect to NexentaStor. Our network, as I mentioned before, is Mellanox: the core network is a 100-gigabit network. These are actually the new Spectrum switches, which have the capability of offloading certain functions. In the future they will be able to offload VXLAN, so you'll be able to do VXLAN on the switch, which will enable us to redesign our network. At the moment this is an L2 network, and obviously for scalability we would like to have L2 inside the rack and then L3 above the racks, and that will be possible with high-performance VXLAN when Mellanox brings that function online on the switches; from what I hear they are working very hard to make that happen as soon as possible, which is great.

We also have a development platform, a playground, which is essentially a scaled-down version of the production system. It's pretty much the same hardware, which is obviously the key thing, because you want to test on this hardware: we bring the new features and the new firmware online on the development system first, rather than testing them on the production system. One of the things we are doing with it now is deploying this generic storage device. It's a server with 24 SSDs in RAID 10, with an optional JBOD, a Dell MD3460 enclosure with 80 nearline disks attached to it. We are exporting that to our compute nodes using the iSER protocol, which is like iSCSI but with RDMA acceleration. Everyone says this is great, high performance, high throughput, low latency; at least that's what you read in the Mellanox PR. So we decided to test it ourselves, and my colleague Matt spent quite a lot of time over the last few days doing tests, so we got some benchmarks done. We used fio, which is quite a popular benchmark, and we only benchmarked the SSDs, because they are the more interesting part. As you can see, the system has 24 Intel SSDs and the server is quite powerful as well. In terms of throughput we see a really big difference on this graph, where bigger is better: with iSER we see more than 5 gigabytes a second for reads and slightly over 4 gigabytes a second for writes, compared with plain iSCSI, so it's a huge difference. For IOPS it's a similar story, bigger is better: we are getting just below 600,000 IOPS for reads and around 500,000 IOPS for writes, and iSCSI is not doing nearly as well, so there is clearly a big advantage from the RDMA acceleration. Latency is also very important. These are huge numbers, and this is actually a logarithmic scale; here lower is better, and there is a big difference between iSCSI and iSER in terms of latency as well. So iSER is winning on throughput, IOPS and latency, and I think what's really neat about this is that it's a very cost-effective and flexible way of delivering high-performance storage to your compute, and it can work well in a typical HPC application or in OpenStack. These tests were done on bare metal; we haven't done the tests under Nova yet, that's the next stage, and we'll be publishing them. Part of the work we're doing here is to design reference architectures for these storage systems, run benchmarks to validate that what the vendors say is actually true, and then publish the results.
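For reference, here is a rough sketch of the kind of fio run behind numbers like those, wrapped in Python. The device path, block size, queue depth and job count are assumptions for illustration, not the exact job files used for the tests on the slide, and the JSON field layout can differ slightly between fio versions.

```python
# Sketch of a random-read fio benchmark against an iSER- or iSCSI-attached
# block device (illustrative parameters only).
import json
import subprocess

DEVICE = "/dev/sdx"  # hypothetical attached block device

cmd = [
    "fio",
    "--name=randread",
    f"--filename={DEVICE}",
    "--ioengine=libaio",
    "--direct=1",          # bypass the page cache so the device is measured
    "--rw=randread",       # use randwrite / read / write for the other panels
    "--bs=4k",             # small blocks for IOPS; larger blocks for bandwidth
    "--iodepth=32",
    "--numjobs=8",
    "--runtime=60",
    "--time_based",
    "--group_reporting",
    "--output-format=json",
]

out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
read = json.loads(out)["jobs"][0]["read"]
print(f"read IOPS: {read['iops']:.0f}, bandwidth: {read['bw'] / 1024:.1f} MiB/s")
```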
Part of what the Scientific Working Group, which had its inauguration at this summit, is for is to make sure that all these outputs are captured and that everyone can find them and make use of them, so that the work is not lost and we actually make progress. This is some of the future work, which is interesting for biomedical applications but not only those. My HPC background is storage, and we're running big Lustre systems, more than 5 petabytes, and we would obviously like to take advantage of these existing systems. There are roughly three usage models for Lustre with OpenStack: use an existing Lustre file system from within OpenStack; use Lustre as a shared storage backend through services like Manila; or deploy Lustre as a service, deploying an entire file system within the tenant environment just to provide high-performance, high-throughput shared storage within that tenant environment. But there are challenges with Lustre, and we've been discussing them during the meetups on Monday. To make Lustre a true multi-tenant file system, a number of things have to happen: we need to implement a mapping feature for UIDs and GIDs, enable subdirectory mounting, and add authentication for the clients, for example using Kerberos. People who know Lustre, follow Lustre development and go to Lustre conferences know that these things are happening; those projects have already started and are being progressed. Obviously, meetings like this and the scientific working group help to drive it, and Intel is a big contributor to Lustre, so we work with Intel and try to make sure that this work progresses as well.

In Cambridge we have something called the Cambridge Intel Solution Centre, where we have a number of projects developing solutions for HPC and beyond, and we work closely with the Dell group led by Onur, who is actually here; from our previous experience a lot of good stuff comes out of that work, so we decided to start the OpenStack project through that vehicle. Intel have the Intel Enterprise Edition of Lustre and they are a key contributor to Lustre development, so we are hoping to get traction in making Lustre more usable in OpenStack environments. You may know that on AWS you can already use Lustre, as something called Lustre Cloud Edition: if you spin up AWS instances, you can deploy Lustre in a few minutes. We would like to take that work and port that functionality to OpenStack. It is possible to do it now, more or less manually, but it would be really good to work with Intel to get that automation done and perhaps add Lustre to the application catalog, so that anyone who is effectively a customer of ours, or a user of OpenStack, can launch a Lustre file system straight away without the special skills that you currently need to stand Lustre up, because it is not easy. And to make it actually work: it is not difficult to spin up Lustre in OpenStack, but to make it work in high-performance mode you need high-performance storage underneath, and the work I showed you earlier enables that, because iSER block storage delivered by Cinder can provide that level of performance. We will be doing more benchmarking tests and we will publish them, but the work we have done in the lab already shows a clear and big benefit from using RDMA acceleration for delivering storage to OpenStack.
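To give a flavour of what the subdirectory-mount feature mentioned above buys you for multi-tenancy, here is a minimal sketch of a client mounting only its own subtree of a shared Lustre file system. The MGS address, file system name and tenant path are hypothetical, and this assumes a Lustre release that supports subdirectory mounts.

```python
# Sketch: mount only one tenant's subtree of a shared Lustre file system.
# All names and addresses below are hypothetical placeholders.
import subprocess

MGS = "10.10.0.1@tcp"        # management server NID (placeholder)
FSNAME = "biomed"            # Lustre file system name (placeholder)
TENANT_SUBDIR = "tenant42"   # the subtree this tenant is allowed to see
MOUNTPOINT = "/mnt/lustre"

subprocess.run(
    ["mount", "-t", "lustre",
     f"{MGS}:/{FSNAME}/{TENANT_SUBDIR}",  # subtree, not the whole file system
     MOUNTPOINT],
    check=True,
)
```

Combined with UID/GID mapping and Kerberos client authentication on the servers, this is the kind of building block that would let a single Lustre file system be shared safely between tenants.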
Again, I will reiterate that all this work will be documented and published as white papers, but also through the OpenStack community and the scientific working group. That's all from me; questions? Our email addresses are there too if you would like to contact us. The red one is for the boss; he never answers his emails, so don't email me.

Your infrastructure, since you are partnering with... I assume it is based on delivering the computing power through virtual machines rather than containers?

Currently virtual machines. Because of the security and isolation requirements from some of the projects in the hospital, at the moment we are looking at the more traditional cloud approach using VMs and KVM, and that has been driven by that particular user community: they are used to working in VMs, it is easy for them, and actually the main driver is our customers saying that is the way they want to access resources, application stacks and workflow stacks within a VM, which they then share. That said, I went to see the professor of medical informatics a couple of weeks ago, Lydia Drumright, and she asked, where are your containers? So these people are quite advanced in that way of working, but currently we are just doing VMs. I haven't mentioned it before, but we have actually had a small POC platform for almost a year now, used by the hospital, and they were very happy with it. That platform wasn't even RDMA accelerated, but I think the reason they were happy was the flexibility that OpenStack gives them: they can get their own resources, build their own infrastructures, just get on with it and not wait for the IT department to do it, and it's very quick to change things. A key advantage of using OpenStack is the flexibility you get, but obviously our job as the central university information services department is to make sure that we provide a platform which is flexible but also provides these performance features.

My second question: you have certain expectations from your customers, which are the departments of the university. How do you handle orchestration when delivering very fast compute at the time it's demanded by a department? Do you have any special tools for orchestration?

In our traditional HPC environment there's a batch scheduling system: people put their jobs in the queue and we can give them priority or not, depending on how much they pay, because we run a cost centre; money is a great leveller, so those that have more money run faster through the queue. One of the advantages of the virtual machine model is that we can over-provision and possibly give more instant access. This is another driver for moving to a virtualised platform: we can give more instant-on, because traditional HPC is certainly not instant-on; you wait in the queue, there can be gaps, and you can get in if you have a high priority. For that smaller workload, having more instant-on, with the ability to over-provision, is probably what you get with this, right?
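As a back-of-the-envelope illustration of that over-provisioning point: the core count and ratio below are made-up figures rather than the Cambridge configuration, but in Nova this is what the cpu_allocation_ratio setting controls.

```python
# Illustrative arithmetic only; values are assumptions, not real settings.
physical_cores = 2000        # hypothetical OpenStack compute partition
cpu_allocation_ratio = 4.0   # hypothetical Nova cpu_allocation_ratio

schedulable_vcpus = int(physical_cores * cpu_allocation_ratio)
print(f"{schedulable_vcpus} vCPUs can be scheduled on {physical_cores} cores")
# -> 8000 vCPUs: bursty long-tail VMs rarely all run flat out at once,
#    so far more users get "instant on" access than a strict 1:1 mapping allows.
```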
Also, in terms of network orchestration, there is a tool from Mellanox called NEO to orchestrate the network, driven by Neutron, and we may or may not use it: Neutron drives that software, and the software talks to the switches and configures them. It sounds good, but in practice there is probably a bit of work to do to make it work well, which we will do. The other key element, as I showed you in one of the slides, is that we have this very good fibre network. What we can do is provision, on demand, access directly to a department over our own fibres, with very high throughput, and it can be dedicated to a specific research group. So in that respect we have an almost dynamic way of provisioning very fat pipes for the storage connections, and that's important for some of the groups, because they produce a lot of data. Traditionally they did the compute in their own place, but as the research grows they have big requirements for the central facility, so the natural way is to move to the data centre. One of our key challenges, maybe not issues, is to make sure we provide these fat pipes to the departments so we can pull the data in. I think one of our responsibilities is also to have the storage infrastructure in place: not just high-performance scratch for the compute, but multiple tiers of storage, so that when a scientist moves data to us to compute and gets the results, they don't need to take that data out again; they need to be able to store the data in the same place. This is a big problem, and we already have it, because previously we didn't have a multi-tier storage strategy and a lot of the complaints were about moving data out and in. In a way you shoot yourself in the foot, because pushing data into your system and pulling it back out means the system is always busy and people are fighting over it all the time. So yes, hopefully we are moving into better times with the multi-tier strategy.

On the NexentaEdge platform: I don't know that much about the Nexenta line-up at the moment, but is that being used for volume storage or something?
It's a new platform; Nexenta released it last year, so we are kind of pioneering it with them. We are using it for block storage initially, but it can do both block and object, so we will be testing the object function as well.

And is it a tier for the NexentaStor?

No, it's actually completely unrelated; we just want to evaluate these different platforms, because the hardware gives us block and object at a cost that I can afford to pay. Looking at the hardware, the infrastructure is all the same really: whether it's Ceph or NexentaEdge, the hardware looks essentially similar, and the NexentaEdge architecture is promising high-performance access to block and object. There is a point here that Wojtek hasn't mentioned: we can have the Dell storage hardware and then differentiate that hardware, to Lustre one way, to NexentaStor another way, to NexentaEdge another way. Just by different software configuration I can meet all the storage requirements that we have, with one hardware platform and different software platforms on top, all of it supported. So our hardware fabric is essentially the same across all the systems, HPC and OpenStack. This is actually very important, because from the operational point of view, having multiple hardware platforms and many vendors makes it hard to provide efficient production services; it's okay for development, but if you actually have to run a production system, it matters.

Just wondering, with this cloud, is the idea to provision a cluster one request at a time, from a user or a group of users, or is it maybe that for one team you give them a cluster?

In our initial roll-out we have not virtualised or put OpenStack on the HPC cluster; we plan to do that, and what you're describing is exactly one of the goals. In this configuration we've just provided virtual machines, not a cluster configuration, for that kind of single-node, high-throughput workload, for that long tail that you might have heard people talking about. As I said, HPC has democratised and there's now a large number of smaller users, and those users will love the virtual machine. Traditional HPC and clustering will be our next stage, similar to what the guys at Monash have done: you can deliver traditional HPC within OpenStack, where I could spin up, on demand, even a virtual cluster. That's your cluster: I spin it up with your image, you run your scheduling environment on it, and then I tear it down and it goes back to someone else. There are various modes; that's next on the list, so we've started simply for now. The infrastructure enables it very well, because we have the RDMA-accelerated storage and we are working on enabling SR-IOV within the virtual instances. From there it's really about working out the most efficient, automatic way of spinning up the cluster; there are a number of projects out there that you can use to spin up a Slurm cluster within OpenStack (there's a rough sketch of the idea below). It's technically feasible, but we need to test it and then turn it into a proper service, because we work in a central IT department and we've learned from history and previous mistakes that we should only roll things out once we've really made them into a service and we know how they behave. So that's future work. The point of architecting this infrastructure is to enable all these different functions, but over time; we don't try to do it all at the same time, because this is a production system, so we'll bring things online in steps.

Thank you.
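Here is a minimal sketch of that spin-up-a-virtual-cluster-on-demand idea using the OpenStack SDK. The cloud name, image, flavor and network are hypothetical placeholders, and a real Slurm roll-out would also need cloud-init or an orchestration layer (Heat, Ansible and so on) to configure the scheduler on the booted nodes.

```python
# Sketch: boot a small virtual cluster with the OpenStack SDK.
# Names below are placeholders, not the Cambridge environment.
import openstack

conn = openstack.connect(cloud="biomed")          # assumes a clouds.yaml entry

image = conn.image.find_image("slurm-node")       # hypothetical prebuilt image
flavor = conn.compute.find_flavor("m1.large")
network = conn.network.find_network("tenant-net")

nodes = []
for i in range(4):  # e.g. one head node and three compute nodes
    server = conn.compute.create_server(
        name=f"vcluster-node{i}",
        image_id=image.id,
        flavor_id=flavor.id,
        networks=[{"uuid": network.id}],
    )
    nodes.append(server)

# Wait until every instance is ACTIVE before handing the cluster over.
for server in nodes:
    conn.compute.wait_for_server(server)
    print(server.name, "is up")
```

Tearing the cluster down again is the same loop in reverse with delete_server, which is what makes the lend-it-out-and-reclaim-it model described above practical.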
Do you plan to use Ceph, on top of or in addition to this?

There's one thing about Ceph which is quite interesting. Ceph looks very nice, and then you go to Red Hat and ask them how much it costs to support a petabyte of Ceph, and that's an interesting discussion: the support model scales with capacity, and when you scale it up to the petabyte level it's not economical, so we're not using Ceph because of the cost model. We're actually using Ceph on the POC platform, but we're talking about the production system here, and for the production system we rely on vendor support, so we go for support on all the components of the system.

So Ceph is not in the...

No; it's the cost model for support at petabyte scale.

The second question is: did you gather user requirements for bioinformatics workloads, to provide very fast SSD-based storage? That's what you have right now; you are building this SSD-based.

We can do it that way, or link it through to the Lustre system. Currently we support some very large bioinformatics pipelines that have a well-maintained, well-engineered Lustre file system, and that's fast enough, but the work we're doing with the SSDs has that in mind, because it's more flexible and we can stand it up on demand.

By the way, I'm running a 3-petabyte Ceph cluster for bioinformatics at zero cost. We can talk after, if...

Obviously we could do Ceph and have no commercial support, and that's fine, but it's that once-in-four-years problem. We used to run very large parallel file systems, and it's all fine and you're a happy man and that goes on, and then once every four years you get that one problem with your file system that your engineer doesn't know how to fix.

We are now using Ceph for file system storage and object storage. It depends how you store your data, your results, in Ceph.

We think Ceph is great, and one of our key goals, because we work with Red Hat, is... it will come, as a cost model. Maybe Red Hat will introduce Ceph as part of the infrastructure offer in the licence, making it more cost-efficient so everyone can enjoy it in that supported way. But yes, we've got skills in house for Ceph, and we've been using Ceph on the POC platform, so we think it's a great system as well.

The network fabric that you talked about: is that for research only, or is it for everything?

We own all the ducts that go through the Cambridge metropolitan area, so it links up every building.

I guess my question is, do you compete with the IP phones?

The IP phones have their own fibres, so we can have dedicated dark fibre; we just blow another fibre through the duct, so we don't compete. Maybe I should explain the network in Cambridge. Our team doesn't actually run the Cambridge network; the network that connects all the departments, and the fibres, are looked after by a different team. We've got the GBN, a network of dark fibres, and we have the CUDN, which is the L3 network, and that carries the phones and other things. That network is used mostly for access, like SSH and websites, but if you really want proper high-performance access to your HPC, then we enable it via the dark fibres. We get that flexibility to almost every department and every college: we can run a dark fibre and connect it directly to our systems, at either L3 or L2. And when we take backbone stuff down, the Vice-Chancellor can still talk on the phone; otherwise we'd have a problem.

Okay, great. Thank you very much.