Good morning, everyone. Sorry for the little delay there. My name's Tom Fifield; I'm the Cloud Architect at Nectar. Midway through this presentation I'll be joined by my colleague Sam Morrison, who's the technical team leader at the University of Melbourne for the Nectar Cloud node. Today we're going to go through a bit of context about what Nectar is and, hopefully, as the title suggests, deal with some of the interesting things we've encountered running a cloud for the research community over the past 13 months.

Just to set the context, I think it's appropriate to highlight some of the strategic investments the Australian government has made in the e-research sector. E-research, for those of you who aren't familiar, is basically research plus technology, and that's a good thing. This diagram is really complicated, but in general you can see a whole bunch of projects: timelines in green at the top and funding amounts at the bottom. You can see we're talking about a wide range of investments over a 10-year period. I probably don't need to go through all of it today, but just to point out some of the other projects to set the context: Australia has an academic and research network that today provides 10 gigabit connectivity to universities and research institutes around the country, with tens of gigabits internationally. That's getting an upgrade, for example through the $37 million research network program there, so that within the country, by the end of the year, we'll be doing about 80 to 100 gigabit between sites and increasing the international link capacity to about 100 gigabit. There are other projects to do with peak computing, including funding high-performance computing for the life sciences. Climate change, of course, is a very important one, and there are other projects such as radio astronomy and the Square Kilometre Array. There has also been, since 2008, some funding for collaboration and research tools, and that is where the Nectar project came from in 2011. There's also a new project designed to provide data storage: the RDSI project, which has about $50 million of funding and will, in addition to our cloud, probably provide 100 petabytes of storage for research.

So what is Nectar exactly? It's actually an acronym: it stands for the National eResearch Collaboration Tools and Resources project, and we love our honey-style theme. Essentially it's a $47 million initiative out of a bucket of money from the Australian government called Super Science, and it's designed to enhance research collaboration through the creation of e-research infrastructure. It's divided into four different programs. You can see that we've got some infrastructure-style programs. The Research Cloud is what we're here to talk about today. We've also got the National Servers Program, which is a bit more like your traditional enterprise hosting environment; we use that for all of the really core services, things like AAA and those kinds of things. We'll also touch on the Virtual Laboratories and eResearch Tools, which are research software programs. Virtual Laboratories are quite sizable investments, between $1 million and $2 million, and they're designed to create exemplars of e-research.
For example, if you've got data coming off a beamline of a synchrotron, you want to store it in some storage, process it with HPC, move some analysis onto the cloud, and provide everything through a web portal afterwards. That's an example of a virtual laboratory. eResearch Tools are smaller projects, hundreds of thousands of dollars up to $1 million, designed to fix capability gaps in specific research areas. Anyway, that's what Nectar is as a project.

Who are we in terms of OpenStack? We were quite amused to see the University of Melbourne popping up in graphs like this, which shows the Folsom commits across all projects, and again in 12th place for Grizzly. So we must be doing something. It's a bit of a hack; it's mostly from documentation commits, and we all know those are really easy. Anyway, there are a couple of people at the summit that you might have seen in your review process before. This guy standing next to me: if you want to talk about how we've used cells in production, or basically any part of OpenStack, he's a fantastic generalist; you'll find him in the AUTHORS file for almost everything. We've also got Kieran, who is in there at the back, who has recently become a Horizon core reviewer, or whatever the proper title is. He's been doing a lot of work taking feedback from the real users that we've got and improving the usability of Horizon and extending it. If you've read that book, you probably know more than I do; I'm one of the authors of that.

But anyway, the Research Cloud, finally getting to it. It's basically a platform for innovation, and my boss, Glenn Moloney, famously says it's a platform for failure, which is really quite an interesting term. Research is an inherently risky business, and essentially what we're trying to do with this cloud is give researchers a very low barrier to access computational resources. That means success can happen faster, but it also means the cost of failure is low. I think if we've got cancer researchers who are logging onto the cloud and just using some extra cores, doing some random stuff they wouldn't otherwise do, just because they've got this extra resource, that's probably a good thing.

Essentially, it's an OpenStack cloud. It's split across eight different sites, but has a single API endpoint, and it's built to a research spec. Researchers, unlike many of the office workers around the world, don't just work nine to five; they tend to work crazy hours, 24 hours a day, and collaborate internationally across any kind of boundary. They're fantastic at working around any kind of policy. Essentially, we're dealing with any researcher in the country's publicly funded research institutes doing any kind of research. That's not just the hardcore sciences; we've also got lots of people from the humanities, as you'll see later on. In terms of the scale of the cloud, by the end of the year we have to spend all of our money on hardware, and it's going to be about 30,000 cores across those eight sites, run by completely different organizations, which Sam will touch on a bit later.

So, one of the questions we always get asked, just to run through it quickly: why are we doing this ourselves? Why wouldn't we just give lots of money to existing, fantastic commercial cloud providers? Part of the reason is to do with the funding and the politics in Australia, obviously, but there are also a couple of other reasons that we think are really significant and give us benefits.
Just to run through them quickly: what we found is a real honeypot effect. You'll see from our graph of users on the next slide that as soon as we created this cloud, bang, within the first couple of weeks we had about 300 users, and then the word of mouth started spreading outwards. Creating a community around our cloud, just like OpenStack has a fantastic community, is something that we're seeing, and users are helping users, which is great for reducing support costs. Local infrastructure is also more responsive to research needs. I'm not sure if you've ever tried calling up Amazon and saying, "Hi, I've got one researcher who just wants to do this, can you add a feature for me?" and seeing how they respond. I've done that; it doesn't work that well. We, on the other hand, can actually take feature requests and, thanks to OpenStack being a very flexible platform, change the middleware.

Our cloud is also free to researchers. To offer a service model like that, we need to have a great deal of control over the infrastructure at every level, because rather than a cost-based model where people pay real dollars to access the cloud, we deal with these really fuzzy things called research merit, and we judge research against other research for this finite resource to determine who gets that 1,000-core allocation. Number four is probably the biggest one, though: we've got a ton of existing infrastructure, data centers, data storage and scientific instruments around the country, and having a cloud in the same data centers as those, or with access to that very high-performance network I mentioned before, is critical. And then, just to round it off, data sovereignty. We all know lots about data sovereignty; we've got particular data, particularly medical data, where if it even crosses state boundaries we get in trouble.

So we went live in January 2012 and put out this announcement. Since then, as this graph shows, we've got almost 2,000 users. That little initial spike you can see there is what I mentioned before: there was just this pent-up demand in the research space for cloud computing, despite the fact that large commercial clouds like Amazon existed. Since then, we've accumulated a great many users who have already been able to publish more research based on the resources they've got available here.

One of the best parts about being involved in a project which has a mandate for openness is putting up slides like these. You can see we've got about 7,000 cores in total right now, and both of the sites at the bottom will be doing more procurements. You can see that we've got a whole range of vendors up there, and this, I think, is a fairly solid statement that OpenStack works with a large range of hardware. So thank you to all the vendors who gave us good deals, and please continue to do so in the future. You can also see on the map there that Australia, for those of you who don't know, is roughly the same size in terms of land area as the United States. To get from over here on the east coast to the west coast is about a five-and-a-half-hour flight, so, for example, going from the QCIF node to the iVEC node you kind of go down the coast in terms of the fiber, so we have some interesting things with latency there.

But anyway, on to some of the use cases. We've got a whole range. Sorry, that's cut off; "high throughput computing" is what that word says down the bottom. They range from high throughput computing to people just using a one-core web server.
And it's amazing what kind of impact having free access to a one-core web server has for someone like an archaeologist, who traditionally never really gets considered when infrastructure budgets are handed out. So we've got archaeologists out there. We've got people researching wine who are doing very interesting things with spectroscopy and chromatography. We've possibly got one of the few users of cloud computing in the world who has a lot to do with rock concerts: a project from the digital humanities who are basically taking all of these cultural databases, like the gig guide and things like that, 35 of them, combining them into a portal and making it so that humanities researchers can search across all of these databases, create virtual collections, annotate them, extend them and share them. That's a fantastic collaboration. We've also got some of the more traditional users. You probably saw this morning in the keynote that the particle physicists are a nice big user of our cloud. We've got radio astronomy, the Square Kilometre Array project; you should look them up, they've got some fantastic requirements as well. Climate science, genomics, marine science, even people looking at disaster management during forest fires, of which we have quite a few in Australia. So you can see there's quite a range there.

One project, just because someone dared me to put an Australian animal in this presentation: this is a crocodile, not an alligator. This is a multi-disciplinary project which I think is really cool, because you've got zoologists working with electrical engineers. Essentially they're looking at how animals move around so they can track the decline of their habitat based on the urbanization of particular areas. That's important because you kind of want to know if the crocodiles are moving into your suburb. So they have a little portal on our cloud where you can actually plot all of these animals moving around. I think that's pretty cool.

So we're live. What problems have our users had? We're going to try and be really honest here. Sorry, I can't really see the speaker's notes, so I keep referring to these slides to remember what's on there. We have had hundreds and hundreds of support requests since this went live, because researchers are at all different levels of IT, from the people with really fantastic white beards who have probably never used a computer before, through to the brand-new software engineering graduates who are just hacking 24/7. One of the biggest usability things is that it's very easy in OpenStack to start up an instance and have absolutely no access to that virtual machine. It's a great default security policy, but interesting in terms of usability, and that's probably our most common request, so we've looked at ways to make it more usable. Other usability things: some of our more advanced users had so many security groups that the launch button was pushed off the edge of the page and they couldn't launch instances. We allow anyone to upload any type of virtual machine image to our cloud, which is pretty gutsy according to some people, but that results in an enormous list, and currently in Horizon it's difficult to distinguish the good, golden images that we prepare as Nectar from all of the other images that people have made public. That's something we've been putting a bit of effort into.
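On that most common request, "the instance started but I can't get into it", what we walk people through boils down to opening a port and booting with a keypair. Here's a minimal sketch using python-novaclient; the endpoint, credentials, image and key names are invented for illustration, not our actual values.

```python
# Minimal sketch: open SSH and ping in the default security group and boot
# with a keypair, so the new instance is actually reachable.
# The endpoint, credentials and names below are hypothetical.
from novaclient.v1_1 import client

nova = client.Client("jresearcher", "secret", "my-project",
                     "https://keystone.example.edu.au:5000/v2.0/")

# Allow inbound SSH and ping on the 'default' security group.
default = nova.security_groups.find(name="default")
nova.security_group_rules.create(default.id, ip_protocol="tcp",
                                 from_port=22, to_port=22, cidr="0.0.0.0/0")
nova.security_group_rules.create(default.id, ip_protocol="icmp",
                                 from_port=-1, to_port=-1, cidr="0.0.0.0/0")

# Register a public key so the image's default user can log in.
with open("/home/jresearcher/.ssh/id_rsa.pub") as f:
    nova.keypairs.create("laptop", f.read())

# Boot using that keypair; without it (and the rules above) the instance
# starts fine but there is no way to reach it.
image = nova.images.find(name="NeCTAR Ubuntu 12.04 LTS")  # hypothetical name
flavor = nova.flavors.find(name="m1.small")
nova.servers.create("my-first-vm", image, flavor, key_name="laptop")
```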
There are a lot of different storage types in the cloud, and we've had a real user-education push to try to get people to understand that you cannot mount object storage, that the ephemeral disk is ephemeral, and that volumes are this, that and the other; that's one of our most-read docs. We have a lot of people trying to follow our documentation, but then they try to use the client tool that's in the repository of their distro and it's too old, so we point people at the PyPI repository and things get better. The S3 and EC2 APIs don't have all of the features, or have weird incompatibilities, like not being able to have a slash in your bucket name, so even people who are familiar with those APIs come to our cloud and go, "Hang on, what's going on here? This is supposed to work." Snapshots were very slow to create, and I know the code has been improving over the releases, but we pushed them a bit early on, and so many people started using them and having bad experiences that it generated a lot of tickets. People also found that the storage performance wasn't good enough; we'll talk a bit more about that later. And of course creating an image is really, really difficult, so one of the most involved support requests we get is helping people create images when things aren't going well.

So we've been live for 12 months, and thankfully we haven't had too many security incidents. We've seen a lot about open DNS servers recently, with all of that fantastic DDoS; that's probably the most common one, and we're very lucky to have all of the university security teams, as well as OSIRT, very vigilant on that, so we tend to find those very, very quickly. We've also had a couple of spam sources. We have a no-mail-servers policy in our cloud, so we just drop stuff on port 25, and we had one user who started arguing with us about our security policy being too onerous, and all of that kind of thing, after we found out that his machine was compromised and spamming the world. We had another spam source which was basically a student who got onto the cloud, started up a machine, and kind of forgot about it. And then we found one really interesting compromise, which was basically a virtual machine that was clicking on ads, just doing that 24 hours a day. That was easy to track down because it was just thousands and thousands of HTTP requests, to the point where, I think, the conntrack table filled up. That was pretty interesting.

Okay, so on the infrastructure side, and I've got to go more quickly now, sorry: the API scales out very nicely; OpenStack does its job. The underlying storage, we've had some issues there, because we hadn't had experience with this before and made a bad choice in terms of what was provisioned at the University of Melbourne node. The object storage works really, really well. It's really easy to administer, it's fantastic, the upgrades work, use it. But we need to do more to increase uptake: we think it's fantastic, but we haven't communicated that well to our users.
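Part of communicating it better is simply showing how little code a researcher needs to push results into object storage, and that it's read and written over HTTP rather than mounted like a disk. A minimal sketch with python-swiftclient follows; the auth URL, credentials, container and object names are all invented.

```python
# Minimal sketch: store and fetch a result file in object storage over HTTP
# (object storage is never mounted like a disk). Credentials and the URL
# are invented for illustration.
from swiftclient.client import Connection

conn = Connection(
    authurl="https://keystone.example.edu.au:5000/v2.0/",
    user="my-project:jresearcher",
    key="secret",
    auth_version="2.0",
    tenant_name="my-project")

conn.put_container("simulation-results")
with open("run-042.csv", "rb") as f:
    conn.put_object("simulation-results", "run-042.csv", f)

# Later, from anywhere with network access to the cloud:
headers, body = conn.get_object("simulation-results", "run-042.csv")
print("fetched %s bytes" % headers.get("content-length"))
```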
We've got every single staffing problem under the sun: not enough staff, not enough money for staff, money for staff but being unable to increase salaries, staff who don't have the skills yet. You're probably familiar with those problems, and Sam might talk a bit about operational tools later on. Of course, what we've talked about today is the infrastructure-as-a-service solution that we've provided. That's great, but it's absolutely useless for the vast majority of the 65,000 research staff in the country who aren't able to install their own operating system or don't particularly want to be sysadmins. We don't want them to be sysadmins either; we want them to be doing research. So this year we're moving very strongly towards platform- and software-as-a-service solutions. Nectar has funded those virtual laboratories and eResearch tools, which is great as a starting point, but we don't have nearly enough money to get everything done that we want to get done. So we've been focusing on developers and getting them cloud-ready, and holding developer days around the country, to ensure that the software developers working with the researchers are able to get stuff up and running. We're focusing there rather than creating an app store within an institute or within a domain; archaeologists generally want to talk to archaeologists. I'm just running through this quickly because I'm stealing all of Sam's time here. Basically we want to do that through recipes, toolkits and scripts. We use a lot of Puppet for the infrastructure, and we've found that a lot of the application people using Chef and things like that are really great at sharing things around, along with sharing virtual machine images. Hopefully we'll soon be talking about research and no longer having to talk about the cloud, because the infrastructure will just be working, and I think the sign of good infrastructure is when you don't even notice it's there. So I'm going to hand over to Sam, and I'll sit down.

Thanks, Tom. Cheers. So, Tom's been outlining what we're about, and I'm going to talk a little bit about what those challenges were and how we solved them in a technical sense. One of the biggest challenges we have is that we've got eight institutions. They're all separate, they've all got their own policies and networks and everything, and we've got to try to bring them together into a single cloud. Technically we do this with Nova cells. You might be familiar with cells; it's a kind of new concept that's coming in, and I'll talk about it a little bit later. There are obviously the political challenges as well, and that's what we have Tom for. At the University of Melbourne we run the central services, so we have your central Keystone, dashboard and Glance registry, and then at each site we have a Nova cluster and a Nova cell. They all run their own Swift clusters separately at the moment; we're hoping to get some region-to-region support there with Swift. And there are some Glance caches, really, so that the images are nice and close. One of the big things Tom has also mentioned about how we're different is that researchers don't pay for us. That's really cool, because people can just get going on the cloud without us even knowing, doing stuff and doing research, and that's really what we want. As for the things we've been developing on top of OpenStack: we try to run as close to stable as possible.
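One concrete way to see the "central Keystone, single cloud" arrangement: whichever site eventually runs your instance, authenticating gives you one catalogue with one compute endpoint. A hypothetical sketch with python-keystoneclient, with the auth URL and credentials invented:

```python
# Sketch: one central Keystone, one compute endpoint in the catalogue.
# The auth URL and credentials are invented for illustration.
from keystoneclient.v2_0 import client as ksclient

keystone = ksclient.Client(
    username="jresearcher",
    password="secret",
    tenant_name="my-project",
    auth_url="https://keystone.example.edu.au:5000/v2.0/")

# The service catalogue returned with the token contains a single Nova
# endpoint; cells hide the individual sites behind it.
endpoints = keystone.service_catalog.get_endpoints(service_type="compute")
for ep in endpoints.get("compute", []):
    print(ep.get("publicURL"))
```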
Shibboleth login: there's a federated, SAML-based identity federation in Australia called the AAF, and all our users have accounts there, so for us we just put that in front of our dashboard. Users use their own university credentials, they log in, and they're in right away. It's really good, it's really easy, and we don't have to manage anything. We've been playing a bit with geo-distributing Glance, and I think there's actually a talk coming up soon which will cover that in more detail, but we're trying to work out how that should work. At the moment we've got Glance API servers around the sites which cache images, and a central Glance registry, because we don't want people to have to make two copies of images and things like that. And of course there's cells, which I'm going to talk about soon.

We use Puppet. We're not really different in terms of using Puppet; lots of people use Puppet now. Unfortunately, when we started our cloud there were no really good Puppet modules out there, so that's why we're still rolling our own. I'd love to move over to the great ones developed by Puppet Labs, and maybe one day we'll have the time. We have a central Puppet server, and we try to get all our nodes to use the same one and use environments, so we have a good common code base for all our deployments, but we try to keep it flexible, because there are lots of different environments out there, different hardware, different network topologies, so things can get a bit complicated. And we try to copy some of the stuff that OpenStack does with its QA process, which helps us develop.

So, cells. Cells, for those who don't know, are a way to split a single Nova installation up into multiple ones. You can do this for scaling; it was originally developed by Rackspace, and they use it to scale out because they've got lots of compute nodes. With lots of compute nodes talking to one RabbitMQ and one MySQL you can get problems, so you can logically separate that out. We do that logical separation too, but at a site level, so I guess we're geo-distributing our Nova install. And here's a little graphic of what we do on the dashboard, so our users can choose which cell to launch in. This is also important because they might have some research data or something close to that cell, so they need to launch in a specific one; otherwise we just leave it up to the scheduler, and the instance will launch somewhere around the country.

So how does this work technically? Each site has a Nova cell, but we can also have multiple cells at a site, because the way cells work is hierarchical; it's like a tree structure. For instance, at the University of Melbourne we have two data centers, so we have a cell at the top, which is the Nectar cell, and that does the scheduling at a country-wide level. That comes down to a Melbourne cell, which does the scheduling within Melbourne, and then we drop down even further to each data center. It really allows each site to have their own policies in terms of scheduling: they can choose how they schedule onto nodes, and they can also have different hardware, so if we've got different flavors, like GPU flavors, some sites might prefer those. It gives a lot of control back to the sites, and that's something that's really important for us.
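Since cells are surfaced to users as availability zones (that mapping comes up again in the questions at the end), letting someone place an instance near their data looks roughly like this. A sketch with python-novaclient; the zone, image and flavor names are invented.

```python
# Sketch: pick the cell to run in by passing its availability zone.
# Zone, image, flavor and credential values here are invented.
from novaclient.v1_1 import client

nova = client.Client("jresearcher", "secret", "my-project",
                     "https://keystone.example.edu.au:5000/v2.0/")

image = nova.images.find(name="NeCTAR Ubuntu 12.04 LTS")
flavor = nova.flavors.find(name="m1.small")

# Launch next to the research data held at a (hypothetical) Melbourne cell;
# leave availability_zone out and the top-level cell scheduler picks a site.
nova.servers.create("analysis-node", image, flavor,
                    key_name="laptop",
                    availability_zone="melbourne-qh2")
```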
So we grabbed the code for cells from Chris during its early development. Rackspace were running live with the code, but it wasn't yet in OpenStack, so we took our version of stable Folsom, which is pretty much plain stable Folsom, chucked Chris's code on top of that, and then worked on improving it. Now that cells is in Grizzly, we're going to start pushing these things back; hopefully we'll have some time to get some blueprints up. If you were at the talk yesterday from Chris, he mentioned a few things that were not yet implemented, and we've done some of those. Security groups are a big one: we don't want users to have to create security groups in each cell and things like that, so we've got security groups in common. Being able to select between cells, and a lot of the things around scheduling and filtering between cells, we've done a bit of work on there too. When we have a new site coming online, we don't want them to suddenly get flooded with instances just because they've got the most capacity, so we try to ease them into production. Our goal is really to get all of this back upstream; we don't want to be managing it ourselves, so hopefully we can do that in the Havana cycle.

A quick little slide on high availability. As I mentioned, all our central services are in one place, so we really need to have those highly available. Most of the pain isn't actually with OpenStack itself; it's with things like MySQL and RabbitMQ. OpenStack itself is quite nice in terms of its architecture for scaling, for most of the stuff. We do all our packaging in-house. It's a bit slow: OpenStack is quite a rapidly moving code base, and we have to adapt to that pretty fast, and waiting for a patch to go from trunk down to the stable branch and then into an Ubuntu package (we're all on Ubuntu here) can take some time, even if it gets that far. So we spend a lot of time finding those patches, backporting them, getting them into our own packages and rolling them out ourselves. Also, with all our cells code, we need to make our own packages, and we've got a few mods to Horizon and Keystone to do some stuff with Shibboleth et cetera. We've tried to emulate a lot of what OpenStack has done in the QA space: we have our own Gerrit and Jenkins, and that has helped us push things to production faster. I really like what OpenStack does in terms of quality assurance and rolling out code, so we try to emulate that as much as we can. I love pushing things out to production as quickly as we can, and this helps us a lot.

Upgrades. Upgrades are probably the hardest thing you can do with an OpenStack cloud, I think. We actually started with a pilot cloud on Cactus, but we went into production on Diablo, and we've been live and up since then. Diablo to Essex was a lot of pain; I don't know if you've experienced that, but it was a big pain trying to keep everything running while upgrading. Going from Essex to Folsom was a little less, so it's getting better, and I'm hoping that when we look at Grizzly now it's going to be even less. One of the main things we really plan for when we're doing these kinds of things is keeping the instances running. In a commercial sense, I guess it's the APIs that are the main thing, but for us it's really those instances, because that's the researchers and scientists, you know; they need that.
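Because keeping instances running is the success criterion, the dry runs tend to involve crude watchers along these lines; a hypothetical sketch with python-novaclient and admin credentials, not our actual tooling, with all the details invented.

```python
# Hypothetical sketch: poll the API during an upgrade dry run and report any
# instance that leaves a healthy state. Credentials and the URL are invented.
import time
from novaclient.v1_1 import client

nova = client.Client("admin", "secret", "admin",
                     "https://keystone.example.edu.au:5000/v2.0/")

while True:
    servers = nova.servers.list(search_opts={"all_tenants": 1})
    unhappy = [s for s in servers if s.status not in ("ACTIVE", "SHUTOFF")]
    for s in unhappy:
        print("%s (%s) is %s" % (s.name, s.id, s.status))
    print("checked %d instances, %d unhappy" % (len(servers), len(unhappy)))
    time.sleep(30)
```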
So, yeah, a lot of database hacking, a lot of dry runs. We have a good test environment, and it takes a lot of time just to plan how to upgrade OpenStack. It's a lot of fun, actually; it's one of the times I enjoy most.

Grizzly. Some of the things we're looking forward to in Grizzly are the more operational tools. One of the things I think is lacking in OpenStack is things that make it easy to operate: just querying for users, or who's got this role, simple things like that are hard, and hopefully Grizzly can make that easier. There's also a nice patch we've got in for images in the dashboard, where images can be listed in a certain category, so we can say, look, these are the gold-starred, Nectar-approved images, and show them there, or say, these are ones that you've created, and these ones are the Wild West, use them at your own risk. That's hopefully going to help our support teams, because we get a lot of support requests about images. I'm pretty disappointed that multi-host is not in Quantum. We still use nova-network, and to move to Quantum we need feature parity with nova-network. We're also still battling with the Keystone DB migration; if you're upgrading, it still doesn't work yet.

One thing we're looking at in the future is Ceilometer. I think it's something that's going to be really cool. We really have no idea about some of the stuff that's going on in our cloud, and we're trying to plan for expansion and upgrades; we've got a lot of storage at the moment, and actually trying to nail down how much data we're pushing out of our cloud and that kind of thing is hard, so I'm hoping Ceilometer is going to help us there and do it for us. We also need this because we're running out of money soon: we need pretty graphs and things to show the government, to say, you know, we are useful, people are using us, and this is how much they're using it.

For the future: we've actually got only two sites in production currently, Melbourne University and Monash University. We're going to be putting on one of the Queensland sites in two weeks' time, and then five more to come, so it's an exciting time trying to grow that. We put the first one in two weeks ago, and there were a few teething issues, but in terms of what cells is doing, it's really good and it's working how we need it to work. Also, at the moment we don't actually provide any volumes, so volumes are something we're going to be looking into in the next round. We have lots of people that need this for EC2 compatibility: they're using EC2, their code uses volumes, and they want to use our cloud, so hopefully that's going to help them. And yeah, I think that's all I've got, actually.
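Since that EC2 compatibility point comes up a lot with researchers who have existing Amazon tooling, here is roughly what pointing boto at the cloud looks like. This is a hedged sketch with invented endpoint details and credentials; the volume call is left commented out because, as just mentioned, volumes aren't offered yet.

```python
# Rough sketch of pointing boto at the cloud's EC2 compatibility endpoint
# instead of Amazon. Hostname, port, path and credentials are invented,
# and the EC2 layer doesn't support every AWS feature.
import boto
from boto.ec2.regioninfo import RegionInfo

ec2 = boto.connect_ec2(
    aws_access_key_id="EC2_ACCESS_KEY",
    aws_secret_access_key="EC2_SECRET_KEY",
    is_secure=False,
    region=RegionInfo(name="nova", endpoint="ec2.example.edu.au"),
    port=8773,
    path="/services/Cloud")

# Existing EC2-based scripts largely keep working...
print(ec2.get_all_instances())

# ...and code like this is why volumes matter: it can only work once a
# volume service is available behind the EC2 API.
# vol = ec2.create_volume(50, "melbourne-qh2")
```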
Yeah, I thought you wanted the slide. Sweet. I think we're almost running out of time anyway, so, question time.

The service catalogue in Keystone? Ah, okay. No, so in Keystone it's just one endpoint. There's only one Nova API endpoint, really, and that makes sense. As for how users can figure out where the other cells are, we've mapped cells onto availability zones, so in EC2, if you list availability zones, you'll get a list of cells, and in the dashboard you can see the list there. Is that a question? Yeah.

The resource allocation process that we've got right now is effectively an interim process, because we're trying to hook into other national processes as well, but because we're bringing up capacity so quickly these days we tend to be very liberal. At the moment, through the Horizon dashboard, we've added a panel where, once you're logged in, you can request an allocation and say how many cores you need and all of that kind of stuff. That goes to a committee, who assess it based on its merit and the appropriateness of the amount of allocation requested, and then eventually it becomes a quota on the cloud.

Okay, yeah, so for the API, we just scale it out; it's pretty easy to scale out the nova-api service, so we have nine API servers, I think it is, and it seems to be working well for now. Keystone, with the central Keystone, you've got everything talking to Keystone as well, and that's something I think we're going to have to start looking at a bit more; we're getting to a bit of a peak there. But there's a lot happening there in terms of PKI for Keystone, so that things which keep talking to Keystone can validate tokens on their own, and also with memcache on all the hosts, that also keeps it down. So there are lots of techniques you can use for limiting your API requests.

Yep, so that's the RDSI project. Unfortunately, for the most part you'll have to go and talk to them; they've been having a lot of closed-door meetings that we're not invited to, despite the fact we're supposed to be working together. But in general it will be split up into a few different categories of funding, and there will be some common protocols between the different sites. Like us, actually, RDSI was designed to be at all of our sites, so we should have some silo of hard disks at every site. One of the things we think is going to happen is that some of the money from RDSI will be used to prop up our volume service, and we think that's a nice way to provide some persistent storage to researchers.

Yeah, at the moment at the University of Melbourne we have NFS as the local storage, and they're using block live migration. I'm not exactly sure how they're doing it, to be honest, but if you've got NFS it works fine. There are other problems with NFS, though, and because we've got all these different sites which can have different configurations, we've also got people looking at other things as well; whatever works.

So, the object storage service hasn't been so popular, and part of that's our fault, because we haven't really marketed it very well; we've been so focused on the virtual machine offering. What we're doing now is actually looking for services to put on top of object storage to make it sexy and get people in there. There's a range of things there; we saw the Savanna announcement recently, which is Hadoop backed onto Swift, and that's an example of one of the things we've been looking at. But suggestions are welcome: if you've got applications that work on object storage, we'd love to hear about them. ZeroVM? Excellent.

So, at the moment we run with no contention: it's a one-to-one mapping between virtual core and physical core, and that's working for us right now because we've got capacity. Down the track, yes, we might look at corralling the little one-core web servers into their own little bit and over-subscribing, but in general, for the science workloads, we're always going to be one-to-one; they just chew the CPU.
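To make the "eventually it becomes a quota on the cloud" step from the allocation answer concrete, the operator-side action is roughly a quota update like this; a hypothetical sketch with python-novaclient, with invented credentials and tenant ID, not the actual allocation tooling.

```python
# Hypothetical sketch of the last step of the allocation workflow: once the
# committee approves a request, an operator applies it as a Nova quota.
# Credentials, URL and the tenant ID are invented.
from novaclient.v1_1 import client

nova = client.Client("admin", "secret", "admin",
                     "https://keystone.example.edu.au:5000/v2.0/")

approved_tenant = "8d3c3c7ed7ec4a6f8c2f9f2b1a0e4d55"  # made-up tenant ID
nova.quotas.update(approved_tenant,
                   cores=1000,        # e.g. the 1,000-core allocation
                   instances=250,
                   ram=4096 * 1000)   # MB

print(nova.quotas.get(approved_tenant).cores)
```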
No more questions? Looks like it. Oh, one more. We haven't got that yet; still looking for information there. Any more? One more: KVM. KVM, on Ubuntu 12.04. No more questions, so in theory I think we can go to lunch.