Everybody, welcome to the Bending Ironic for Big Iron session. I'm going to start the session. My name is Kate Keahey; I lead a project called Chameleon, which is an experimental testbed for computer science. I'm from the University of Chicago and Argonne National Laboratory. Argonne is the place where you find the fifth-fastest supercomputer in the world, that's our supercomputer, and we're building a new one, so it's definitely a big-iron type of place. My associates Cody Hammock, here, and Jonathan Pasteur work on the Chameleon project. Pierre Riteau, who is our DevOps lead, unfortunately could not be here today; he's getting married instead, and for some reason he thought that was preferable.

Anyhow, Chameleon is, like I said, an experimental testbed for computer science. It's led out of the University of Chicago; TACC is our partner, and I understand you've heard a lot about TACC already. We've got three more partners: UTSA, Northwestern, and OSU. It's an NSF-funded project that started in the fall of 2014. We built the testbed and took it public at the end of July last year, and since then we've had 700 users and 180 very exciting cloud research projects.

What we were trying to do with Chameleon is two primary things. One: build a large-scale testbed for high-performance computing research and big data research, so we've got 650 nodes and five petabytes of storage to support big data experiments over two sites, TACC and the University of Chicago, connected with a 100G network. Two: make it reconfigurable, deeply reconfigurable, because we support computer science experiments ranging from virtualization to containers to exokernel operating systems and so forth. Hence the name Chameleon, right: it's supposed to adapt itself to your needs.

A quick word about Chameleon hardware. We primarily have racks of what we call a standard cloud unit: 42 compute nodes, which are Intel Haswell processors. Each rack additionally has four storage nodes, also Intel Haswells, but each storage node has sixteen 2-terabyte disks, so per rack you've got 128 terabytes of disk space with very fast I/O bandwidth to it. In addition, each storage rack has an SSD, so that you can experiment with storage hierarchies and things of that kind. Beyond that, we've got 3.6 petabytes of global storage, because users told us that for some users running experiments on big data, it takes a day to upload the data to the testbed; we wanted to have a permanent home for the big data they're experimenting with, so that they don't have to do that. In addition, we're bringing in heterogeneous hardware. All of what I described is homogeneous: we've got ten of those racks at TACC to create a large homogeneous partition, which is necessary for high-performance computing experiments. On top of that we're now adding GPUs, we're adding more SSDs, and eventually we're going to have Atom microservers and ARM microservers.

So that's the hardware; this slide is really just more detail on what I already said. As for how this testbed is configured to support computer science experiments, you go from resource discovery through a provisioning, configuration, and monitoring cycle. I'll say more about that, and about how we use OpenStack, in a talk today in the afternoon; right now we'll focus on what we do with configuration, and how we use Ironic to configure bare-metal resources. As I said, we support all sorts of experiments that require very deep reconfiguration: experiments in virtualization, containers, exokernel operating systems, or all of the above at the same time. Those users need bare-metal reconfiguration, they need console access, and in some cases they need access to the BIOS and so forth.
When we built Chameleon, we started out with Juno Ironic, as dictated by the date the project started. We used the PXE driver with the Ironic deploy ramdisk and partition images. This was very painful: painful for us to install, but also extremely painful for our users, because as you can imagine, with the kind of research they're doing, they very often need to boot from a custom kernel. What they had to do in order to do that was recompile the kernel, upload it to Glance, have Ironic deploy it from Glance, and then rebuild it and do the same thing again. If they wanted to reboot with different kernel parameters, it was even more entertaining, because they would have to hardwire those parameters into the kernel at compile time and then go through the same cycle: upload to Glance, deploy from Glance. This was taking a very long time, they were not amused, and we were looking for better ways of doing that.
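For a concrete sense of that cycle, here is a rough sketch, not Chameleon's actual tooling, of what one iteration looked like against the Glance v2 API with python-glanceclient; the endpoint, token, and file names are placeholders:

```python
# Sketch of the partition-image workflow described above: every kernel
# change (or even a kernel *parameter* change, since parameters had to
# be compiled in) meant re-uploading artifacts to Glance before Ironic
# could redeploy them. Endpoint, token, and file names are hypothetical.
from glanceclient import Client

glance = Client('2', endpoint='http://glance.example.org:9292',
                token='...')  # credentials elided

def upload(name, path, disk_format, container_format, **props):
    """Create a Glance image record and upload its bits."""
    image = glance.images.create(name=name, disk_format=disk_format,
                                 container_format=container_format, **props)
    with open(path, 'rb') as f:
        glance.images.upload(image.id, f)
    return image

# 1. Upload the freshly recompiled kernel and matching ramdisk.
kernel = upload('my-kernel', 'vmlinuz-custom', 'aki', 'aki')
ramdisk = upload('my-ramdisk', 'initrd-custom', 'ari', 'ari')

# 2. Upload the root filesystem as a partition image, pointing it at
#    the kernel/ramdisk pair so the right artifacts are PXE-booted.
root = upload('my-rootfs', 'rootfs.img', 'ami', 'ami',
              kernel_id=kernel.id, ramdisk_id=ramdisk.id)

# 3. Redeploy (e.g. `nova boot --image my-rootfs ...`), and repeat the
#    whole dance for the next kernel tweak.
```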
A couple of months ago we upgraded to Liberty and started looking at the tools there, which, as you know, include the agent_ipmitool driver and the Ironic Python Agent (IPA) image. Unfortunately, there was an extra dependency on Swift, and that combination also did not support the partition images that all of our users had. If you have 700 users, each user has several images, and you now tell everybody to switch to a completely different image format, that does not make you very popular. Eventually we figured out that we could use the PXE driver with the IPA image, which supported both partition images and whole-disk images. So now our operating systems users have moved to whole-disk images and love us for it, and everybody else, people doing research on security and resource management and all sorts of other interesting projects, are still using partition images. That has worked reasonably well for us.

Now, things that we would like from Ironic for our use case. First and foremost, a very important thing for us is the ability to attach Cinder volumes to bare-metal instances, because of those users I mentioned earlier who experiment on big data. If they want to use that data in an experiment, you don't want to compile it into the image, and you certainly don't want to be uploading it from an external storage system either. If they could connect to a Cinder volume, have all the data there, and also store the data they produce, that would be fantastic. Booting bare-metal instances from Cinder volumes would also be great: it would allow us to deploy them faster, and it would save us from having to snapshot them, since it would essentially mean that if something happens and the user's instance goes away, the image is already saved. Network isolation using Neutron would be very important for us, because in addition to operating systems users we've got users working on network functions; for example, we've got users who want to run experiments on name-based networking, and on different types of networking in general. We want them to be separated and isolated from each other, so that your standard IPv4 doesn't get mixed up with name-based networking.

Then there are things that are maybe a little lower on the priority list. Console access via the Horizon web interface: Cody here already has a solution for console access that some of our users are using; our solution before that was for them to email us and we would send them the output, which did not scale. It would be fantastic to have it via Horizon, so that it is like KVM; really, a lot of our wish list can be described as "like KVM." Snapshotting: we've got something we developed ourselves that works on the command line; it would be fantastic to get that from OpenStack, again via Horizon, and make it easier for users. And last on the list is changing BIOS parameters. We do have users who need that, but with 700 users and, I don't know, nine months of operation at this point, we've had only two cases where they really needed to do it, and in those cases we just did it for them and we were fine. But it's still an important thing if you're supporting deeply reconfigurable computer science research.

If you'd like to hear more about Chameleon and the ways we're using OpenStack in Chameleon, of which Ironic is probably one of the less innovative (we're doing interesting things with advance reservations and so forth), come to our talk today in the afternoon, in MR 18, and/or visit our website, always a good thing, and think about working on your next research project on Chameleon as an infrastructure supporting research experiments. We're open to everybody in academia, everybody in the labs, and everybody in industry who works with academia and the labs. So think about that.
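On the snapshotting item in that wish list: the command-line tool Chameleon built is not reproduced here, but a minimal sketch of the general idea, archiving the running root filesystem from inside the instance and pushing it to Glance, might look like the following. All names and endpoints are hypothetical, and a real tool also has to handle quiescing, excludes, and turning the archive back into a bootable image:

```python
# Minimal sketch of bare-metal self-snapshotting, in the spirit of (but
# not identical to) the command-line tool mentioned above: archive the
# live root filesystem, then upload the result to Glance. The archive
# here is an opaque artifact; rebuilding a bootable partition image
# from it is left out of this sketch.
import subprocess
from glanceclient import Client

SNAPSHOT = '/tmp/rootfs-snapshot.tar.gz'

# Archive the live root filesystem, excluding pseudo-filesystems.
subprocess.check_call([
    'tar', '--one-file-system', '-czf', SNAPSHOT,
    '--exclude=/tmp/*', '--exclude=/proc/*', '--exclude=/sys/*', '/',
])

glance = Client('2', endpoint='http://glance.example.org:9292', token='...')
image = glance.images.create(name='my-experiment-snapshot',
                             disk_format='raw', container_format='bare')
with open(SNAPSHOT, 'rb') as f:
    glance.images.upload(image.id, f)
```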
All right, next person. One second, sorry about that, we need a little tech support here; there was supposed to be a... thank you.

So, I'm Nathan Grodowitz from Oak Ridge National Laboratory, and we're here to talk a little bit about our work with Ironic and Docker. First off, I am the head of clusters and parallel file systems for a project called CADES, the Compute and Data Environment for Science. It consists of HPC resources, OpenStack cloud resources, specialized analytics resources, and shared storage. We have many goals; probably the main one, in my opinion, is that we have the support systems and technical knowledge, basically everything in place, to create very complex workflows for our users that improve their time to solution. This is the full view of CADES, and that's exactly what it's all about: basically an end-to-end solution for our users to get work done. There are some of our systems; as you can see, we have a very heterogeneous environment with lots of hardware variation, even down to specialized analytics machines.

Currently there are several HPC and cloud challenges that we have. On the user side, we have VMs allocated for scientific use that end up staying up due to how our security has to work, so it doesn't really make much sense to have them running; quite often they sit idle. These workflows also require complex environments that are often unsuited for general-use systems like our HPC assets. As an admin, we would really like to have HPC and cloud resources controlled well together; currently we have HPC and cloud resources controlled separately, and we are working to bring that down to a single pane, hopefully in a future release. HPC resources are slow to deploy and update despite their diskless nature: we have to go through manual config files, set everything up, reboot the nodes via IPMI; it is a multi-step process that it should not be. We also maintain a large number of computational VMs, and that's high-touch, which is a challenge because each one becomes a pet rather than the cattle we really want them to be.

Ironic allows rapid deployment of our diskless systems, allows us to roll back to previous versions very easily, and then we can just make changes through OpenStack rather than making changes to all the config files I mentioned before. Also, we can centrally store all these images, which is a great way to provide them to our customers, and again, we want to put our VM and HPC resources under a single pane of glass.

We also have some plans to take advantage of Docker. For deployment, we want to rapidly deploy new workflow pieces on our existing hardware: basically, we're going to take our existing HPC nodes, put Docker on them, and schedule Docker containers on those nodes rather than scheduling traditional batch jobs. This will allow us to avoid having to create special snowflake environments; where users are doing, say, large processing, they can make their own Docker containers and then run them on our HPC assets. For administration, we would really like to schedule Docker containers via orchestration rather than via the current VMs being kept spun up. We also want to really make use of our current HPC resources whose queues are sometimes not filled up, due to the fact that those queues are owned by specific users; when they're not queued up, we would like to just dump Docker containers out there and run them on the back end. And we've been looking into possibly using Docker containers to do checkpointing, so that we can have an admin-level checkpoint rather than the user-based checkpoint that traditional HPC pulls in. And for that, bring it on, Blake. Blake Caldwell:

Thank you, Nathan. So Nathan mentioned a couple of things that are specific to our environment. We want to take advantage of diskless booting, and that's one aspect we want to contribute to Ironic. Next I want to pictorially describe our motivations, why diskless booting is useful, and then how we actually went about implementing it. Start off with the normal Ironic boot process: you might have an OpenStack administrative cluster, perhaps with an Ironic conductor, that the other nodes are PXE-booting from, sending them the initramfs and the vmlinuz file. For the second stage, that's where typically Ironic will make an iSCSI connection and transfer the image over iSCSI, a QCOW2 image, for example, maybe five gigabytes in size. So you need to send it out to each one of those nodes.

An alternative would be to use NFS root, where you have an NFS server hosting a so-called golden image that all your nodes boot from. At that second stage, after the node has PXE-booted, all the information about how to access that NFS server and mount its root volume is passed to the node, and the node can then transition to booting from the NFS root partition. From this point you can see that each of those nodes still needs to pull in image data, but it's not five gigabytes times N; it's a much smaller amount, just the files that are needed for boot. That's one aspect. The other aspect is that we want to manage these images in our traditional fashion, which has worked well: this golden image is really just an untarred file system that we can go and make modifications to before rebooting the nodes. It's a very convenient way to do image management and change things on the fly.

There's a problem, though: what if we actually want to deploy this on an HPC cluster? Here's a common HPC example: three racks that could have up to 240 nodes in them, each running its own individual Linux, so all those little tiny penguins have to get the image from the central point, and if you're trying to send it over iSCSI, well, you have 240 times 5 gigabytes to risk in your network all at once. You can design your network around it, but we think there are better options that scale better, and NFS root is one example that we think is a good starting point.
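For context on what NFS root involves at boot time: the node's kernel command line has to point at the NFS export instead of a local or iSCSI disk. Below is a sketch of generating such a PXE configuration; the paths and addresses are hypothetical, and Blake's actual patch (described next) stores the equivalent metadata in Glance rather than hard-coding it:

```python
# Sketch of the PXE configuration an NFS-root boot needs: instead of
# deploying a 5 GB image to each node, the kernel mounts its root
# filesystem from the NFS server ("golden image"). The root=/dev/nfs,
# nfsroot=, and ip=dhcp options are the standard Linux NFS-root kernel
# parameters; server address and export path are hypothetical.
PXE_TEMPLATE = """\
default nfsroot
label nfsroot
  kernel {kernel}
  append initrd={initrd} root=/dev/nfs nfsroot={server}:{export},ro ip=dhcp
"""

def render_pxe_config(kernel, initrd, server, export):
    """Render a pxelinux.cfg entry for an NFS-root boot."""
    return PXE_TEMPLATE.format(kernel=kernel, initrd=initrd,
                               server=server, export=export)

print(render_pxe_config(kernel='vmlinuz-3.10.0',
                        initrd='initramfs-3.10.0.img',
                        server='10.0.0.5',
                        export='/exports/golden-image'))
```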
So how is this accomplished? There are three steps. First, as mentioned, the images rely on an external NFS server. This is not integrated with Ironic in any fashion right now; it's existing infrastructure we set up, and it's not complicated: an NFS server with different directories for each image, plus that image's metadata. The second part is that the metadata for where to access the image needs to be passed to the node somehow. One way of doing that, and here's a patch that's in review, based on the Liberty code checkout, is to store the metadata for where to contact the NFS server in Glance. The advantage is that you don't have to have one global configuration for all your nodes: you have one Ironic conductor that boots different nodes from different images, just by specifying the PXE parameters within each Glance image. That's work in progress, to move it to Mitaka and so forth and get it upstream. The third piece is a no-op deploy driver. This is something I'd like to hear about in Cray's implementation in a little bit; they faced a similar problem but went about distributing the file system in a different way. We wanted to distribute the actual file system interface, a POSIX interface, rather than a block device. What I want to say here is that we can work together to coalesce our approaches into a single driver; they are very much compatible.

At this point we're talking about booting a node, and this is useful: it's a node you could SSH in to, but it's a standardized image, so this is really useful for admins. Now the second stage: how do we make it useful to the scientists who actually want to get work done? This is where we can bring Docker in. In this scenario we have a Lustre file system up top and the same 240 nodes at the bottom, and now we want to take this image, the user environment that a scientist might have created to actually run his application, tested elsewhere, and be able to run it on these nodes. We can store it on the Lustre file system, a parallel distributed file system, where each of these nodes, just like with NFS root, is going to pull in only the files it needs, or the parts of the block device. So this is what we want to accomplish: being able to use Docker to encapsulate the image containing the actual application and distribute that to the nodes.

How is this actually accomplished? Up top is the host; at the bottom is the Lustre file system. Take the example of one image decomposed into three different layers: each of those layers is a file on the file system, and there's a Lustre mount from the file system to the host. On the host there's a driver, a plug-in to Docker, using the graph driver plug-in API that was introduced with the experimental release, so we're hoping that will become part of the next Docker release. What this allows us to do is mount those different files as loopback devices on the host. In this example we have three mount points, three ext4 mounts that reside on Lustre, and then we can use OverlayFS to merge those into a single union mount, providing a unified view for the container to actually run its file system from.
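The mechanics just described, loopback-mounting layer files that live on Lustre and unioning them with OverlayFS, can be sketched with standard mount operations. This is not the plug-in's actual code; the paths are hypothetical, and real code needs cleanup and error handling:

```python
# Sketch of what the graph-driver plug-in does under the hood: each
# image layer is a file on Lustre, loopback-mounted on the host, then
# merged into one union mount with OverlayFS. Must run as root;
# OverlayFS expects the lower layers to be read-only.
import os
import subprocess

def mount_layers(layer_files, mnt_root='/var/lib/layers'):
    """Loopback-mount each ext4 layer file and overlay-merge them."""
    lowers = []
    for i, layer in enumerate(layer_files):
        mnt = os.path.join(mnt_root, 'layer%d' % i)
        os.makedirs(mnt, exist_ok=True)
        subprocess.check_call(['mount', '-o', 'loop,ro', layer, mnt])
        lowers.append(mnt)

    upper = os.path.join(mnt_root, 'upper')    # writable container layer
    work = os.path.join(mnt_root, 'work')      # overlayfs scratch space
    merged = os.path.join(mnt_root, 'merged')  # the container's rootfs
    for d in (upper, work, merged):
        os.makedirs(d, exist_ok=True)

    opts = 'lowerdir=%s,upperdir=%s,workdir=%s' % (':'.join(lowers),
                                                   upper, work)
    subprocess.check_call(['mount', '-t', 'overlay', 'overlay',
                           '-o', opts, merged])
    return merged

# e.g. three layer files living on a Lustre mount:
# mount_layers(['/lustre/images/base.ext4',
#               '/lustre/images/mpi.ext4',
#               '/lustre/images/app.ext4'])
```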
Down at the bottom of the slide is a work-in-progress implementation of that. I'd very much like to invite other people who are interested in this to take a look at it, contribute, and let us know what different use cases it would help out with. So with that, thank you very much; I'd like to pass it on.

OK, so hi, my name is Doug Szumski. I'm based in the Bristol office of Cray, over in the UK, and I've got a few slides on how we use Ironic at Cray. First of all, why are we using OpenStack? OpenStack is the foundation of Cray's next-generation system management software, and that means we need to support two key use cases: the first is booting large numbers of nodes disklessly, and the second is flexibly provisioning diskful nodes. So the first part of my talk will be focused on Cinder integration for Ironic, similar to what Blake has already mentioned, and the second part will be about the Bareon agent for Ironic, for diskful provisioning. Just to give you an idea of the scale of some of the machines that we build: a typical compute blade contains four nodes, a cabinet can have 48 blades, and systems scale to hundreds of cabinets, so you can end up with tens of thousands of nodes.

Our Cinder integration work is based on an upstream spec developed by Satoru, Maria, and others, which is still in review. Rather than making any changes to OpenStack APIs or adding database tables, some of the things mentioned in the spec, we've configured our driver using the Ironic instance_info fields. This is a fairly simple driver: it supports booting disklessly from Cinder via iSCSI (there's no support for Fibre Channel), and we use an in-band connection method to attach to the iSCSI target. We can attach additional volumes at boot time, but we don't support doing that dynamically, and we've focused our work around dracut-based RAM disks, although any RAM disk should be compatible. And although we haven't merged our driver, we've made it available at this link if you're interested.

I won't go into this slide in detail, as these will be available afterwards, so I'll just talk about what happens in the general scheme. You start a diskless boot by calling nova boot with the boot volume parameter, and this boot volume parameter contains the ID of a Cinder volume. Nova compute then calls Cinder to set up the block device, and the Ironic virt driver, which is part of Nova compute, passes that information down to Ironic through the Ironic API. Ironic then has all the necessary information to connect to the Cinder block device. The Ironic conductor boots the target node, the node loads the RAM disk and the kernel from the TFTP server, and eventually, in the RAM disk, the node mounts the iSCSI target and pivots into the root file system.
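That flow starts with a boot-from-volume request to Nova. As a rough sketch, using the legacy python-novaclient interface (all IDs, credentials, and the flavor name are hypothetical), the equivalent of `nova boot --boot-volume <volume-id>` looks something like this:

```python
# Sketch of kicking off the diskless boot described above: Nova is
# asked to boot from an existing Cinder volume instead of a Glance
# image; with Cray's driver, the connection details then flow through
# the Ironic virt driver down to the node, which mounts the iSCSI
# target and pivots into its root filesystem.
from novaclient import client

nova = client.Client('2', 'admin', 'secret', 'admin',
                     'http://keystone.example.org:5000/v2.0')

server = nova.servers.create(
    name='diskless-node-01',
    image=None,                      # no Glance image: root comes from Cinder
    flavor=nova.flavors.find(name='baremetal'),
    block_device_mapping_v2=[{
        'boot_index': 0,
        'uuid': 'VOLUME-UUID-HERE',  # the Cinder boot volume
        'source_type': 'volume',
        'destination_type': 'volume',
        'delete_on_termination': False,
    }],
)
```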
The second part of my talk is on diskful provisioning. At Cray we've used the Bareon agent, which was part of the Fuel project. Bareon has mostly been developed by Mirantis, although there are other contributors, and we are one of those; it's really about operating system installation. In particular we use the Bareon agent, which is part of the Bareon project, and it's quite similar to the Ironic Python Agent. Some of the reasons why we've used Bareon, and we've actually extended it to support some of these use cases: we quite often have rather non-cloud-like deployments, for example booting multiple images locally on a node; supporting complex partitioning schemes, such as sharing partitions between multiple images; LVM groups; and targeting specific block devices, by serial number for example. We've also made use of rsync, so we can mount all these various partitions and then just use rsync to copy the files across them seamlessly, which means you don't need multiple images for multiple partitions. And there's support for running arbitrary actions after the disk has been provisioned.

Again, I won't go into details here, but the process is quite similar to the diskless boot, except this time the RAM disk contains the Bareon agent. Once the target node has been booted by the Ironic conductor, the Bareon agent calls back to Ironic, and the Ironic driver at this stage drops into the vendor interface and SFTPs across a provisioning script. This driver is really driven by JSON documents: you can see on the nova boot command in the top left that we're passing through a deployment config, which contains the partitioning scheme for the node and any driver actions you wish to complete afterwards. So the Ironic driver copies across a script which tells the Bareon agent what it should do to the node, and then SSHes into the node and initiates the provisioning. After the disks have been provisioned, the Bareon agent can run the post-apply actions, which could be, for example, copying over an SSH key.
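The deployment config Doug refers to is a JSON document carrying the partitioning scheme and post-provision actions. The sketch below is a hypothetical illustration of that shape only, not Bareon's actual schema:

```python
# Hypothetical illustration (NOT Bareon's real schema) of the kind of
# JSON deployment config described above: a partitioning scheme that
# targets a disk by serial number, shares a partition between images,
# rsyncs an image onto the disk, and runs a post-apply action such as
# installing an SSH key.
import json

deploy_config = {
    'partitioning': [
        {'device': {'serial': 'Z1X2C3V4'},    # target a specific disk
         'partitions': [
             {'mount': '/',     'size': '20GiB',  'fs': 'ext4'},
             {'mount': '/boot', 'size': '512MiB', 'fs': 'ext4'},
             {'mount': '/data', 'size': 'remaining', 'fs': 'xfs',
              'shared': True},                 # shared between images
         ]},
    ],
    'images': [
        {'name': 'centos-7', 'rsync_from': 'rsync://images/centos7/'},
    ],
    'post_apply_actions': [
        {'copy': {'src': 'authorized_keys',
                  'dst': '/root/.ssh/authorized_keys'}},
    ],
}

print(json.dumps(deploy_config, indent=2))
```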
In terms of scaling Ironic, I thought I'd put up a quick slide on this; we haven't actually got very far here. We've looked at booting a 128-node system with our diskless boot driver; to do that we used a single Cinder volume with multi-attach and an overlay file system. We've looked at running multiple instances of Ironic, i.e. Ironic multi-conductor, and our focus at the moment is deploying OpenStack using Kolla, so we can support a highly available and scalable OpenStack installation. Eventually we hope to scale up to on the order of 100,000 nodes by sometime around 2018. I'll now hand over to Tyler, who'll talk about testing.

So I'm Tyler Lastovich, and I'm here with Cray as well. We thought we'd give you a little bit of background on how we came up with a test system to actually test some of this work, especially the bare-metal parts. All right, so what did we do? We created a test infrastructure to validate bare-metal deployments and installations of OpenStack, and we were trying to do all of this in continuous integration. We really needed a way to validate the individual pieces that were going into the additions we were making, for both Cray products and OpenStack, so we needed to get something seamless together, and there wasn't really much out there, especially when we started this project, to test bare metal; that was a big driver for it. We used KVM to back this test infrastructure: we used a management node, essentially a VM, to run all of our OpenStack services, and we could attach that to either real or virtual, essentially bare-metal, nodes that were configured using the Ironic SSH driver. We installed all of our test suites using Ansible, so those could be used or not used on the fly, and then we automated the entire process so it could be run with one click.

The most important and interesting piece of this is our virtualization of the management node. We used a virtual management network that we created and tore down using Vagrant, and that could be attached either to other slave VMs or to real bare-metal nodes that we have on site. For the virtual VMs we have a small number that we typically create, but you can use as many as you want; with Tempest you typically have to create a good number if you want to run parallel jobs. This allowed us not to keep real resources tied up for a long time: we could just spin up the slave VMs, and these are all created just by the `vagrant up` command as provisioning steps. We also used physical hardware, both diskless and diskful nodes; we wanted to test provisioning across all three styles.

This is probably a little hard to see on these screens, but this is just a diagram of the test process that we use. The important parts are that you create the environment that has both virtual and real nodes, and you install and configure your base OS and any OpenStack projects; we did all of that using Ansible. Then you run all of your tests, which for us meant bare-metal deployment tests, Tempest testing, Rally testing, and Heat stack creation testing. When that's finished, we packaged up all the results, sent notifications out to IRC and email, and did system dumps, so all of it could be retrieved later. We automated all of this through Jenkins. In the bottom diagram there are two little snapshot icons, and those were important: in the middle, while we were actually testing, we took snapshots of these VMs so we could roll back to the pre-configuration state on the fly, so we wouldn't have to rebuild the VMs and reinstall everything every time.

Probably most interesting to you is what we would do differently if we started over. First of all, we would skip Vagrant. We found that it had quite a few concurrency issues, and we had to hack in a few patches on the fly during every test run. It can be pretty fragile and hard to diagnose at times, especially when you're running on machines that are fairly heavily loaded, and we actually had to pin a version of it just to get it to work the way we needed. The next part would be a perfect-world situation where we keep all of our package installation fully modular. That would be a lot nicer for us: you could look ahead to what the community is doing and pull it in piece by piece, and it would allow for more independent merge testing; I think that would be very good. And also, we think we could do this on OpenStack: a project like QuintupleO would be really interesting to use as a back end instead of Vagrant, so you could do all this bare-metal virtualization actually in an OpenStack environment. That's it.
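Tyler mentioned that the virtual nodes were configured using Ironic's SSH driver, which power-controls VMs over SSH instead of IPMI. As a sketch of what enrolling such a KVM-backed "bare-metal" node might look like with python-ironicclient (all addresses, credentials, UUIDs, and MACs are hypothetical):

```python
# Sketch of enrolling a KVM guest as an Ironic node using the
# Liberty-era pxe_ssh driver, which drives virsh over SSH in place of
# IPMI; this is how a test rig can treat VMs as bare-metal nodes.
from ironicclient import client

ironic = client.get_client(
    1,
    os_username='admin',
    os_password='secret',
    os_tenant_name='admin',
    os_auth_url='http://keystone.example.org:5000/v2.0')

node = ironic.node.create(
    driver='pxe_ssh',
    driver_info={
        'ssh_address': '192.168.122.1',      # the KVM hypervisor
        'ssh_username': 'stack',
        'ssh_key_filename': '/opt/stack/.ssh/id_rsa',
        'ssh_virt_type': 'virsh',
        'deploy_kernel': 'GLANCE-KERNEL-UUID',
        'deploy_ramdisk': 'GLANCE-RAMDISK-UUID',
    },
    properties={'cpus': 2, 'memory_mb': 4096, 'local_gb': 20},
)

# Register the VM's MAC so PXE requests map back to this node.
ironic.port.create(node_uuid=node.uuid, address='52:54:00:12:34:56')
```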
Hi guys, well done; that was all very interesting. I had one question which I was interested in whether any of you would like to answer, and that is: what do you do about the users that you can't always trust? If you're providing a cloud environment, normally they're well encapsulated within the bounds of the KVM environment, but if they have root privilege on bare-metal hardware, they can rewrite the firmware; they can do all kinds of things. What do you do to counteract that?

Right. So we did a vulnerability analysis early on in the project, and one thing we found is that a lot of the attacks were actually not that different from the kinds of attacks you're going to see when running virtual machines. As far as that goes, at both TACC and the University of Chicago the machines are on isolated networks, and all of those issues are worked out. The one issue that is not worked out is the one I referred to in my talk, which is isolating networks from each other. Right now, if somebody runs on Chameleon and, intentionally or unintentionally, runs something like a DNS server, things can go haywire; this is why we're interested in isolating those networks and making sure that can't happen. There's another principle at play, which is that we try not to give users access, and I'm sure it's porous in some respects, but we try not to give users access to anything that we cannot restore. So right now we're not giving them access to the BIOS, for example, because we haven't worked out yet what we're going to do by way of restoring it, and the same goes for the firmware and so forth. The thing that is interesting about us is that, to some extent, it's a race against time: the two actual security incidents we have had since Chameleon started came from a pool of machines that were running as KVM, running virtual machines, and in both cases it happened because somebody downloaded a virtual machine image from somewhere on the internet and forgot to change the admin password. That has happened to us before, running different infrastructures, and it's a relatively common occurrence, I would say. It doesn't happen with hardware images, just because there aren't so many of them. This is why I say it's a race against time: when this form of making resources available becomes more common, there will be more attacks, and right now we're working on network isolation and so forth to do something to prevent that. Anybody else want to jump in?

Just to make it clear: when we're talking about doing the Ironic stuff, we have no intention of giving users root-level privilege on hardware; we deploy trusted images on our systems. When we're deploying our images, those will be images that we have created and specifically built. Only in VMs will we allow users to have that sort of power, and even then it will be somewhat limited. Thank you.

I actually have one other question, which is that each of you is pushing the boundaries of what Ironic can currently do in some way or other. How do you find working with the Ironic project members upstream?

I find it a very receptive community. They want to hear use cases; they want to hear how Ironic is being used. They get the public cloud; that's a huge use case for it, providing bare metal as a service. But talking with them, they're excited about the HPC use case, and they're very receptive to the idea of diskless boot. I think Cray has had good luck pushing patches forward, so thanks to them for keeping on with that initiative. But we need more use cases, and if you have use cases that don't fit ours, propose those to the Ironic developers; they're excellent at seeing how they can actually make those work.
I'm actually going to refer this question to Cody, because he's down there in the trenches working with that community.

Yeah, I guess unfortunately I don't have a whole lot of interaction with the Ironic development team, mostly concentrating on the implementation side and seeing what we can squeeze out of the current state of the software. Any other questions?