All right, thanks very much for coming. We're going to be talking today about a project called Chameleon. I'm Kate Keahey, the PI of this project; it's an NSF-funded project to build an open cloud testbed. This is Pierre Riteau, who is the DevOps lead on Chameleon and works with multiple OpenStack teams on the implementation. We're both from the University of Chicago and Argonne National Laboratory, so we work with HPC and scientific communities as well.

So Chameleon is a testbed for computer science experimentation. It's a project of five different teams at five different institutions: the University of Chicago; the Texas Advanced Computing Center, which is one of the largest supercomputing centers in the country; Ohio State University, where DK Panda, a co-PI on this project, is going to be talking about fast virtualization interconnects later today; UTSA; and Northwestern University.

I wanted to start off with a little bit of the genesis of why we are building an experimental testbed for computer science. Finding a good testbed for the types of experiments we run has been something of a personal quest for me. My team has been working on infrastructure-as-a-service cloud computing for a very long time. We produced the first open source infrastructure-as-a-service implementation, called Nimbus, something very much like OpenStack except that it was released 12 years ago. I see that one of the bullets went missing, but that's okay, I'll talk over it.

So back 12 years ago, when we were building infrastructure-as-a-service systems, we wanted to make those systems available to the scientific communities that we work with a lot, and there was just no hardware that we could run on. We also wanted to run tests comparing the overhead of virtualization, the overhead of, say, running MPI programs in virtualization versus on bare metal, and we could not do that; we simply did not have a testbed at all.

After a little while of this, some people at Argonne said, hey, we've got some older computers that you can use; we can give them to you, you can set them up as a cluster or a cloud or whatever you want to call it, and you can run experiments on that. And that's the missing bullet here. That was a case of missing hardware virtualization: hardware virtualization came out around that time, and comparing the performance of virtualization on machines that were not equipped with hardware virtualization did not really make much sense. You could get much better, more impactful results on something else. So now we had hardware, but it was just too old to produce impactful results.

The next thing that happened: I said, well, we obviously can't go on like this, let me write a proposal to the National Science Foundation to get some resources for our group. I did that, and the proposal was awarded, but the programs that award hardware to individual investigators give you very little hardware. We got something like four or six machines, which we still have, and we were now able to run experiments on bare metal, but only on four nodes. So we were writing a lot of papers saying, well, here's the simulation, and we confirmed this experimentally on four nodes, but we're very optimistic that it will scale beyond that. Not great papers, in other words.
So we had a testbed, but it was too small. Then, some eight years ago, there was an opportunity for me to get involved in a project called FutureGrid, which was building an experimental testbed for computer science. The problem is that what FutureGrid eventually built was a series of virtualized clouds, and when you run in a virtualized cloud, like in Amazon for example, your results are going to be impacted by other people who are sharing the cloud with you, and they're going to be impacted by overhead from the virtualization, from the hypervisor. So that was not good either.

In this process, over many years, we found that while we can be very creative and think of fantastic computer science experiments that we could run, in practice the experiments we are able to run will be limited by our access to resources. We can think about whatever we like, but in practice we can only run the things we have the resources to run. Hence, a couple of years ago, when the opportunity arose to create a real testbed for computer science experiments, I jumped on it. We designed a testbed strategy around the following five points.

First of all, we wanted the testbed to be very large scale, as large scale as we could afford. No more four-node experiments. The testbed as it stands today has about 600 nodes, about 15,000 cores, and five petabytes of storage, all of it distributed over two sites, TACC and the University of Chicago, connected by a 100 gigabit per second network. So you can run experiments at large scale. Most of the nodes are concentrated in one homogeneous partition, so you can scale your experiments.

Secondly, we wanted the testbed to be deeply reconfigurable. You get access to bare metal nodes and can reconfigure those nodes. You can reboot from a different kernel. You get console access to interact with your experiments better. You can reconfigure things at a very deep level. So no more shared clouds: you can run in an isolated environment, and no more clouds or resources that you can only submit a job to.

Thirdly, we wanted this testbed to be connected; in other words, not just to provide deep reconfiguration capabilities, but also to be more of a one-stop shop for experimental computer science. Some of our users told us that hardware was indeed something they needed very much, but they had even deeper needs. Cloud computing is a relatively new area, and they said, what we really want are, for example, traces that characterize traffic in a typical data center, and we don't have those; there's nothing like the Parallel Workloads Archive for cloud computing. But we also want to provide appliances, or images, for users, so that when you come to the testbed you don't have to develop everything from scratch. If you want to deploy OpenStack on bare metal, you can do that: your own private, isolated installation of OpenStack. If you want to deploy MPI, you can also do that, with images that were already preconfigured for you. And finally, we're now beginning to develop tools that give you better access to the instrumentation of your experiments, and better support for reproducibility and repeatability, so that you can wrap up your experiment very easily, give those artifacts to somebody else, and they can come back to the testbed and repeat the experiment you were running.

And then we wanted the testbed to be sustainable. This is why we're building on OpenStack.
And we also wanted it to be open. The testbed is available to all US researchers and their collaborators. So if you're not in the US, but you collaborate with somebody, for example you talk to them at the OpenStack Summit, you can use the testbed.

Here's a quick tour of the Chameleon hardware to give you a little more detail about what we have. The basic building block of Chameleon is something we call a standard cloud unit. This is essentially a rack. It has 42 compute nodes; those are Intel Haswell nodes with 24 cores and 128 gigabytes of memory each. Each rack also has four storage nodes. Those are also Intel Haswells, but each storage node additionally has 16 two-terabyte drives, so essentially each storage node has very high-bandwidth access to storage. You can allocate those nodes in different ways: you can allocate just a few nodes, you can allocate a whole rack, you can allocate multiple racks, you can allocate different nodes across racks. For example, you could allocate several storage nodes and get a high-bandwidth I/O cluster. We've got ten of those racks, ten of those standard cloud units, at TACC, and we've got two of them at the University of Chicago. So at TACC we've got the large homogeneous partition that I mentioned earlier: you can run codes and see whether they scale, or you can have a large installation of OpenStack, for example, and experiment with that.

In addition to those basic building blocks, we also have a global store of 3.5 petabytes. This is global storage that is used for images, for experimental data, for any artifacts that you may want to use. And in addition to that, we also have heterogeneous hardware, because there's only so much interest in running on homogeneous hardware. The heterogeneous units include nodes with memory hierarchies: those nodes have almost one terabyte of RAM, they have NVMe, they have solid state drives, they have HDDs, so you can experiment with different memory and storage hierarchies. We also have nodes with GPUs, three different types of GPUs, 20 GPU nodes, including a cluster of 16 Pascal GPUs if you need that for some reason. We've got four FPGAs. We've got ARM nodes for low-power experiments, we've got Atoms, we've got low-power Xeons. So, at smaller scale, we've got a variety of different architectures that you can experiment with for different purposes.

Here's a quick overview of everything I just mentioned. We've got this large homogeneous partition first. Then we have the shared infrastructure, the global storage, where you can save your experimental data. And then some of the homogeneous nodes were decorated with heterogeneous elements: for example, one rack has InfiniBand, so you can run experiments with InfiniBand, compare that with Ethernet, and see what the impact is. We've got those different storage hierarchy nodes, and then we've got the various heterogeneous architectures. So this is our large scale testbed, which I think anybody in this room would be able to use.

Secondly, we needed to provide deep reconfigurability, so that our users can reconfigure the testbed at a deep level. If you think about a typical computer science experiment, you first design the experiment; then you say, well, what hardware can I actually get in practice to run this experiment on? Then you somehow provision that hardware, take temporary ownership of it, reconfigure it, and eventually you run your experiment and monitor it.
And as often as not, you go back to the drawing board and design and run another experiment.

So first, for resource discovery, what people typically want is a very fine-grained description of those resources, sometimes down to the level of serial numbers of individual components, so that if somebody exchanges a component, upgrades it for example, you can find out about it. This description, of course, cannot be allowed to age; it has to be up to date. You have to know exactly what you're running on. But it's also good for that description to be versioned, so that if the components in the testbed change and you go back to your experiment and it returns different results, you can tell at a glance whether something changed or not.

Secondly, when you provision the resources, all of us would of course like to come to the testbed and get the resources we want on demand, right when we ask for them. But if you really want to reserve hundreds of nodes, you may have to wait a little bit, because at any given time the testbed might be fragmented and other people may be running. That's when you may want to make an advance reservation: if you think ahead far enough, those people will have finished, all of the testbed will be available, and at that point you can make a very large reservation. Advance reservations are also extremely useful for sharing heterogeneous components. For example, GPUs are a very popular resource on our testbed right now, and we typically have several advance reservations, several deep, from users who are interested in using them. So, advance reservations, on-demand access, and then of course isolation is also very important: you want to run in isolated environments and not be impacted by what other users are doing. You can do that on Chameleon: you can have individual nodes, but as I said, you can also get a whole rack, where even your network is going to be safe from other users and not impacted by them.

All right, so you've provisioned those resources; now, how do you reconfigure them? Well, you'd like to have deep reconfigurability, to reconfigure the resources at a very low level. As I said, when you come to the testbed, you don't want to start working from scratch; it's good to have some base images, and preferably images that represent the applications you want to work with. When you modify those images, or appliances, you want to be able to save your work; in other words, you need to snapshot them. You also want support for what we call complex appliances: very frequently you deploy something like an MPI cluster, or a cluster with a batch scheduler on it, or an OpenStack installation. Those are images that are combined in various ways, and you don't want to have to configure the relationships between the various nodes by hand; preferably that happens out of the box. And finally, it's good to have network isolation, so that you can stand up your own DNS server without conflicting with anybody else's DNS server. Then, on the monitoring side, once you start running your experiment, you want access to hardware metrics, you want to be able to aggregate them in some way, and finally to archive them.

So this is how we built the system; this is the diagram of the implementation, and I'll start with the component that is common to all testbeds: user services. User services allow you to create accounts and to create projects on the testbed.
This is the part of the testbed that is the same for any testbed, whether you're serving resources to domain sciences or to experimental computer science. For that, we're using a system called TAS, which keeps track of user accounts. It's essentially the equivalent of Keystone, but it provides much richer functionality, more information about the accounts that you create. And we're using a request tracker for users to interact with the support team. So at this point, users are able to create an account and interact with the support team, but they are not able to do anything else. To do anything else, they need to use the discovery services, which are the yellow box, and for which we use the Grid'5000 implementation. Grid'5000 is a project from France, from Inria, that has developed a very good resource model and many tools that allow us to find out about those resources.

After that, everything is pretty much OpenStack. In other words, we use Blazar, Nova, and Swift for resource provisioning. We contributed to both Blazar and Nova in order to make those components adaptable to our needs. We use Ironic for configuration management, as well as, of course, Glance, Heat for orchestration, Neutron for networking, and Ceilometer for monitoring. Those components were combined so that we can synchronize the accounts created by our own account system with Keystone, and that happens about every three minutes: account information is pulled into Keystone. We also synchronize the resource representation that is produced from the Grid'5000 tools with Blazar every time that information changes.

All right, and now we're going to talk a little bit about the specific implementation that we contributed to OpenStack to provide the resource provisioning and resource configuration capabilities, and Pierre, who has been leading this effort, is going to tell us about it.

Thank you, Kate. So first, the provisioning. The abstraction we give to the user is a lease: a lease gives you access to some resources in the testbed, and there are two kinds of reservations that we can provide that way. One is an advance reservation, which facilitates getting a large amount of resources in the future, as Kate explained; the other is on demand, which is a special case of advance reservation, really an advance reservation that starts now. With those reservations, users get exclusive access to the resources, and that gives them isolation. We have a fine-grained way of deciding which kind of resources you want, so you can ask for different node types: you can say, I want X compute nodes and Y storage nodes, and design your experiment that way.

To implement that, we use a combination of Nova, obviously, and the Blazar project. Blazar, which used to be called Climate, was a project that was started, I think, in 2013. It was active for a couple of years and then became inactive, and at the beginning of Chameleon we saw that it really fulfilled a lot of our needs, so we built Chameleon on top of it. Since then we've worked with the community to revive it and make it an active OpenStack project again; I'm part of the core reviewing team together with other contributors from NTT and NEC, and we're seeing more and more interest in Blazar. There are several sessions this week discussing how Blazar could fulfill different needs in OpenStack.
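[Editor's note] To make the lease abstraction concrete, here is a minimal sketch of what an advance reservation request could look like against Blazar's REST API. This is not the Chameleon tooling itself; it assumes you already have a Keystone token and know the Blazar endpoint, and the endpoint URL, node counts, and dates below are placeholders. The exact fields accepted can vary by Blazar version and site configuration.

```python
import requests

# Placeholder values: substitute your own Keystone token and Blazar endpoint.
BLAZAR_URL = "https://testbed.example.org:1234/v1"  # hypothetical endpoint
TOKEN = "gAAAA..."  # Keystone auth token

# An advance reservation: ask for two bare metal hosts in a fixed time window.
lease_request = {
    "name": "my-experiment-lease",
    "start_date": "2017-06-01 10:00",   # "YYYY-MM-DD HH:MM"
    "end_date": "2017-06-01 18:00",
    "reservations": [{
        "resource_type": "physical:host",
        "min": 2,                        # at least 2 nodes
        "max": 2,                        # at most 2 nodes
        "hypervisor_properties": "",     # optional filters on node attributes
        "resource_properties": "",
    }],
    "events": [],
}

resp = requests.post(
    f"{BLAZAR_URL}/leases",
    json=lease_request,
    headers={"X-Auth-Token": TOKEN},
)
resp.raise_for_status()
lease = resp.json()["lease"]
print("Created lease", lease["id"])
```

On Chameleon itself, users would normally create such leases through the Horizon calendar panel or the Blazar command line client rather than raw HTTP calls; the sketch above only illustrates the shape of the request.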
In terms of what we did inside Chameleon, and we're going to contribute those changes in the near future, we have an extension to the Horizon panel that gives you a calendar view; that's what you can see on the right-hand side of this slide. You can see what resources are reserved and for how long, and it helps you plan your experiments. We also extended Blazar to support our usage policies; for example, we have limits on how long resources can be reserved. All of this has been integrated in Blazar, and we're designing the way this is going to be supported upstream: in Chameleon we have specific policies, but we want everyone to be able to contribute their own policies, so some kind of pluggable implementation will be designed.

So that's the provisioning. Then, for the configuration of and interaction with the resources, we really wanted to give access to the resources with a lot of control. The users of Chameleon should be able to reboot into a custom kernel, access the console to debug their recompiled kernel, and so on. And once they've built an image that they want to reuse for their experiment, they need to be able to snapshot it; otherwise they would redo their work over and over. It was also critical that we could deploy different environments, which we call appliances in Chameleon, into a single lease, because a common pattern in computer science research is to compare different environments and different scenarios and turn that into results for a paper, and you don't necessarily want a different lease for each one; so there needs to be an n-to-one mapping there. We also want to have an appliance catalog and be able to manage those appliances and make them discoverable and shareable with the rest of the community; to be able to deploy complex appliances, which are things like virtual clusters or complex software stacks such as OpenStack (obviously more complex than a hello world) or Hadoop, which require a lot of coordination between different nodes, with controller nodes, compute nodes, and storage nodes all working together to build one software stack; and finally, to support network isolation, to make sure that users cannot impact other users during their experiments.

To implement all of this we are relying on OpenStack services: Ironic, Neutron, and Glance. Ironic, obviously, for bare metal; Neutron for managing the network; and Glance to store the appliances, with some extensions in our Chameleon portal to make them discoverable. The metadata server handles some of the simple configuration of appliances, and for complex configuration we use Heat to do the orchestration. For Ironic we added snapshotting. The appliance management and catalog, as I said, are handled in our user portal, which is separate from OpenStack. The dynamic VLANs were implemented as a combination of changes to Neutron and Ironic. We don't support reconfiguring BIOSes yet, but that's in our future plans: users would be able to provide a specific configuration, like "I don't want hyperthreading to be on", and it would be enacted for them by the system.

I'll say a few words about what we did with Ironic. Ironic doesn't support snapshotting. If you go to a KVM cloud, you have a button called "create snapshot" in the web interface; you click it and you get a snapshot. This doesn't work with Ironic.
So what we did is write a small script that we put inside all our appliances. It uses the libguestfs tools to create a tarball of the entire root file system, puts this into a QCOW2 image, and uploads it to Glance; basically this gives you the same result as clicking "create snapshot". It supports both whole-disk images, which are images that have an MBR and some number of partitions in them, and partition images, which are a file system plus a kernel and a ramdisk; both can be deployed by Ironic, so we are able to support both. (A rough sketch of this snapshot flow is shown below.)

For network isolation, we use VLANs that are dynamically allocated. To do that, we configure Neutron to use the OpenDaylight plugin, and we have an OpenDaylight controller that we extended to be able to change the VLAN port tagging on our Dell switch. Ironic can trigger those updates after the deployment of the nodes, so it's able to put the resources into a VLAN that is dedicated to a specific user. This is something we did on our installation, which is based on OpenStack Liberty; in the meantime, the Ironic and Neutron communities have worked together to develop something called multi-tenant bare metal networking, which has been released in Ocata, so it's possible that we will move to that implementation in the future. We haven't evaluated it yet, so we don't know if it completely supersedes what we have.

There are still a few things we would like to have in Ironic; I'll mention three of them. One is support for multiple networks: at the moment you get only one network, so you have to share the control and data plane on the machines that you deploy, and for some experiments that's a problem. If you want to deploy OpenStack on Chameleon, for example, you have to share this same link for everything, and researchers would like to be able to use different interfaces. I know it's being worked on; there is a spec that has been approved for Ironic, so I'm expecting it will be there in the future. We also find that Ironic is not the greatest at handling failures. The problem with managing bare metal is that you get kinds of failures that you wouldn't get with virtual machines: IPMI sometimes doesn't reply in a timely fashion, and Ironic has fault tolerance mechanisms that retry, but after a few retries it gives up, the node is marked as down, and only an administrator can put it back in service. So we would like some kind of healing process in Ironic. And finally, because our users do a lot of deployments, and they might do them repeatedly to test different environments, it would be great to reduce the deployment time as much as possible; if there were support for using kexec in Ironic, we could shave a few minutes off the deployment.
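[Editor's note] As a rough illustration of the snapshot flow described above, here is a minimal sketch of the idea: archive the root file system, build a QCOW2 image from the archive with a libguestfs tool, and upload the result to Glance. The actual Chameleon utility does more than this (for example, handling whole-disk versus partition images); the commands, paths, and image name below are simplified placeholders, and the upload assumes standard OpenStack CLI credentials in the environment.

```python
import os
import subprocess
import tempfile

# Hypothetical, simplified re-creation of the snapshot idea described above.
# Intended to be run as root on the bare metal instance being snapshotted.

def snapshot_instance(image_name: str) -> None:
    workdir = tempfile.mkdtemp(prefix="snapshot-")
    tarball = os.path.join(workdir, "rootfs.tar")
    qcow2 = os.path.join(workdir, "snapshot.qcow2")

    # 1. Archive the root file system (staying on one file system so we do
    #    not pick up /proc, /sys, or network mounts).
    subprocess.run(
        ["tar", "--one-file-system", "-cf", tarball, "/"],
        check=False,  # tar may exit non-zero for files changed while reading
    )

    # 2. Build a QCOW2 disk image from the tarball using a libguestfs tool.
    subprocess.run(
        ["virt-make-fs", "--format=qcow2", "--type=ext4", tarball, qcow2],
        check=True,
    )

    # 3. Upload the image to Glance with the standard OpenStack CLI
    #    (credentials taken from the usual OS_* environment variables).
    subprocess.run(
        ["openstack", "image", "create",
         "--disk-format", "qcow2", "--container-format", "bare",
         "--file", qcow2, image_name],
        check=True,
    )

if __name__ == "__main__":
    snapshot_instance("my-custom-appliance")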
A few words about appliances. The basic Chameleon appliance is a bare metal QCOW2 image, and it's compatible between the two sites, UC and TACC. It includes several different tools. We have cc-checks, which verifies that the hardware matches what's in the resource description managed by the Grid'5000 components that Kate talked about; it lets you see when, for example, a RAM module has not initialized correctly and the amount of RAM doesn't match what you would expect, or when components have been changed by the administrator and that just hasn't been reflected yet. Then we have the snapshot utility I talked about, we have something to measure the power utilization of the CPU, we have an agent that talks to Ceilometer and exports many different metrics like CPU utilization, RAM utilization, I/O, and so on, and finally, for orchestration, we need the Heat agent running in there. Do you want to take over?

Right, so quickly, a few stats about Chameleon. Really the most important message to take from this is that we announced public availability about a year and a half ago, and today we have about 1,400 users and 200 different research and education projects.

A quick vignette of what our users are doing. Here's an example: a project by a student from the University of Pittsburgh, Yuyu Zhou. She was comparing the performance of Docker and KVM, trying to nail down the trade-offs between containerization and virtualization. What she needed was a testbed that supported bare metal reconfiguration: she needed to customize the kernel and reboot it with different parameters in order to support KVM and to capture the performance trade-offs; she needed console access to debug her setup; she needed up-to-date hardware; and she needed to run on large partitions. You can see a graph from the poster she presented at Supercomputing there; it has comparisons on 64 nodes.

Here's another project, from Swann Perarnau at Argonne National Laboratory. He's working on developing exascale operating systems; Argonne is of course one of the largest HPC centers in the nation, with a lot of research in that area. He had pretty much the same requirements as Yuyu, but in his case he followed what Pierre mentioned earlier as a very common pattern: he would reboot the operating system with different kernel parameters over and over again and try different approaches. That's him presenting a demo, and some graphs from his work. A few words on traces?
Yeah, so as Kate mentioned at the beginning, we want to have an archive with traces from cloud workloads. This is an activity we're doing in the context of the OpenStack Scientific Working Group. So far we've reviewed what traces already exist for HPC and grid computing, to see how they are used in publications today, and we're working on defining a trace format similar to those, but one that would work for clouds, and in particular for OpenStack. The idea is that we will be able to export information from the Nova database and put it into traces that can then be replayed, possibly also combined with metrics from telemetry in OpenStack, so that you know not just what instances were deployed but also what kind of workload was running inside those instances. For the replay, we're looking at things like Rally, and there is something from the OpenStack Innovation Center to generate workloads. And we would welcome it if you want to contribute your traces.

So traces are one of the ways in which we're trying to build a connected instrument, an instrument that provides more of a bridge to research. Another thing we're trying to do: at this point we have the core capabilities of Chameleon, so we're providing for the basic need of "I need to run my experiment somewhere", but a fuller-fledged scientific instrument allows you to observe and measure the relevant phenomena. So what does that mean? It turns out that everything you do on the testbed is recorded somewhere in the logs: when you provision resources, it's recorded; when you deploy an appliance, an image, what you're deploying is recorded; your monitoring information gets recorded. It's currently captured in the testbed, but it's not provided in a very intuitive fashion. So what we're trying to do is take all that information from the testbed, consolidate it, filter it to the needs of a specific user, and give them an experiment summary that they can integrate with something like a Jupyter notebook or Grafana.

For example, right here, let me see if I can start it from here, we've got a quick screencast that shows you the time series of various information produced by your experiment. This experiment happens to be about power measurement: somebody is stressing the system and measuring the power, and you see the measurement information we get from the system, as well as CPU utilization, memory utilization, and so forth. You can easily pull up those graphs without having to instrument anything and get that information from the system. And you get a record of everything you've done during the experiment; you can record it like a screencast by pushing a button: this is where I started the experiment, this is what I ran next, and so forth, and you can pull up the information about what the load on the system was at any time. I'll move on in the interest of time. So you can get those displays, which of course shortens the time from the data you're producing to insight, but most importantly, this information is recorded, and now you can replay it, or give it to somebody in a form that allows them to replay your experiment very easily. Perhaps they are trying to improve on some measurement or some algorithm that you developed; they can take your exact experimental setup and, in that setup, run their own work.

So, a few words about who can use Chameleon. As I said earlier, any US researcher or collaborator, where collaboration is defined very, very broadly.
It could be as easy as talking to somebody at a conference like the OpenStack Summit, for example. There are various allocations that projects get, and various limits on usage, that allow us to provide fairness to all users. I'll skip over the next slide.

A quick summary: what we have provided is an open experimental testbed, so, again, probably everybody in this room can use it for computer science research. The things we specifically wanted to provide are large scale and deep reconfigurability, as well as a sustainable operations model, and from that perspective our adventure with OpenStack, using OpenStack and also contributing back to OpenStack as part of the Blazar project, has worked out very well for us. We're really happy to be working with the community, and that now allows us to turn our attention to providing more of the connectedness dimension of the instrument and giving users a faster pathway from data to insight.

One last thing I wanted to do is draw your attention to something Winston Churchill once said about buildings: we shape our buildings, and thereafter the buildings shape us. He said that about the House of Commons, which was destroyed during the Second World War; when they were rebuilding it, he argued for it to be rebuilt in the form in which it was originally conceived. So, for example, it's a very small room, everybody is within shouting distance of each other, and that eliminates the need to go out to the dais and talk into a microphone and so forth, and it allows for a much more spontaneous, blow-by-blow conversation. If you ever watch the discussions that happen in that building, that actually does happen; it's very spontaneous. So it's an example of how shaping a building shapes the character of political debate. And if that's true of buildings, it's even more true of testbeds. Going back to what I said at the beginning of the talk: the experiments we can conceive of are essentially unlimited, but in practice we are limited to what we can run. And it's not even strictly true that our imagination is unlimited, because eventually our creativity, our ability to innovate, adapts to what we have available to run our experiments on, and it becomes stunted. If I have only four nodes to run my experiments on, I'm not going to conceive of experiments that need 400 nodes, because I just know I won't get enough resources to run them. So eventually you start dreaming small, where we should all be dreaming big. And that's what we're trying to do with this testbed: expand people's ability to run experiments, expand their imagination, and allow all of us to dream big. So that's what we have for today, and I will take questions now. Any questions?

It's nice to see how things are progressing. I have one question regarding the archive you would like to provide, which is something I've been looking for for a while. I'm not sure whether I correctly understood whether it's already in production in Chameleon. Are you already collecting all the traces right now, or not?

Are you talking about the workload traces, or are you talking about the monitoring information within Chameleon?
Let's say the cloud traces.

So, we are currently working with the Scientific Working Group to collect those traces from other infrastructures. Collecting them from Chameleon is going to be a first step; making them available hasn't been done yet. At this point we have a data format that we're going to be collecting those traces in. But also, Chameleon, like Grid'5000, is a very unusual testbed, so those traces are not going to be as representative as traces from other infrastructures might be. I think Pierre can probably say more about this, right?

Yeah, so for example, Grid'5000 has traces in one of the grid workload archives, but with a big caveat that it's not a typical grid or cluster. So similarly, I think we'd like to have cloud traces from infrastructures like Nectar or CERN, which you can really consider to be production.

Okay, so is that currently planned, will Nectar provide some traces?

Yeah, we are talking with them, so I'm hopeful that it will happen.

And by all means, feel free to join the discussion, because the format is important as well, right? So thank you, thanks.

No questions? Any other questions? Okay, on that note, thanks everybody very much for coming. Thank you.