Okay. Hello, everybody. Thanks very much for coming to the Chameleon session. I'm Kate Keahey. I'm from Argonne National Lab and the University of Chicago. Both fantastic institutions: Argonne National Lab, home to the fifth fastest supercomputer in the world; University of Chicago, one of the top 10 universities in the world; and we've got something called the Computation Institute that spans both institutions. Other than that, I also lead a very exciting new project called Chameleon.

So Chameleon is a project in experimental computer science testbeds. You know that different sciences have different experimental devices. They have telescopes, microscopes, other scopes. The question is, what do computer scientists have? How can computer scientists run and operate their instruments, their experiments? We decided that a computer science experimental infrastructure looks very much like a gecko with a curly tail, and so we named it Chameleon. It's a project of five partners: the University of Chicago partners with TACC, those are the two resource hosting institutions, and in addition we're also partnering with Northwestern, Ohio State University, and UTSA. The project is funded by the National Science Foundation.

So a few words about how we decided to design Chameleon, how we decided to design this experimental infrastructure for computer science. The first point of our strategy was to make it large-scale, because we know that a hot topic in cloud computing right now is how to marry HPC and cloud computing. Big data is another thing that people like exploring, and most of all, we wanted computer science experiments to scale from very small ones to potentially very large ones. So accordingly, we decided to buy as large a testbed as we could afford, and that means 650 nodes, almost 15,000 cores, and five petabytes of storage distributed over two sites connected with a 100G network, storage designed to help with big data experiments. And you notice immediately that we scale in terms of computing and we scale in terms of storage, but we don't scale in terms of the number of sites. You can't scale in every possible dimension, because then you don't scale in the financial dimension. But we essentially designed the testbed to complement another testbed called GENI that already exists and has something like 50 sites. So people who want to experiment with networking, with highly distributed computing, can use GENI, which has very small sites, but many of them, and people who want to experiment with HPC and big data can use Chameleon, which has only two sites, but very large ones.

The second design strategy point was for this infrastructure to be deeply reconfigurable. Like its namesake, Chameleon should adapt itself to the experimental needs that you have, right? And it should be as close as possible to what you have in your lab. You should have detailed information about the resources. You should be able to provision them at a fine granularity. You should be able to reconfigure them, power them on and off, have root on them, of course, reboot them, have access to the console, to the BIOS if necessary, and so forth. And this is necessary in order to support isolated, repeatable, and reproducible experiments, right? If you run your research on the Amazon cloud, for example, you are in a multi-tenant environment and you never know how much something that somebody else is doing is influencing your experiment, right?
Even if you're in a single-tenancy situation, you still get interference from the hypervisor, right? So it was very important for us that our users, when they get to run on Chameleon, get access to bare metal hardware.

A third point, a third characteristic, is that we wanted the testbed to be connected. There is nothing like the Parallel Workloads Archive for cloud computing; the Parallel Workloads Archive is where people keep traces that characterize the load on HPC machines, right? Nothing like that exists right now for cloud computing. Google made some traces available, but on a somewhat ad hoc basis, and it's hard for people right now to validate their cloud computing research, right? Because they don't have access to something that represents a typical cloud workload. So we partnered with people running production clouds at CERN and in the Open Science Data Cloud, which is a fantastic biology-targeted cloud at the University of Chicago, with Rackspace and Google and with others, to get those traces from them so that researchers could have an easier time constructing their experiments. You go to one place, you can get resources, you can get traces, you can construct your experiments more easily. And of course, sharing appliances, which is a partnership with users, and we already have some users who contributed appliances to Chameleon, which makes it easier for others to build new algorithms and new methods on top of the frameworks that they developed. Of course, complementary, as I already said, complementary to GENI, and sustainable. In other words, easy to maintain, easy to share, and I'll go back to that later.

A few words about Chameleon hardware. Most of our hardware is composed of what we call a standard cloud unit. It's essentially a rack composed of 42 compute nodes, each an Intel Haswell processor, and four storage nodes, which are also Intel Haswells, but each storage node in addition has 16 two-terabyte disks. So if you add it all up, per rack that gives you 128 terabytes of storage space. And if you add it all up across the racks, that's 1.6 petabytes. In addition to that, we also have 3.6 petabytes of global store where users can put their experimental data so that it's already there, available for them to run their big data experiments. We've got 10 of those standard cloud units at TACC, which all together form a large homogeneous partition where you can run scalability experiments. We've got a couple of those at the University of Chicago, and one of those racks has InfiniBand on it, so that you can run experiments with Ethernet and you can run experiments with InfiniBand. As time goes by this year, we're gonna be acquiring additional heterogeneous hardware. We're gonna be acquiring GPUs and putting them into racks, so you can run with GPUs or without GPUs. We're gonna be buying more SSDs. We already have SSDs on the storage nodes, but we're going to buy more and different ones, so that you can experiment with different storage hierarchies. So it's all designed to be homogeneous from one point of view, so that you can run scalability experiments, but heterogeneous to a smaller extent from another point of view. In addition to all those things, we will also buy some non-x86 nodes, and those will be Atom microservers and ARM microservers, so that people can experiment with those as well, but those will be in much smaller quantities.
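To make those storage figures concrete, here is a quick back-of-the-envelope check. The rack count below is an assumption inferred from the numbers mentioned in the talk (ten standard cloud units at TACC plus a couple at Chicago), not an official configuration.

```python
# Rough check of the per-rack and aggregate storage numbers quoted above.
# Values are assumptions inferred from the talk, not official specs.
storage_nodes_per_rack = 4
disks_per_storage_node = 16
disk_size_tb = 2
racks = 12  # ~10 standard cloud units at TACC plus a couple at UChicago

per_rack_tb = storage_nodes_per_rack * disks_per_storage_node * disk_size_tb
total_pb = per_rack_tb * racks / 1000

print(f"per rack: {per_rack_tb} TB")     # 128 TB, as quoted
print(f"across racks: ~{total_pb} PB")   # ~1.5 PB, close to the 1.6 PB quoted
```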
So here is, essentially, a slide that describes all the hardware that I just talked about, and here is one slide that tells you what research we expected to support on Chameleon. If you look at the bottom level, the lowest end of the spectrum, we expected to support research in virtualization, operating systems, things that require a lot of control, and typically people who do that research also have the skills to develop their own bare metal images with interesting operating systems, experimental operating systems, and so forth. On the other end of the spectrum, we wanted to just run a plain KVM cloud, an OpenStack KVM cloud, because this is very much in demand from people experimenting with new applications, educational projects, and even people who are just developing, maybe, elastic scaling algorithms, resource management algorithms, that sort of thing. So here they have a ready-to-go virtualized cloud. In the middle, we have something for that last category of people: once they develop their resource management infrastructure or their elastic scaling infrastructure, they can drop down a level and deploy OpenStack, their own version of OpenStack that nobody else is using, and run their algorithms, run their experiments, in an isolated environment where they don't get interference from other users. So those are the three categories of users that we were seeing, and three categories of skills from those users.

So as we were talking to users and asking them what capabilities an experimental testbed should have from their perspective, we came up with a description of the experimental workflow from the perspective of the user. If you think about it, if you want to do a computer science experiment, the first thing is you design the experiment, right? Maybe you want to experiment with cache hierarchies or something like that. So you design this experiment and then you say, in order to validate my new algorithm, there's a certain type of hardware that I'm going to need, right? So for different storage hierarchies, you're gonna need something which has, let's say, storage available at different bandwidths or with different latencies or something like that. So what can I find that is similar to the model that my new algorithm is based on? So you discover those resources, and here we found that users really want fine-grained descriptions, sometimes down to the serial number of individual components, because it's very important to know: if somebody changes the disk, upgrades the disk, or swaps it for one with a different power signature, your experiment suddenly is going to return different results, and you won't know why unless you know that this is a very different component, right? The description should be complete and up to date, which implies an automated update of that description, right? If it's not automated, then human error comes into play and so forth, and the description is no longer strict. And here's the coolest thing: versioned, right? So that you can always say, I ran on this version of the hardware, because while understanding that, oh, the serial number changed or something like that is all great and fine, you don't want to go through that level of detail every time you run your experiments. And so this version now captures firmware changes, it captures hardware changes, all sorts of things. I was, for example, surprised that since we went public at the end of July last year, we've had 20 different versions of the testbed, right?
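To make the idea of a fine-grained, versioned resource description concrete, here is a hypothetical sketch; the field names and values below are illustrative assumptions, not the actual Chameleon or Grid'5000 schema.

```python
# Hypothetical sketch of a versioned node description and a helper that
# reports what changed between two versions of the registry.
# Field names and values are illustrative, not the real Chameleon schema.
node_v19 = {
    "uid": "c03-07",
    "testbed_version": "19",
    "processor": {"model": "Intel Xeon (Haswell)", "cores": 24},
    "memory_gb": 128,
    "disks": [{"device": "sda", "size_gb": 250, "serial": "S1234567"}],
    "bios": {"version": "2.2.5"},
}

node_v20 = dict(node_v19, testbed_version="20",
                bios={"version": "2.3.1"})  # e.g., a firmware upgrade

def changed_fields(old, new):
    """Return the top-level fields whose values differ between two versions."""
    return [key for key in new if old.get(key) != new[key]]

print(changed_fields(node_v19, node_v20))  # ['testbed_version', 'bios']
```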
And if, in the paper, you can say I ran on testbed version such-and-such, then people can understand better in what context your experiments were run, right? And they can get an excruciatingly detailed description of the hardware and the firmware that came into play. And then, finally, verifiable, right? What if somebody, the previous user let's say, changed the firmware on you? You want to find that out, right? You want to find that out and you want to change it back and make sure that you're running experiments on the resources as advertised.

So once you discover the resources and decide those are the resources I need to run on, you need to provision them. And here it's important that they are provisioned in an interactive fashion, right? Ideally we would get resources on demand when we come to the testbed. But if you want to work on many, many resources, if you in fact would like to work on the whole testbed, on hundreds of nodes, chances are that when you come to the testbed people are already running, right? Somebody's already running and you just can't boot them off just because you want to go. But if you think about this far enough ahead, chances are that you will, in fact, be able to reserve the whole testbed. So advance reservations were very important for us and, of course, isolation between users.

And then once you have the testbed, you want to configure it. And here, as I already said, it's very important that we have access to bare metal reconfiguration, that it is deeply reconfigurable; in other words, access to the console. Many of our users do need that. Also, that you can map multiple configurations onto the same reservation, because many computer science experiments look like, you know, change something, repeat your experiments, change something in the environment again, repeat your experiments, right? So it's very important that you can do that, that you can snapshot your work, and that you can easily deal with complex appliances. And then finally, you've configured something, you're running it, and now you want to monitor it. And here the most important thing is you'd like to have access to various hardware characteristics, to monitor various hardware characteristics that normally are not available to you.

So we looked at all of this. We had interviews with about 20 teams prior to writing the proposal for developing this testbed and developed those requirements based on them. And then here is a graphical representation of what we started out with, right? It's an empty page. This is exactly what we started out with. And it was an enormous opportunity, because we knew what kind of testbed we wanted to develop and we were not bound to use any specific technology; we could research and use whatever was best for our needs. And as I already said, we went through this requirements stage when we interviewed different research teams. As soon as the project started, we formulated the architecture, and then we started looking at different technologies: what could we use to build this testbed? And there were many different proposals. Most of them did not survive a close encounter with the architecture. And then we whittled it down eventually to OpenStack and something called Grid'5000, which is an experimental testbed in France that had most of the characteristics that we wanted. And OpenStack had the advantage that, long-term, we could have a sustainable solution where we work with a large community that we contribute to and that helps us develop our testbed.
However, there was a lot to develop up front. So it represented a very high short-term risk but lower long-term risk. And Grid'5000, on the other hand, was a very low short-term risk but a higher long-term risk, because it is a much smaller community. So eventually we decided to pick OpenStack to implement what I just described. Once we started the implementation, the first code started in January of 2015, it took about three months to our technology preview release. So really, using OpenStack, we managed to implement this testbed within just three months, which was surprising to us. The infrastructure that we put on top of the testbed we ended up calling CHameleon Infrastructure, or CHI for short. And it's about, those numbers are, of course, a little bit taken out of a hat, but we figured it's about 65% OpenStack, 10% Grid'5000 technology, and 25% our own special sauce, which is integration, implementation of snapshotting, contributions to Blazar, and so forth.

So here is a breakdown of those four different aspects of the testbed that I was talking about earlier. As far as discovery, on top here you've got the requirements, the requirements that I already discussed, and we used Grid'5000 technology for that because they had a really well-thought-through and extremely detailed resource representation for the testbed, developed really with experimental science in mind. We had to, of course, develop our own portal representation of that information. We had to develop a mapper which took this representation and generated something that Nova could understand, because it worked with Nova on the other side. So we had to do some work here, but essentially it's something that worked for us. Also, we used Grid'5000 checks for testbed verification. So that's something you run after you get your testbed allocation; you run this to make sure that the resources you got are as described.

For provisioning, we wanted something that had advance reservations. We looked at Nova and it did not have advance reservations, it only had on demand. But fortunately, OpenStack already had an incubated component called Blazar that we started working with, made contributions to that component, and essentially developed whatever it is we needed to develop in order to deploy it and have our users work with it. So, a definite contribution there. You see on the Gantt chart on the side what it looks like when you come to the testbed and there's this orange reservation there, another user running, but you want to reserve the whole testbed; you've got time on the x-axis, so you have to look a little bit ahead in order to do that.

Configure and interact. We of course wanted to work with bare metal, and for that we started working with Ironic, and then we wanted to allow deeper configurability, so we made our own venture into console management, but we're also looking forward to working with the community on that and making sure that everybody's on the same page as far as what those capabilities should look like. We added snapshotting, our own implementation of snapshotting, which is currently available from the command line; it was on the wish list I brought to the Ironic session today, because support from the OpenStack community would make this much easier for our users. And we're developing our own appliance management tools that manage OpenStack images, generate them, manage complex appliances, and so forth. We also have an appliance marketplace that users can contribute to.
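To make the advance-reservation step described above concrete, here is a rough sketch of what requesting a lease through the Blazar client might look like; the exact command name, flags, and date syntax varied across Blazar/Climate versions, so treat the specifics below as assumptions rather than the documented Chameleon interface.

```python
import subprocess

# Sketch of an advance reservation request via the Blazar CLI (flags are
# assumptions based on the lease-create syntax of that era; check
# `blazar help lease-create` on the actual testbed).
subprocess.run([
    "blazar", "lease-create",
    "--physical-reservation", "min=8,max=8",   # ask for 8 bare-metal nodes
    "--start-date", "2016-05-02 09:00",
    "--end-date", "2016-05-06 09:00",
    "scalability-experiment",                  # lease name
], check=True)
```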
And then finally, instrumentation and monitoring. Not much to say there; those are the same requirements that I presented before, and we're using Ceilometer as the aggregator of the monitoring information, and then we're writing our own drivers for the different types of monitoring information that users will need.

So here is a little bit about the Chameleon timeline and status. The project, like I said, is an NSF-funded project that started in October 2014. TACC and the University of Chicago had earlier partnered on a project called FutureGrid, which was essentially configuring virtualized clouds, and so we very quickly configured a virtualized cloud on old hardware that we had, as a bridge for the FutureGrid community. So that was available still that same year. And then we started developing Chameleon in January 2015, and at the beginning of April 2015 we made the technology preview of our experimental capabilities available. And then, between April and June, we received the new hardware, built it, put our implementation of the experimental infrastructure on the hardware, and at the end of July we went into public release, made Chameleon available to users. So today we've got over 700 users and we've got over 180 projects. And this year, one of the major capabilities that we're planning to offer to our users is more heterogeneous hardware.

So what do all these users do, the 700-plus users? There are many, many projects. I picked five highlights of the projects, of the research projects people are doing. The first project is from the University of Pittsburgh. Yuyu Zhou is a student on a research team that develops lightweight virtualization for HPC resources. So there's this debate in the community right now about whether to use virtualization or containerization, right? On one hand, virtualization gives you more features than running containers. On the other hand, containers are much faster. So Yuyu was doing a project where she was comparing containers and virtual machines, Docker and KVM to be specific. You've got a graph of the performance comparison and how it scales. In this graph she scales to 64 nodes, but I know that she has graphs that go up to 256 nodes, which is of course possible on Chameleon. And then, below, Yuyu proudly presenting her poster at Supercomputing 15. So what did she need in order to make those experiments work? She needed, of course, bare metal access, the power on, power off. She needed to be able to deploy custom kernels, so she needed console access to debug them. She needed up-to-date hardware, right? It doesn't make sense to do this performance comparison if the hardware is not up to date. And she really needed large-scale experimentation, right? Because if you look at the graphs comparing Docker and KVM on one node, the performance is almost the same, right? It is the scale that brings out the differences.

So here is another project: exascale operating systems. This is a project developed at Argonne. It's part of the Argo project that develops an operating system for the next scale of resources. And if you want to know why, just imagine that you put on 50 pounds and you're trying to fit into the same clothes, right? If we get an exascale supercomputer, we need to develop an operating system that takes advantage of the new architecture and can run things on it in a very fast manner. So what they needed from the testbed was bare metal reconfiguration.
They needed to boot kernels with different parameters, essentially do a parameter search over the space in which they were developing. They needed to be able to reconfigure things very fast, so they were throwing many different images repeatedly on the testbed. And they needed hardware performance counters, many-core nodes, and so forth. If you want to find out more about the research, there's a paper that they recently published on system-wide power management with Argo. And you see Swann Perarnau, who was running the experiments, showing a demonstration of his system at Supercomputing last year.

So at Argonne, of course, we do have a supercomputer or two knocking around the lab, but not every institution is so fortunate. In particular, one of our users here is doing research on cloud security, classifying security attacks, and they come from a very small school, the University of Arkansas at Pine Bluff. It's a very small school, a historically black college. And what we're trying to do is give resources to those schools. That's the whole point of building a national resource that people can share. And it levels the playing field for small schools that have the talent and the grit to come up with interesting research results and then compete well, or compare well, with richer institutions that have many resources. So their research was classifying cybersecurity attacks, and their testbed requirements were essentially an easy-to-use OpenStack installation, so that was our KVM virtualized OpenStack installation, and access to the same infrastructure for multiple collaborators. So there's collaboration on this infrastructure from many, many schools.

And then for the next project, we were fortunate in having one of our users right here. So I would like to introduce Paul Ruth, who will tell you about what he's doing with federated networks.

Yes, so I'm Paul Ruth from the Renaissance Computing Institute, called RENCI, at UNC Chapel Hill. And the work that we do there is to federate networked clouds for domain science. So that's a bunch of words, but what does that mean? As an example, our project is mostly under this project called ExoGENI, which is part of what Kate mentioned earlier, the GENI project. And that is a federation of clouds that has a wider reach but smaller individual clouds. So what we have there is about 20 small OpenStack installations all over the world; we're in three or four countries on three continents. We can get compute, network, and storage from these different OpenStack installations, but our special sauce is that we consider the dynamic circuit providers that sit between these racks to be another type of cloud provider, except they're providing layer 2 circuits between our racks. So what you can do with ExoGENI is you can get compute, network, and storage in various places all around the world, but we can also give you a layer 2 network that connects your resources. So this is great, it's very powerful; we have a full, wide-reaching footprint. And what we want to do with Chameleon, where we're somewhere between being a user of Chameleon and helping them develop pieces of Chameleon, is to be able to deploy on top of Chameleon new ExoGENI sites that still have this power to stitch layer 2 networks with the rest of the ExoGENI federation. And this requires a lot of thinking about how to use Neutron and the different networking capabilities of OpenStack. It doesn't quite fit our model, so we're working with Kate to make this happen.
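As a rough illustration of the kind of Neutron configuration this stitching work touches, the sketch below creates a network bound to a specific VLAN segment on a provider network; the physical network label and VLAN tag are made-up placeholders, not values from ExoGENI or Chameleon.

```python
import subprocess

# Sketch: create a Neutron network pinned to a particular layer-2 (VLAN)
# segment, the sort of primitive involved in stitching to an external circuit.
# "physnet1" and VLAN 3501 are illustrative placeholders.
subprocess.run([
    "openstack", "network", "create", "stitched-net",
    "--provider-network-type", "vlan",
    "--provider-physical-network", "physnet1",
    "--provider-segment", "3501",
], check=True)

# A subnet would then be added and instances attached to "stitched-net"
# so their traffic rides the externally provisioned layer-2 circuit.
```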
So this is what we call stitching of layer 2 networks, and the target here is HPC, because that's kind of what we do. So we need InfiniBand or SR-IOV and MPI, and many cores, and all these things that all the HPC people have been talking about. So here's a picture of our group; there are a couple of people on there who have moved on to bigger and better things, but they've done a lot of work, so we wanted to include them in the picture. And that's about all I had. If you want any more information, come find me later.

Okay, thanks very much. Again, one last project is teaching cloud computing. This is a project by users from the University of Arizona. And as I'm sure everybody is familiar, there are many scientific applications right now that have discovered that what they used to run on a laptop, which used to take days, can, if the application is embarrassingly parallel, be run on resources provisioned on demand, on multiple resources, and what used to take days all of a sudden takes hours, or even minutes. So it turns out that one of the researchers at the University of Arizona had such an application, and the challenge now was to build a harness that would take those cloud computing resources and run the application on them. The application, by the way, is looking for exoplanets; those are planets that orbit stars other than the sun, the sun being a very special star. So one of his colleagues in the computer science department said, why don't I organize a class around this? This will be a class project. And later on, once we have this application-specific infrastructure that runs this application on cloud computing, we can share it with others. But the question now was, where do we find those resources? And very easily, they could just come to Chameleon and run it on our KVM cloud. So the students that you see in the picture here developed this infrastructure on our KVM cloud. And again, what they needed was easy-to-use infrastructure-as-a-service with KVM. And in the students' case in particular, the ease of use is something that needs to be emphasized; if it's too hard, then all of a sudden the professor has to spend a lot of cycles explaining things to them. Minimum startup time, right? Minimum startup time in terms of how you use the infrastructure and how you spin up the individual users. Support for distributed workers, and also storage, important storage, because this particular application, of course, is a data mining application, so they need to store the data somewhere.

So those are the applications. Now, what we have in the pipeline: our year-one theme, for the first year of the project, was let's make this testbed possible. Let's build the testbed. And we built the testbed. We deployed the hardware. We developed the software to make it run, to give our users the capabilities that they wanted. Our year-two theme is let's go from possible to easy. Let's make things easier for users. And that means let's give them all the missing functionality, but let's also develop ways of working with appliances that will make it easier for them: easier snapshotting, easier sharing of appliances between various different groups, an appliance marketplace where different appliances are available and a user can build on top of them. But users can also easily submit their own appliances. So we had a contribution, for example, from the Barcelona Supercomputing Center.
They developed an infrastructure that elastically provisions virtual machines across multiple clouds, and they packaged it all up into a Chameleon appliance contributed to our marketplace, so that now others who want to develop things on top of COMPSs, which is what the infrastructure is called, can simply take that appliance and develop with it. So, all sorts of things that make it easier. Also a tool called Experiment Blueprint that allows you to repeat your experiments very, very easily, repeat them and reproduce them; that will also allow others to take your experiment and run it in exactly the way that you had run it. So this is all we're planning for this year; we're in the middle of developing those things right now. Ideally, the year-three theme would be going from easy to efficient, but my prediction is that making it easy will take us a little bit longer than that. Functionality that we still need includes, first of all, networking capabilities. Paul alluded to that; there's the stitching problem. In a presentation earlier today I was talking about dynamic VLANs that we would like to have in order to isolate users from each other, which would make a lot of things easier. Different instrumentation agents, right, that send very sophisticated hardware-level instrumentation to Ceilometer, better allocation management, and all sorts of other features.

So a few parting thoughts. First of all, just to summarize, what we developed here is a scientific instrument for computer science experimental research, right? So think about it as a telescope, a microscope, a scientific instrument for experiments. Secondly, work on your next project at chameleoncloud.org, right? So come to chameleoncloud.org and see if you can get an account. We support all academic research and also all research from the labs. International collaborators and industry collaborators of academia and the labs are welcome, right? So very, very inclusive. And of course, we're always looking for ways of making it sustainable. So if you need Chameleon and can't make it fit within those allocations, within the umbrella that NSF has funded, please talk to us. We would like to support every single user that we can.

Another thing is we went from vision to reality with really express delivery. Within three months, using OpenStack, we were able to develop new experimental tools, tailor-made to the capabilities that the users told us they wanted. We ourselves were astonished how quickly that really was possible. And on a shoestring: there aren't that many people working on Chameleon. We wanted to put as much money as we could into hardware, but you've got Cody Hammock and Jonathan Pasture here; I don't know if anybody else from Chameleon is here, but certainly feel free to talk to them. And so we have, at this point, an operational testbed, 700-plus users, 180 projects, more than that. And the last thing I wanted to say is that sustainability was a very important design criterion. There are other experimental testbeds using other infrastructures. What sets us apart is that we started with this blank page that I showed before, and we wanted to build the testbed as an application of cloud computing, because infrastructure already exists that supports that mode of use. The benefits are for us, because we are leveraging all the fantastic work that is going on in OpenStack.
There are benefits for the broader community, because we contribute to projects like Blazar and many other projects, contributing new capabilities and bug fixes. And it's also for potentially other testbeds, other people who would want to deploy the Chameleon infrastructure for experimental science, because they may already have, on staff, people who are trained to administer OpenStack, or people who've heard about cloud computing and are familiar with that mode of operation.

So last but not least, I wanted to introduce you to the fantastic Chameleon team. This is, of course, myself. I'm the Chameleon PI, and I also serve as science director and architect of the system. Joe Mambretti is in charge of the programmable networks and works very closely with Paul right now, trying to make his application work on Chameleon. DK Panda, an expert in high-performance networking, currently giving a talk in the room next door, is doing a lot of interesting stuff with virtualization and InfiniBand; he makes sure that our InfiniBand users can make the most out of it. Paul Rad, who has a joint appointment with UTSA and Rackspace, is our industry liaison and also takes care of education and training. Pierre Riteau, who unfortunately could not be here with us today, is the Chameleon DevOps lead and probably put the most blood, sweat, and tears into the development of that initial infrastructure. But somehow he chose getting married over coming to the OpenStack Summit, you know; there's no accounting for tastes. And finally, Dan Stanzione, whom some of you may have seen yesterday during the keynote in the morning, is the director of TACC and serves as facilities director on the Chameleon project. So with that, this is my talk. Are there any questions? Yes.

Yes, we have. So in fact, currently, when you come to Chameleon, you get a quota of 20,000 service units, where a service unit is essentially a node for an hour, one node-hour. When that quota expires, that's it; you now have to apply for an extension. And this is fine, we'll save your image, we'll save your appliance; you can come back and run later. So you can apply for an extension. The other thing that we found out last December was that we actually had so many users using Chameleon that it was impossible for people who wanted to run on hundreds of nodes to run their experiments. So the student I showed who was comparing KVM and Docker, for example, was not able at that time to run an experiment with enough nodes, even though it had been possible before. So because of that, we introduced a one-week limit on a single reservation. And we do make exceptions from that rule; there are some people who need to run long-running services, and it's only maybe on one node, and it's within their quota otherwise, so that's fine. And we did have to make modifications in order to support this quota-based system. Yes.

Absolutely, it's open source. So I'll be happy to point you at the code if this is something that you're interested in.

I was talking with someone last week about cluster design, and they said, oh, HPC people will never go for something like OpenStack, because they don't even like VMs, because they introduce jitter, and all their CPUs have to be perfectly matched, otherwise variations will cause huge performance loss, and they want nothing to do with all this stuff. But clearly, you're doing it. So I feel like the way to reconcile this is that he's perhaps defining a very narrow Jaguar-style supercomputer cluster, and you're a more loosely coupled kind of cluster.
I wonder if you could talk about the design space and what size of market, if you will, of HPC workloads you wanted to support, and whether you feel like you've hit a majority of those kinds of workloads with this design.

Yeah, I'm delighted to talk about that, because actually that's the area in which I work. So you're right, I'd say there's HPC and there's HPC, right? There's HPC which is the 1%, the luxury end of computing, which is the top 10 supercomputers in the world and all of that. And it's true that they are extremely suspicious of virtualization, because virtualization introduces jitter and overhead and all of those things. However, think about this: if you have a supercomputer composed of a million elements and one of those elements fails, you have to go back to square one, that is, to your last checkpoint, and restart from there, which is a huge inefficiency. So those programming models that require constant barriers and constant synchronization are not necessarily efficient anymore for the newer supercomputing resources that are composed of so many processing elements. So within HPC, for reasons completely distinct from cloud computing or anything like that, there is a movement toward a different programming model. That's one thing. Our kind of HPC, hundreds of nodes, is what in the real HPC world is called medium scale; in the real HPC world, you're talking about hundreds of thousands of nodes and millions of cores. So what we're trying to do is buy as many nodes as we can afford, to enable people to show as much scalability as they can, because there is no computer science testbed right now that will support scalability experiments on thousands of nodes. And it's very interesting that, for example, people at Argonne are using Chameleon quite a bit, even though they have many other resources that they could potentially use.

However, all that said, cloud computing is making inroads into HPC right now. So, wearing a completely different hat, I'm working with one of the computing outfits at Argonne National Lab, and we're deploying a sort of hybrid infrastructure that behaves a little bit like Mesos in that it gives out one partition to Torque and another partition to OpenStack. But unlike Mesos, where those partitions are static, it dynamically negotiates the size of the partitions, so that if an on-demand request comes, you can borrow some nodes from Torque. People are talking about malleable jobs in high-performance computing; there are systems like Charm++, there are frameworks that are workflow-based, that support those malleable jobs, and they can take more resources or give up some resources as needed. So it's a very hotly debated issue right now in HPC. And I don't think that we can really afford to have, at the luxury end, a 1% of computing somewhere that's completely distinct and plays by very different rules than everything else in the world, because then there's innovation that happens on one side that does not propagate to the other. So I'll be delighted to talk more about that subject later, but I see that there's another question.

Well, yeah, I guess maybe a little follow-up point to that question as well is that, I mean, now standard CPU architectures are introducing more jitter in performance. Like, standard Haswell CPU performance differences of about 7% or 8% are normal for Intel, right?
But my question was actually, you skipped really quickly over the slide that had the details of the hardware that you had, and I was interested to know whether, you mentioned you've got one rack that's got InfiniBand, but in the rest of the cluster, do you have RDMA capability? Do I have what capability? Do you have remote direct memory access capability in the rest of the cluster? So do you have RoCE or something over Ethernet? OK, well, no, it's a fairly vanilla Haswell node cluster kind of thing, right? So it's an interesting question to what extent the users will eventually want to customize those capabilities and how they will want to work with that, right? But right now, what you see is what you get. We might get something else in place. So is the networking in that part of the cluster, then, just 10-gig Ethernet or something? It's 10 gig, but there are connections; they'd get a daughter board. Yeah. Cheers. And there are actually 40-gig uplinks to the core network. So, yeah, it's mostly 10 gig. Any other questions? I'm not sure. Let's go with them.

Hello. Thank you for your presentation. I apologize, I missed the first 10 minutes, so I may have missed this earlier on. But have you looked at doing any bare metal workloads to help potentially reduce that impact of virtualization, the penalty that you pay for virtualization, and maybe some integration with Ironic to be able to do bare metal as a service, to allow much more direct metal contact for application testing?

OK. So we ourselves don't really run experiments; it's our users that run experiments, typically. I know that they have run experiments like that. In particular, one of the experiments that I mentioned, out of the University of Pittsburgh, was itself comparing KVM and Docker. But that whole research group focuses on developing a lightweight hypervisor called Palacios. So if you Google for Palacios, you'll see how that works. And what they effectively do is they take all the virtualization bits out, so there's no virtualization left, essentially a sort of modified bare metal operating system, and then very carefully put them back. And they had very impressive results. I remember that they ran on Red Storm, which is a supercomputer at Sandia, on thousands of nodes and got something like, I think, within 5% of execution on bare metal, which is very, very impressive. So it's really worth checking out. The other interesting place to look is a project called Hobbes in the Department of Energy space; there's an overlap between those two teams, and they are developing a very lightweight hypervisor. Thank you.

And you had a question? OK, Cody, would you like to answer that question? So we, in fact, have only a very small part of the three petabytes configured and available right now, and we're trying to configure it in a way that will consume all three petabytes. Yeah, I'm being shown that there's just one last question. Is that OK? All right. I had two, but you can pick one. So the first one was, how do you actually account for the level of usage on the system? Do you actually charge back to any of the people, or is it always free to use? And the other question is, how do you actually separate your tenants' networking, then, if you don't have VLANs in Ironic? I mean, I see that you've got OpenFlow-enabled switches. Are you doing something like network slicing with OpenFlow, or what?
OK, so this is a fantastic question. We would love to do the OpenFlow slicing. We're not doing it right now; the tools are not ready, so we've got people working on it, but it's not deployed in practice. I think that probably the first step for us will be isolated VLANs and being able to deploy that. And we need that for many reasons; number one is to enable more interesting networking experiments. And then ultimately, we would like to make the OpenFlow capabilities available to users. Right now, they are in the hardware, but there isn't a good way of using them. Although, if you would like to use them, we would love that, and then we would be working with you the way we're working with Paul. We don't quite have the capabilities that he needs to do his work, so he's both a user and also working with us on developing them. And then to answer your other question, how we're accounting: we're taking the time of your reservation. It's not the image deployment time; it's the reservation time, from when your reservation starts clocking in to when you clock out, and we charge that against your 20,000 service unit allocation and also against your lease time. OK, thanks, everybody, very much for listening. And come and use our infrastructure; we would love to have you as users.
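As a small footnote on the accounting scheme described in that last answer, here is a rough illustration of how a 20,000 service-unit quota translates into node-hours under the one-week reservation limit; the node count is just an example, not allocation policy.

```python
# Illustrative arithmetic only: how far a 20,000 SU quota stretches if
# 1 SU = 1 node-hour and usage is charged for the full reservation window.
quota_su = 20_000
nodes_reserved = 64
reservation_hours = 7 * 24          # the one-week cap mentioned above

cost = nodes_reserved * reservation_hours          # 10,752 SUs
remaining = quota_su - cost
print(f"one week on {nodes_reserved} nodes costs {cost} SUs, "
      f"leaving {remaining} SUs")                  # leaving 9,248 SUs
```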