Good morning. I'm Mike Lowe. I work for Indiana University, and my colleague here is Robert Budden from the Pittsburgh Supercomputing Center. Let me apologize right up front: the quality of the conference is inversely proportional to the quality of my voice. So I'm going to try to make it through here, but bear with me.

This is a little bit different than the audience we're used to, I'm guessing. Normally we're presenting these systems to people who know more about the NSF than we do, but may not necessarily know what OpenStack is. I'm assuming by Wednesday morning that everyone here knows what OpenStack is, so we might need to discuss the National Science Foundation a bit instead. Traditionally they've funded large systems that focus on delivering floating-point performance. A couple of the most recent ones are up here, the big boys: Stampede, right here at UT Austin at the Texas Advanced Computing Center, TACC, and a couple from the recent past. A few months before they funded our systems, they funded some cloud systems. But these are fundamentally different, because they fund two different kinds of systems: systems that are instruments to study computer science, designed to study how to build clouds, and then clouds that are designed to study everything else that isn't building a cloud. Ours are the latter.

These awards were made out of what's colloquially known as the Track 2 solicitation. It's a multi-year program that acquires one or two systems a year, and these all fall under the HPC umbrella organization known as XSEDE, so all the accounts and accounting go back through that organization, and so does the first line of support. A couple of systems are listed there; these are the most recent ones added to XSEDE. XSEDE is a five-year, $120 million, mostly support project to help people get onto these systems and deliver these resources to researchers at institutions.

I lifted a few phrases from the solicitation, and you may notice that some things stand out: they mention "new" three times and "communities" twice, and keep talking about capabilities that we don't already have. These numbers might illustrate why they're on about something that's not traditional HPC. Only a handful of researchers actually use any of the HPC systems that are provided, and only a few dozen researchers actually use most of the cycles delivered in this country. They've given a few reasons in surveys. So this is the question that was asked, and these systems are our answer to it. Slightly unusual: usually the solicitations are for one award, but the NSF seemed to think our responses were meritorious enough to warrant two awards, and I think that speaks to the quality of the proposals that both of our institutions delivered.

Specifically about Jetstream: we focused on being user-friendly. We focused on delivering a library of VMs so that researchers could get started quickly; once you log in, we wanted it to be 30 or 60 seconds and you're logged into your VM doing work. We targeted as many disciplines as we could possibly manage. If you take the numbers and put them on a rough graph, you wind up with what we call the long tail of science, and that's what we were really targeting here: something bigger than a laptop, but smaller than the largest shared-memory machines.
Just as a side note, it looks like Bridges fits in very nicely, right adjacent to the community that we were targeting. We also wanted to support some science gateways on our systems, so we're going to have some long-running, persistent VMs. And one of the other communities we targeted was the EPSCoR states; they're listed out there. I personally believe it's consistent with both the letter and the spirit of the mission of the National Science Foundation to have more people doing more science, and we're better off by being more inclusive as a project. We believe this project furthers those aims, and so that's what we're going to do.

Specifically, what we bought: 320 blades, Haswell nodes, 24 cores per blade. All the numbers are listed out there; you can read those. But we apportioned our flavor sizes in a very specific way, because we can only report our usage in terms of CPU hours. So we can't bill for memory or disk or any of those things; we have to bill for CPU. We essentially chopped up the blade, and that's what we came up with. It's actually two separate, distinct clouds: one here at TACC, our partners, and the other in Bloomington at IU. Both are plugged in with 100-gig networks back to Internet2, and we have a private 10-gig network for XSEDE. There's also a small development and test cluster that went off to our partners at the University of Arizona.

Here's a quick diagram of the hardware in Bloomington. The Texas hardware is slightly reconfigured, but it's got all the same pieces. We've got racks of compute, with storage, management, and assorted service nodes in the middle. We peeled off one blade from each of the five racks for a network node, and the first three racks each got another blade peeled off for the controller nodes. Not the best diagram, but you sort of get the idea. The blades each have two 10-gig links back to each of the chassis switches; those are stacked and go up to the top of rack with 4x40 for 160 gigs aggregate. We've got 2-to-1 oversubscription at this level, with 320 gigs aggregate from the blades into the chassis switches. There are four chassis to a rack, and those all go up to the top of rack. The top-of-rack switches connect to the 40-gig spine switches, which are tied together by 3x40s, so that's 120 gigs should one of the paths be broken there, and 4x40 for 160 gigs back off to our core router.

A little more detail about the service nodes: we used a three-way Galera MySQL and RabbitMQ cluster, backed by SSDs. And we have a pair of load balancers; those got two 40-gig NICs each, so they also have 160 gigs aggregate. We put a primary virtual IP with keepalived on each of those and use DNS round robin to roughly balance between them; if one goes down, it takes over the primary virtual IP of its mate. There are a lot of things we did that, like everybody else, are not particularly special: you can just follow the stock install docs and arrive at the same answer we did. Cinder and Glance are backed by Ceph; KVM, libvirt, regular Nova stuff. We just grabbed the publicly available packaged bits and installed those. Slightly more unusual: we use Linux bridge instead of OVS, with VXLAN for our tenant networks. We grabbed what at the time were brand-spanking-new Intel X710s, which have VXLAN offload. We use Keystone v3 with domains, which is just not terribly common. And we put everything behind HA that we could possibly afford to.
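The flavor slicing Mike describes above, where each flavor is a proportional slice of a 24-core, 128 GB blade so that CPU-hour billing implicitly covers memory and disk, can be sketched roughly as follows with openstacksdk. This is an illustration, not Jetstream's actual tooling; the cloud name, flavor names, and sizes are placeholders.

```python
# Minimal sketch: carve a 24-core / 128 GB blade into proportional Nova flavors
# so that billing by vCPU-hours also accounts for memory and disk.
import openstack

conn = openstack.connect(cloud="jetstream-iu")  # assumed clouds.yaml entry

BLADE_VCPUS, BLADE_RAM_MB, BLADE_DISK_GB = 24, 128 * 1024, 2048

# Each flavor gets RAM and disk in the same ratio as its share of the cores.
for name, vcpus in [("m1.tiny", 1), ("m1.small", 2), ("m1.medium", 6),
                    ("m1.large", 12), ("m1.xlarge", 24)]:
    share = vcpus / BLADE_VCPUS
    conn.compute.create_flavor(
        name=name,
        vcpus=vcpus,
        ram=int(BLADE_RAM_MB * share),
        disk=int(BLADE_DISK_GB * share),
    )
```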
All of our deployment at the Bloomington cluster is done with SaltStack; the Texas guys use Puppet, but a little bit more on that later. Now, stuff that nobody else has: our partners at Arizona created the Atmosphere web user interface. It's the same one that sits in front of iPlant, now CyVerse after a recent rename. We've got water-cooled doors, which are kind of cool, and somebody might be interested in those. And we've got 100-gig networking that we need to try to max out. Atmosphere, if you're not familiar with it, is best described as a re-implementation of Horizon, but it takes care of all the networking, security groups, and a whole lot of other stuff so that you can just click and go. Here's a block diagram, not exactly to scale, of Atmosphere and OpenStack, with OpenStack in the lower right-hand quadrant. A little more about the water-cooled doors: I just like them because they change colors. They've got tricolor LEDs, so when something goes wrong you can walk by the machine room and spot it at a distance. The exhaust air actually cools the room; the exhaust is colder than the input.

Here's a screenshot of the Atmosphere dashboard when you first log in, and the kind of launch that I keep talking about. You log in, you pick an image, you hit the launch button, you say what size, and you say whether you want it to launch at Bloomington or in Texas. Two or three minutes later you can log in, and that's really all there is to it. There's no setting up of virtual routers or networking or any of that: just click, click, go. The authentication winds up being slightly unusual. One of our mandates was that we not add yet another username and password, so we're using OAuth2 and the Globus Auth pieces to use the credentials you normally log in with at the top level at the XSEDE portal. That kicks back a username that we then map to a TACC username; the TACC accounts are synced to us via LDAP. We use Keystone trusts to impersonate the user, and then use that token to hit OpenStack from there.

We ran HPL, LINPACK, inside and outside of a VM. You can see the bare metal there, and the cost of virtualization is mapped out in that slide too. Some quick Rally benchmarking: it takes about 10 seconds from making the Nova boot request to being able to log in. Here's a laundry list of partners; we've partnered with, I believe, forty-some organizations, but these are the ones that fit on the slide. And a laundry list of links: Use Jetstream, that's our user interface, and you can go there right now, poke around, and see what it looks like. Of course there's the XSEDE portal, where you'll need to get an account, plus some training documentation, a very detailed paper, and maybe the most important link if you are deploying yourself: we have a public GitHub repo that has every Salt state configuration we use to deploy the Bloomington system, and the Texas guys based all of their Puppet off of this repo. This is live, so it's the configuration that is continuously applied today. If you use it, you'll wind up with whatever I have at any given time, for better or for worse.

Depending on how you feel about public speaking, I'm either very lucky or very unlucky to be up here today. I'm representing the hard work and considerable talents of dozens and dozens of people. Here are a few of them, and I'm sure I've left some people out, but they deserve a lot of the credit for all of this.
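The Keystone trust impersonation Mike mentions can be sketched like this: the mapped user delegates a role to a service account, which later authenticates with the trust ID and receives a token scoped as the user. This is a hedged illustration, not the Atmosphere code; all endpoints, usernames, and role names are hypothetical.

```python
# Minimal sketch of the Keystone v3 trust pattern: a service account acts on
# behalf of a mapped user. Names, IDs, and endpoints are placeholders.
from keystoneauth1.identity import v3
from keystoneauth1 import session
from keystoneclient.v3 import client as ks_client

AUTH_URL = "https://keystone.example.org:5000/v3"  # placeholder endpoint

# 1) The mapped user (trustor) delegates a role on their project to the
#    service user (trustee), allowing impersonation.
user_sess = session.Session(auth=v3.Password(
    auth_url=AUTH_URL, username="jdoe", password="...",
    project_name="jdoe-project", user_domain_name="tacc",
    project_domain_name="tacc"))
ks = ks_client.Client(session=user_sess)
trust = ks.trusts.create(
    trustor_user=user_sess.get_user_id(),
    trustee_user="atmosphere-service-user-id",   # placeholder ID
    project=user_sess.get_project_id(),
    role_names=["member"],
    impersonation=True)

# 2) The service later authenticates with its own credentials plus the trust
#    ID, and gets back a token scoped as if it were the user.
svc_auth = v3.Password(auth_url=AUTH_URL, username="atmosphere",
                       password="...", user_domain_name="default",
                       trust_id=trust.id)
svc_sess = session.Session(auth=svc_auth)
print(svc_sess.get_token())  # token used to call Nova/Neutron on the user's behalf
```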
Hi, I'm Robert. As Mike introduced me, I work for the Pittsburgh Supercomputing Center, and I'm going to talk to you a little bit about the Bridges architecture and how it fills the other gap in HPC needs. So what is Bridges? We wanted Bridges to be a uniquely flexible resource in the HPC world. The NSF wanted us to be able to support traditional and non-traditional workflows, to bridge the gap between communities, HPC, and big data. So we built a heterogeneous cluster; we wanted to be able to do a lot of different things. The majority is general compute and large shared memory, and then we also have sections of GPU nodes, database- and web-server-specific nodes, and data transfer nodes for bringing in your data sets.

There are a lot of different use cases. I'm only going to touch on a few, since I'm not a scientist myself, but basically we do a lot of big-data application workflows. We're doing science gateways and communities that want to set up a web service and have a workflow orchestrated through the web that submits to the private back-end batch cluster. Other things like graph analytics, machine learning, really large memory, really large in-memory databases. And then there's lots of genomics work we do at PSC: with PGR, the Pittsburgh Genomics Research Center, we're working with Pitt to do a lot of things. They have some huge data sets they're generating, hundreds of terabytes of data being stored and processed. I'm not going to go through all of these, but if you want to know more about the science, we can get you in touch with anybody you'd like in your field.

A little bit about our technology partners. HP Enterprise did the compute servers: the architecture, the design, and the installation of the hardware. Under the hood it's Intel: we're using Intel CPUs, their new Omni-Path architecture interconnect that I'll talk about a little later, and also the recently acquired Lustre file system that we use as a distributed file system at PSC. The GPUs are NVIDIA, and for all of our storage servers we traditionally go with Supermicro and roll those ourselves.

A little more in depth about the hardware. This is phase one; it's in production right now. We just exited our early user test phase, and if you need grants or anything, you can apply and we can get you on there now. There are 752 of what we call the RSM nodes, the regular shared memory nodes; they're the general-purpose compute. They each have 128 gigabytes of RAM, terabytes of local scratch storage, and Intel Xeon E5s. Then a subset of those, 16 RSM nodes, have been packed with dual Tesla K80s. These are for doing GPU simulations, OpenACC-type work, or any other codes you have that could take advantage of the GPUs. Then, getting into the more specialized nodes, we have what we call the LSMs and the ESMs, the large shared memory and extra-large shared memory nodes. The large memory nodes have three terabytes of main RAM, and the extra-large have 12 terabytes. So we can support really memory-intensive applications, or large data sets that need to be loaded into RAM and processed, where we don't want to take the performance penalty of going off to disk or to the distributed file system. And then on the external side of things, we have database and web server nodes and data transfer nodes.
These will be externally accessible and also accessible to Bridges' private network. So this will be the gateway, or the bridge, between the private batch cluster and a website or web workflow like PGR that needs to let community users come in and submit through a web portal without really having to understand how batch HPC-type jobs work. The phase one total is about 0.9 petaflops and about 144 terabytes of RAM. By the end of the summer, we should have phase two. We staggered this to be able to take advantage of some new technology that's coming out: we'll be adding 32 more of the GPU nodes with a slightly newer revision of Intel Xeon processors that I believe are not quite out yet, and likewise a new generation of GPUs. We're also going to add 34 more of the LSM nodes and two more of the Superdome X nodes. So by the end of the summer, we should also have a much larger pool of these large shared memory nodes to facilitate user jobs.

One of the key points for us was the Intel Omni-Path architecture. We were the first major deployment of it. It's a 100-gigabit fabric, and all of Bridges' private network is interconnected with it. It's designed for really high message-passing rates; they quote 160 million messages per second. So this is our HPC, high-performance side of things on the back end. This is an architecture diagram of how the network is laid out. You can go to the Bridges pages at psc.edu, and there's an interactive Chrome app where you can go through, look at the architecture, and see how it's all laid out. It looks fairly complicated, but it's much simpler when you look at the details.

So now I might as well talk about the software stack. What are we doing? Well, obviously we're doing OpenStack, because I'm here. We're using OpenStack in a couple of different ways. The main way is Ironic: the entire cluster is bare-metal booted through OpenStack Ironic. We're using that as the provisioning architecture behind pushing out the OSes. We're also doing some virtual machines for the science gateways and these database nodes, so we're using a separate OpenStack setup to facilitate the VMs. We use Slurm; people familiar with traditional HPC systems will know it's basically like PBS or LSF or Sun Grid Engine, batch processing for MPI jobs or other batch-type work. We use Puppet in conjunction with OpenStack and Ironic and all of these pieces to spin up services and handle node configuration after OpenStack pushes out the bare metal. And then on the back end we've got two distributed file systems: about 10 petabytes of disk that's split roughly 50-50 right now between Lustre and SLASH2. SLASH2 is an in-house file system we developed that was designed to do data replication and archival storage. We're also doing some Docker. We've had a lot of interest from users wanting to do Docker and wanting to know how we're going to support it, so that's become a pretty hot topic for us recently.

So, as I said, we're using multiple OpenStack setups. Let me skip through to Ironic. What we got from Ironic is that we've replaced our traditional HPC netboot infrastructure. Traditionally we had been booting through PXE and TFTP, very similar to how Ironic works, but we had concerns about how that would scale and how we would handle things like image management and the organization of everything.
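To make the Ironic-instead-of-hand-managed-PXE idea concrete, here is a rough sketch of enrolling and deploying one node with openstacksdk. This is an assumption-laden illustration, not PSC's configuration: the driver name, BMC credentials, node properties, and the Glance image UUID are all placeholders.

```python
# Rough sketch of enrolling a node in Ironic and deploying a whole-disk image
# to it, standing in for the manual PXE/TFTP workflow described above.
import openstack

conn = openstack.connect(cloud="bridges-ironic")  # assumed clouds.yaml entry

node = conn.baremetal.create_node(
    name="r01c02b03",
    driver="ipmi",   # modern hardware type; placeholder for the site's actual driver
    driver_info={"ipmi_address": "10.0.0.3",
                 "ipmi_username": "admin",
                 "ipmi_password": "secret"},
    properties={"cpus": 28, "memory_mb": 131072, "local_gb": 3600},
)
conn.baremetal.create_port(address="0c:c4:7a:00:00:03", node_id=node.id)

# Walk the node through Ironic's state machine until it is available.
conn.baremetal.set_node_provision_state(node, "manage")
conn.baremetal.wait_for_nodes_provision_state([node], "manageable")
conn.baremetal.set_node_provision_state(node, "provide")
conn.baremetal.wait_for_nodes_provision_state([node], "available")

# Point it at an OS image (e.g. one built with diskimage-builder) and deploy.
conn.baremetal.update_node(node, instance_info={
    "image_source": "<glance-image-uuid>",  # placeholder
    "root_gb": 100,
})
conn.baremetal.set_node_provision_state(node, "active")
```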
In building all these images, we get a lot of functionality from diskimage-builder. We can automate building the OS, and we can automate adding all of our packages into it, so we gain all of that through OpenStack. The other interesting thing is that we gain the ability to provision based on the type of hardware: we can version the storage nodes with different images and the compute nodes with different ones as well. So this is a really easy, flexible way to boot the nodes without having to manage any of the PXE and TFTP stuff manually. We use Puppet to automatically change the configuration of the compute nodes on the fly. The nodes take about 10 minutes to boot, give or take, so instead of rebooting every time with Ironic, we have a larger disk image and we use Puppet to manage the services, whether a node becomes part of a Hadoop cluster for a certain portion of time, or whether it's spinning up nova-compute and doing KVM on the back end, etc. We can control that with Puppet and use Ironic basically for major updates and major image pushes out to the hardware.

For us, the big thing with Ironic was being able to take full advantage of the Omni-Path. We have a very early generation of it, since we're the first to have it, and it does not support SR-IOV, so doing VMs for us is not really feasible; we want people on bare metal or in containers so they can take full advantage of the interconnect we have. The other thing that's great with Ironic and Puppet is being able to reproduce the deployment. We have our Ironic image and we have Puppet that creates a service; we can wipe a node clean if something goes wrong and guarantee that it's in the exact same state it was before.

For VMs we have the separate setup. It's Liberty, and we're using Neutron with Open vSwitch. The main reason we're using Open vSwitch is that all of our compute nodes are in private address space, so we set up OVS tunnels for the VMs to be able to talk externally. This allows users to get in from the outside instead of having to bounce off our login nodes or set up some kind of proxy service, so it fits our needs almost perfectly. We offer both PSC-managed and OpenStack-managed VMs. We have users who basically don't want to know the gory details: they just need a database, they want something spun up for them, they don't want to manage it, and they don't need root access. We have VM templates to spin this up for them, and they log in with their standard PSC LDAP and XSEDE credentials. For the advanced users who want more control, we just provision the resources into OpenStack and allow them to push out whatever images they'd like.
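As a rough illustration of the "PSC-managed VM" case just described, here is what spinning up one template-based VM might look like with openstacksdk. The image, flavor, network, and key names are invented for the example; the real templates and gateway networks are site-specific.

```python
# Hedged sketch: boot one VM from a template image on the externally reachable
# gateway network. All names are placeholders, not PSC's actual resources.
import openstack

conn = openstack.connect(cloud="bridges-vm")  # assumed clouds.yaml entry

image = conn.compute.find_image("psc-mariadb-template")
flavor = conn.compute.find_flavor("db.large")
network = conn.network.find_network("gateway-net")

server = conn.compute.create_server(
    name="pgr-database-01",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": network.id}],
    key_name="ops-key",
)
server = conn.compute.wait_for_server(server)

# Attach a public address so the gateway VM is reachable from outside,
# in the spirit of the OVS-tunnel-to-the-outside setup described above.
conn.add_auto_ip(server, wait=True)
print(server.name, "is", server.status)
```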
Container support has become a big topic; we see a big increase in demand for Docker. And as I said, with our Omni-Path iteration, we need the users to stay close to the metal. Right now this is something we work through with user services and sysadmins, to facilitate spinning up these Docker containers, but we're looking at Magnum and at nova-docker as ways to have OpenStack automate the process instead of requiring manual intervention. I think one of the biggest things with Bridges is our roadmap. We have lots of ideas. We're going to be migrating to Mitaka; probably as soon as I get home, I'll start working on that.

But again, we're looking at Magnum and nova-docker, and at ways to use OpenStack to set up our Hadoop clusters, or Hadoop portions of the machine, on the fly, instead of having user services and our Hadoop guy do it manually. The other great thing is that as OpenStack improves, Bridges can improve. As new projects come online, things like Trove and Sahara and Manila, we can incorporate those features, and the Bridges architecture can grow with the community. The other thing we're doing is automating this piece with Slurm. Basically, if a user wants nodes in OpenStack, I have scripting set up so that the Slurm prologue can spin up nova-compute on their reservation. It metadata-tags that hardware for them and sets up a proper flavor matching the hardware spec, so they're guaranteed to only run on the nodes they have reserved in Slurm. We end up using Slurm almost as the accounting, to keep track of how many hours, how many cores, and how much memory they've used, and we just let OpenStack do its thing.

The other thing we're going to be working on is Ironic boot over the Omni-Path. We're planning on modifying the Ironic deploy image to install the Omni-Path drivers, so we can push these images out the 100-gig pipe instead of out the back-end gigabit Ethernet. We're also looking at a containerized setup: we're basically looking for ways to roll out new releases but, in the event of something not being quite right, easily roll back to what we know works. So I'm looking at the Kolla project to facilitate this process. I know that project has come a long way since the last time I looked at it, so we're excited about what's happening there as well. And then we also want an increased HA setup. Because our two OpenStack setups are split, we'd like to find a way to maybe unify them into one and have more HA than having them separate. And a big thing for me particularly is to contribute back to the community. I'm a developer by nature, and I do sysadmin as well, but I would like to fix some of the bugs I've found and some of the scaling issues we've seen, and push that back out to the community so that we can become a part of it and contribute back.

For additional information, these links are just myself at the top and our PIs. There's a ton of people behind the scenes, as Mike said, who contribute in many different ways, whether it be Puppet or networking, so I didn't even attempt to list them, but they really deserve a lot of credit as well. If you need more information, feel free to contact myself or the PIs, or just head to our website, and you can check anything out for applying for grants or getting time on the machine. Thank you.

Oh, that's actually a video; I did not know that. Anyway, this is Bridges, a time lapse of it being built; none of it was there before. I thought it was a still image, but I guess, enjoy. If you have questions, feel free to come up to the microphone and ask Mike or myself if you'd like to know anything more.
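Before the Q&A, here is a hedged sketch of the Slurm-prologue-to-Nova handoff Robert described: tag the reserved hosts in a host aggregate and create a flavor pinned to that aggregate, so instances using it can only land on the reserved nodes. Everything here, the credentials, host names, and sizes, is a placeholder rather than PSC's actual prologue script, and it assumes the AggregateInstanceExtraSpecsFilter is enabled in the Nova scheduler.

```python
# Hedged sketch of what a Slurm prologue hook could do for an OpenStack
# reservation. Names, credentials, and sizes are placeholders.
import os
from keystoneauth1.identity import v3
from keystoneauth1 import session
from novaclient import client as nova_client

def reserve_hosts_for_openstack(resv_name, hosts, vcpus=28, ram_mb=126976, disk_gb=3600):
    auth = v3.Password(auth_url="https://keystone.example.org:5000/v3",
                       username="slurm-hook", password=os.environ["OS_PASSWORD"],
                       project_name="admin", user_domain_name="default",
                       project_domain_name="default")
    nova = nova_client.Client("2.1", session=session.Session(auth=auth))

    # Host aggregate whose metadata carries the reservation name.
    agg = nova.aggregates.create(resv_name, None)
    nova.aggregates.set_metadata(agg, {"reservation": resv_name})
    for host in hosts:
        nova.aggregates.add_host(agg, host)

    # Flavor sized to the node and pinned to the aggregate via extra specs,
    # so instances built from it only schedule onto the reserved hosts.
    flavor = nova.flavors.create("resv.%s" % resv_name, ram_mb, vcpus, disk_gb)
    flavor.set_keys({"aggregate_instance_extra_specs:reservation": resv_name})
    return flavor

# e.g. reserve_hosts_for_openstack("slurm-12345", ["r01c02b03", "r01c02b04"])
```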
I have a question. Since you are among the first using the Intel Omni-Path, how's the performance with MPI over Omni-Path at this stage? I'm not sure I can attest to the performance myself; I could get you some numbers on that. I know that we're still working through driver stuff with Omni-Path, working directly with Intel. They've been really good about supporting it and coming on site and helping us with any problems we've had, with Lustre or whatever. If you want specific numbers, get in contact with me and I can put you in touch with somebody who can provide them.

For the Jetstream case, for the CPU performance numbers, were there any Nova tweaks to make it faster, or not? No, that was just stock; nothing special there. I recommend looking at it, because that seems to be quite cool.

The next question was about accounting, on either system. You want to take that one? Accounting is still in process, so we're experimenting with Gnocchi, and the Atmosphere web user interface also has its own accounting and quota system; it has a concept called an AU, an Atmosphere Unit. That is an over-time quota, in contrast with the traditional Nova quotas, which are instantaneous. It is still being worked on, and hopefully we'll have that hammered out here in a couple of months. Our acceptance review is Tuesday and Wednesday; once we get the report back, we'll go into full production. We're in early operations right now. We're thinking probably the June-July timeframe is our deadline to have all the accounting stuff worked out. We've got a lot of work ahead of us, and we'll make every effort to kick whatever we figure out back upstream. I think we're in the same boat, looking at Gnocchi and CloudKitty, but currently we've always done our accounting through PBS, through the batch resource. Since Slurm is spinning up the reservations on Bridges, we at least have that portion already, and maybe we can fine-grain tune it against what we see them actually using in OpenStack. Right now it would basically be them paying for the reservation to do cloud-type stuff, or core hours for the back-end batch stuff.
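Since both answers come back to reporting usage in CPU hours, here is a hedged sketch of pulling the raw numbers out of Nova's simple-tenant-usage API, the kind of data an XSEDE-style report or a Gnocchi/CloudKitty pipeline would start from. The endpoint and credentials are placeholders, and neither site's actual reporting is shown in the talk.

```python
# Minimal sketch: per-project vCPU-hour totals for the last 30 days via the
# Nova simple-tenant-usage API. Endpoint and credentials are placeholders.
import datetime
from keystoneauth1.identity import v3
from keystoneauth1 import session
from novaclient import client as nova_client

auth = v3.Password(auth_url="https://keystone.example.org:5000/v3",
                   username="reporting", password="...",
                   project_name="admin", user_domain_name="default",
                   project_domain_name="default")
nova = nova_client.Client("2.1", session=session.Session(auth=auth))

end = datetime.datetime.utcnow()
start = end - datetime.timedelta(days=30)

# One entry per project, with vCPU-hours, RAM-MB-hours, and per-server detail.
for usage in nova.usage.list(start, end, detailed=True):
    print(usage.tenant_id, round(usage.total_vcpus_usage, 1), "vCPU-hours")
```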
On Bridges, when the Slurm reservation expires, what happens to the VMs? That's a good question. There's an epilogue on the reservation as well. What we'd like to do, and we actually haven't encountered this yet, so we're going to be dealing with it, is to snapshot the VMs, or hopefully have the user do that. They know when the reservation ends, so they're kind of in charge of getting their data off the VMs, or migrating or snapshotting them. But I think what would be nice is to have our Cinder back end store their snapshot for some period of time, so that if they're coming back to do more compute, they don't have to start over from scratch with a blank VM.

Can you use the microphone, please? We have two questions. One side, you use Salt; the other side uses Puppet. Could you get into that a little bit and explain why? Sure. I think it was about five weeks from the time we rolled the racks off the dock to the time we started our first VM, and I have the feeling that's faster than usual. We were delayed by about six months, but we didn't let any of our back-end deadlines slip. So one of the strategies was to go with the tools that we already used. The guys at TACC had deployed Chameleon with Puppet, so they were really familiar with that and had some things already built. I had been doing a number of projects with Salt for the past several years, so I was comfortable with that. Basically, their functionality is very similar; it's more about being familiar with your tools and getting up to speed with new hardware and a new config quickly. So the reason I asked is that we are using Puppet and considering Salt as the next step, and I'm wondering, is there any similarity you could take advantage of, so that if you know one, you can move to the other? Yeah. The core functionality of all of them, Puppet, Salt, Ansible, Chef, CFEngine, take your pick, their mission is all the same. So you just need a Rosetta Stone of sorts. If you're already comfortable with one and know the concepts, you can probably get up to speed very quickly, just comparing two configs that do the same thing. Right. How about having both in the same environment? I have heard that you can use Ansible on top of Puppet, basically to control different Puppets. Sure. Actually, one of the interesting things about our user interface is that it uses Ansible to do a bunch of sysadmin-type stuff on behalf of the user at startup, so they don't have to do a whole lot of sysadmin work that is not necessarily on their critical path to getting their papers out the door. So in that example, we actually use two different configuration tools: one inside to manage OpenStack and one to manage all the VMs that get started on it.

I came in a little bit later, so I might have missed it, but are there two different accountings in this environment, or one? There is one accounting. But there are two separate clouds, two Keystones. The user IDs actually match, the passwords match, all of that, but they're not dependent on each other. I see. How do you track that? So for one user, can you have two accounts or just one? One account. Yeah, and that was actually explicit: Atmosphere was designed to span clouds. At VM instantiation, you pick which cloud, or just take the default, so it's capable of communicating with two different OpenStacks. So does that mean that if I attach a volume with Nova on one cloud, the other cloud would understand it and be able to use it? No. There's no federation here, basically. No. Federation is something we have been talking about. But we do replicate the images between the two clouds, so if you have an image in one place, or even if you are running along and decide you want to create a new image from your running instance, that will eventually show up on the other side. Okay. Got it. Thank you. Sure.

Over here? Yeah. Okay, so the question was about hardware monitoring, performance, and inventory. I know at PSC we're using Nagios, or actually the spin-off it's now named, to do most of our monitoring. We've just used it in the past, so it was a logical fit to bring over; we already had the infrastructure and most of the test cases that we wanted to monitor. And then Puppet has things built in as well; they have a web GUI that basically watches the nodes, whether they're checking in or not. So we get feedback from both of those. We use Zabbix for some of the services, and some of the OpenManage pieces, since Dell is our vendor. As for the reliability of OpenStack: some of the choices I made were specifically for being able to debug. With Linux bridge, you can take a tcpdump and watch every single packet go all the way through from end to end. That being said, there are some particular agents that are being restarted hourly from cron because they seem to just check out. But it's the best one yet; Liberty so far is the best release compared to Kilo and Juno. So I have high hopes for some day taking things out of cron.
Yeah, I would also say that every release has gotten consistently better for Ironic. There still are some issues. I know we're pretty much running out of time, so I can take it offline: there are a few scalability issues, nothing that was a showstopper. I can touch more on that if people are interested, if you want to talk offline. I'm not really sure. Hey, Jeremy. Off the top of my head, we just came out of... 159 end users, 140 XSEDE staff members. I'd have to check some stats for you; I don't have that in front of me right now. All right. Well, thank you guys. Appreciate it.