Okay, good afternoon. My name is Mike Moore, and I'm presenting today about the NASA Goddard Private Cloud. I have two titles. The first is Director of Software Engineering for Business Integra, a small business that recently graduated out of the 8(a) Small Business Program. My title at NASA, where I'm a contractor, is Cloud Engineer. The reason I have two titles: I started working at corporate and was promoted to Director in January of 2017. Then Business Integra started the NASA contract in February, and pretty much immediately the contract put out a task order asking for a wide breadth of support for the private cloud. Me being the nerd that I am, always having wanted to work for NASA, I was just like, me, me, me. So now I work for NASA as a contractor, and that's why I have two titles. Because I'm a contractor, I have to give this disclaimer: my opinions are not necessarily the official position of Business Integra, NASA, or the US government.

So the agenda today: I'm going to talk about the genesis of the Goddard Private Cloud, what we built, and then how we handled things like telemetry and billing, data protection, disaster recovery, security, how we're handling containers (we want to get better at that), and how we're guiding traditional HPC users through the paradigm shift into cloud computing. Then I'll go through a few case studies from some of our GPC users.

One fun tidbit, if you're familiar with the history of OpenStack: NASA was involved at the beginning and then didn't stay as involved because the product wasn't mature yet, but now NASA is using it again. Our sister project, the Advanced Data Analytics Platform 2.0, is 144 nodes in the Nebula shipping container that sat on NASA's campus from back when NASA was working on OpenStack in the beginning. So it's come full circle: the container is running OpenStack again, and that project is currently on the Pike release.

So, the genesis. In the fall of 2016, one of the directors in the science directorate realized that we have a lot of spare compute hardware from the NASA Center for Climate Simulation's Discover HPC supercomputer. The NCCS rotates that hardware out: every year they take out a chunk, a scalable compute unit or SCU, and replace it. That old hardware is perfectly good, it's just three years old, and we're able to claim it now. So we started with a prototype environment. The first environment, which we named Galaxy, was running Mitaka. We had 16 compute nodes from SCU 8, upgraded to 256 gigs of RAM, with 16-core, dual-socket Sandy Bridge CPUs at 2.6 gigahertz, 10-gigabit Ethernet, and a GPFS shared file system (that's IBM Spectrum Scale, if you know it by that name). Then we created a second, higher-security environment for ITAR, also running Mitaka, and added 16 more SCU 8 nodes for it. We're continuing the process of reclaiming that used hardware, but now that we're bringing users in and charging them (the prototype was free for about a year), hopefully we can buy some new hardware.
So, the production environment that we built. It consists of three racks. The central control plane rack has four new servers running our OpenStack control plane in virtual machines; four used IBM iDataPlex dx360s, which are the SCU 8 nodes, also running some OpenStack services in VMs; a new Edgecore one-gig management switch; and two 16-port 100-gig Ethernet Mellanox switches running Cumulus Linux. For storage, we're using a NetApp FAS2650 with 82 terabytes of raw SSD and 426 terabytes of raw nearline SAS hard drives through add-on shelves. One cool thing about the NetApp is the data deduplication: with Glance images we're getting 15-to-1 deduplication, so we're saving a good bit. Then we have two compute racks that are configured identically; each has four SCU 8 nodes with memory upgrades, a management switch, and two 16-port 100-gig Ethernet switches. Everything is dual-connected, going for redundancy, with LAG to improve throughput by sharing paths. We're using xCAT to provision the bare metal, and also to manage the OpenStack control plane VMs, which are KVM VMs; our infrastructure all runs on CentOS 7.

This is a diagram of our network infrastructure. I want to thank my colleague Jonathan Mills, who unfortunately is not here; he is in Dallas at Supercomputing 2018 presenting very much the same topic. The SEN is the NASA Science and Engineering Network, and everything is connected at 10 gigabit or better. On every 16-port switch we're using breakout cables so that we can connect every node twice at 10 gigabit in the compute racks, and we have 40-gigabit interconnects between buildings. Currently we're only running the production cloud in building 32, but down the road we will be adding additional instances of OpenStack in other buildings using Cells v2.

So this is our control plane and how everything is broken out. We're using a load balancer in front of all the API endpoints, and Puppet to manage the configuration and push all these services out to the xCAT-managed control plane VMs. In our use case, all of our API endpoints have to use SSL, and that became tricky in some cases, but it's a government mandate that everything is encrypted. The services we're running: Glance, Cinder, and Nova; Keystone, Horizon, and CloudKitty; Neutron; Heat is planned but not in use yet; and then MariaDB, RabbitMQ, Gnocchi, and our NetApp with the NetApp Cinder storage driver. Within each compute rack we have a Nova cell, so the racks operate independently without contention for RabbitMQ.

The first piece we had to address was how to do telemetry and billing, because unlike the prototype, the production instance can't be free. A lot of users used the prototype, it was awesome, but we have to recoup those expenses at some point, and we are production-ready now; we actually passed the production-ready milestone at the end of October. We were a couple of weeks off our goal, but we came really close. So we're using CloudKitty, and we want to use CloudKitty in Horizon, but it's not as feature-rich as we would like; if you've seen it, what the user can see in Horizon is very minimal.
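One stopgap is to pull rated totals straight from CloudKitty's REST API with an authenticated keystoneauth session. Here's a minimal sketch; the endpoint, credentials, project ID, and CA bundle path are all hypothetical, and the /v1/report/total path and its query parameters are my assumption about CloudKitty's v1 report API rather than something from the talk.

```python
# Hedged sketch: query CloudKitty's rating report API over SSL.
# Hostname, credentials, project ID, and CA path are hypothetical;
# the /v1/report/total path and params are assumed from CloudKitty's v1 API.
from keystoneauth1.identity import v3
from keystoneauth1 import session

auth = v3.Password(
    auth_url='https://keystone.gpc.example:5000/v3',   # hypothetical endpoint
    username='billing-reader', password='...',
    project_name='admin',
    user_domain_name='Default', project_domain_name='Default')

# verify= points at our CA bundle; everything is SSL, per the mandate
sess = session.Session(auth=auth, verify='/etc/ssl/certs/gpc-ca.pem')

# Ask the 'rating' service (CloudKitty) for one project's total for October
resp = sess.get('/v1/report/total',
                endpoint_filter={'service_type': 'rating',
                                 'interface': 'public'},
                params={'tenant_id': 'abc123',          # hypothetical project
                        'begin': '2018-10-01T00:00:00',
                        'end': '2018-11-01T00:00:00'})
print(resp.text)   # the rated total, ready to drop into an invoice
```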
So we'll probably start by using the CloudKitty API endpoints to create our invoices for users. But I just had a nice talk with Luka, the CloudKitty team lead, and there are improvements coming in the Stein release that will open up Grafana support, which will be much nicer than what we have available right now. We're running the Rocky version of CloudKitty on our Queens install, with patches for HTTPS support. Remember, I said we have the requirement to have SSL encryption everywhere, and the CloudKitty team was gracious enough to provide us a patch to add SSL support.

For storage, data protection, and disaster recovery, we have an enterprise NetApp solution instead of rolling our own using server-attached JBODs or GPFS. We could have used Ceph; we considered it, but we didn't, because we've run NetApp before at NASA, and one of my colleagues actually used to work for NetApp. The other issue was installing the extra upper decks in all those SCU 8 nodes. The SCU 8 nodes are configurable as one server per rack unit, so we can fit 40 in a rack if we double-density them; but if you want to add extra hard drives, there's an upper deck in the chassis that turns one node into 2U, and we would have lost half our compute capacity just to have the storage for Ceph. So we went with NetApp instead.

We're using Trilio for data backups, in addition to the NetApp's SnapMirror features. Trilio is really cool. We used it in our Mitaka prototype cloud: we did a demo install, everything went really well, and it was nice and user-friendly, and we're working closely with Trilio to get it configured in our Queens environment. Again, SSL requirements; they've been patching in support for our requirements. We haven't gotten it 100 percent running yet, primarily because Jonathan and I were preparing for these conferences, so when we get back next week, we plan to get that fixed and fully running. One great thing about Trilio is that it lets users self-service back up their workloads, their VMs, and their Cinder volumes, and it doesn't require installing agents in the VMs. The other option available to us, which NASA had licensing for, required installing agents in every VM, and we wanted to avoid that. We're also planning, probably about this time in 2019, to expand to a second building on campus, with an additional NetApp for data replication, using Cells v2 to isolate it.

Security: all the API endpoints are encrypted, and the CloudKitty and Trilio teams, like I said, provided patches for us. We have a requirement that all data at rest is encrypted; encrypt everything is just the general rule, and the NetApp takes care of that for us. Then, as you can imagine, there are government requirements for additional security software: vulnerability scans, antivirus, monitoring, the whole enchilada. So what we have to do is provide gold images that are blessed, security-scanned, and patched up to current levels, and those are the only images our users are allowed to use. We offer Windows Server 2016, CentOS 7, and Ubuntu 16.04 and now 18.04. We did get one nice concession for ephemeral workers that are auto-scaled or temporary: they don't have to have the full suite running immediately.
There's a grace period before all of the additional security software has to be booted up and enabled, some number of days after the VM launches, to make sure it stays in compliance with patches and everything else; but users who just boot something short-lived from one of the blessed images don't have that extra overhead in the VM. A big thanks to the CloudKitty and Trilio teams for their support; we had direct access to the developers to get the patches we needed for SSL. We're hoping to use an automated build and certification pipeline in the future; currently I update the gold images every month, and I'd like to avoid that manual step.

For performance: most of our users are science users doing heavy computation, so workload performance is very important. We've customized the metadata on our flavors to reduce cross-NUMA-domain performance loss; we're still tweaking the best way to do that, but there is a substantial difference between configuration settings on our hardware. I'll show a sketch of the flavor settings in a moment. And the NetApp Cinder driver provides the copy offload feature, which is really cool if you're not familiar with it: instead of the client node doing the copy over NFS, it tells the NetApp to do the copy locally, and it's very fast. We can boot our 50-gig Windows images in seconds because of this.

Containers. Containers are kind of like cloud 2.0; most of our users don't really understand cloud yet, so containers are an extra step beyond getting ready for cloud. We do have some users running containers, but very few. We have security concerns with users pulling from Docker Hub, because anyone can put things up on Docker Hub. One solution we're looking at is a GPC-managed registry, but in the interim we've asked users to only use images from official upstream projects on Docker Hub, like Mongo, MariaDB, and so forth. The current container users are just running Docker within a VM; their workloads aren't substantial enough to need orchestration yet, but down the road it's really plausible that will happen.

So, our future plans: adding a second site for data replication, and then adding additional OpenStack components, in order of preference. Trove for database as a service; we have some users who have asked for that. Manila for NFS self-service. Load balancer as a service; currently we have users for whom we configure HAProxy in a VM, and we want to avoid that and let users do as much as possible themselves. We're looking at Senlin for auto-scaling, so that we can create auto-scaling groups; currently users have to manually boot and terminate VMs to scale. Swift would be very cool; object storage is nice, but in our use case most users just want simple POSIX file systems and aren't ready to make the leap to object storage. And something users have asked for that we haven't been able to provide yet is GPUs. From what I've seen today at the conference, a lot of people are starting to have good solutions for that beyond just manual PCI passthrough, so I'm looking forward to that, and possibly to doing GPUs through bare metal as a service, because virtualizing GPUs loses some of the throughput. Then there are additional services we might offer as hosted services; one example might be CI/CD, maybe Zuul. Zuul seems pretty capable.
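Here's that flavor sketch I promised a moment ago. It's a minimal example of the kind of NUMA tuning involved, using python-novaclient; the endpoint, credentials, CA path, and flavor sizing are hypothetical, and while hw:numa_nodes and hw:cpu_policy are standard Nova extra specs, the values that actually help will depend on your hardware.

```python
# Hedged sketch: a flavor whose guests stay inside one NUMA node.
# Endpoint, credentials, CA path, and flavor sizing are hypothetical.
from keystoneauth1.identity import v3
from keystoneauth1 import session
from novaclient import client

auth = v3.Password(
    auth_url='https://keystone.gpc.example:5000/v3',   # hypothetical endpoint
    username='admin', password='...',
    project_name='admin',
    user_domain_name='Default', project_domain_name='Default')
sess = session.Session(auth=auth, verify='/etc/ssl/certs/gpc-ca.pem')
nova = client.Client('2', session=sess)

# Sizing is arbitrary here; pick something that fits within one socket.
flavor = nova.flavors.create('gpc.numa8', ram=32768, vcpus=8, disk=40)

# hw:numa_nodes=1 keeps vCPUs and memory on a single host NUMA node;
# hw:cpu_policy=dedicated pins each vCPU to its own physical core.
flavor.set_keys({'hw:numa_nodes': '1', 'hw:cpu_policy': 'dedicated'})
```

One note on the dedicated CPU policy: pinned and unpinned guests generally shouldn't share hosts, so in practice you'd also separate them with host aggregates; that detail is omitted here.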
There are other possibilities too: message boards, chat features, anything useful that we could host for users as a service is potentially in the pipeline; it just depends on priorities and what people ask for.

So, as I mentioned before, most of our users are used to HPC job schedulers; they don't know how to do cloud yet. Part of the challenge is that scientists have varying degrees of software development skill. Some are familiar with distributed computing patterns, some aren't, and the same goes for optimization and for the languages and technologies they have experience with or available to them. For us at NASA, Python is the most common language when users write their own code. The tools also differ by discipline: engineering users are likely running Windows, while the science users are usually running Linux. So we have to get them into this new paradigm of how to do things in the cloud. A thought that we have: scientists are undeniably brilliant, and software engineers are also brilliant, but software engineers are not necessarily the best scientists, and vice versa; the intersection of brilliance across both disciplines is very rare. So it's important that we bring the right tools and technology to the science users so they can do their work faster, less expensively, and hopefully more accurately.

So now I'm going to go through a few of our GPC user case studies. The first is the Exoplanet Modeling and Analysis Center, EMAC. EMAC serves as a repository and integration platform for modeling and analysis resources focused on the study of exoplanet characteristics and environments. It provides community access to hosted models and tools, along with user-friendly web interfaces and a searchable database of exoplanet resources. EMAC is a key project of the Goddard Space Flight Center's Sellers Exoplanet Environments Collaboration, a large umbrella group studying exoplanets. The users run several Docker containers in one VM; they don't need to scale beyond that one VM yet, but that may be coming. We also have another user looking at building a Kubernetes cluster inside their own tenant. We'd like to be able to offer tenant-isolated container orchestration, a registry, all that good stuff; it looks like it's possible, and I'm hoping to take something back with me that can help with that. The EMAC project is also available worldwide, so it's a private cloud with a couple of public services.

The next case study is the Planetary Spectrum Generator, PSG. That project generates high-resolution spectra of planetary bodies like planets, moons, comets, and exoplanets. It's a combination of shell scripts and C code, and it provides an HTTP API for users to consume data; it's another project that is available worldwide. When we first brought them into the GPC prototype, the user noted, and complained, rightly so, that performance in the GPC was a lot slower than in their previous environment. So we took a look at their code and extensively debugged and improved the performance, working with them; this is that happy harmony of scientists and software engineers that gets better results. We found that the code was opening and closing file handles much too frequently, which, as we all know, is a big performance hit, so the user fixed the code to reuse file handles instead of opening and closing files over and over.
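As a toy illustration of that fix (just the pattern, not PSG's actual code):

```python
# Anti-pattern: reopening the output file for every record.
# Each iteration pays the open/close syscall and buffer-flush cost.
def write_results_slow(results):
    for value in results:
        with open('spectra.txt', 'a') as handle:
            handle.write('%s\n' % value)

# Fix: open the file once and reuse the handle for every record.
def write_results_fast(results):
    with open('spectra.txt', 'a') as handle:
        for value in results:
            handle.write('%s\n' % value)
```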
For their temporary storage, we configured a RAM disk instead of writing to and reading from NFS. We also recommended that the user build in multithreading support, because in the GPC cloud you have a lot more cores available, but since we're using that older hardware that was free, the clock speed is not as high; you have to use multithreading to take advantage of the additional cores. The user implemented those changes, and the code now performs faster in GPC than on his MacBook Pro with a solid-state drive and a higher-speed CPU. Our most problematic simulation (remember, these are single API calls over HTTP) dropped from 27 seconds to five seconds as a result of the work we put in with the user. And that's something unique that you don't get in other clouds, especially commercial clouds: that level of support, where someone can actually sit down with you, look at the code, and help you improve it. We're able to provide full-stack development and best-practices support. Future work we're planning for PSG is to help them scale out behind a load balancer with shared storage. Currently their one VM handles the load they have, but it's going to grow, and that's a case where load balancer as a service would make things easier for the users, instead of us having to add HAProxy in their tenant VMs for them.

The next case study is KeyShot, and you can see the example rendering from the vendor; I can't show any of the NASA internal material for various reasons. It's a CAD rendering software package that the engineering teams use. They previously ran it on their laptops and desktops, so they were maxed out at about eight cores for their renderings, and some renderings take days or weeks. One cool thing about KeyShot is that it scales linearly: if you double the number of cores, you halve the time it takes to finish the job. So this runs in GPC as a managed service; it's software as a service, but for one particular, very targeted user base, not sign-up-and-go. Users point the KeyShot client on their desktop at the rendering farm in GPC and push their jobs out. Currently we're running that on Windows, but we're working with the KeyShot vendor to get their Linux version of the rendering software, which is coming out. We want to be able to scale up massively for these users when they have a deadline approaching, and with our Windows-based rendering farm we were limited by the number of Windows licenses we had available. They have licensing for KeyShot to run up to 512 virtual CPUs in GPC, which is one fifth of our total vCPU capacity at launch, so they'll be able to scale pretty big once we get them off Windows and onto Linux.

The next case study is HFSS for superconductivity analysis, and this is all above my head. They're running ANSYS HFSS, the High Frequency Structure Simulator, and the simulations are used by engineers to design science instruments. What they're looking at is which material, at what thickness, is appropriate for measuring the particular type of microwave radiation the sensor needs to observe. Before the GPC was available to them, the users had to run the software on expensive server-class desktops with eight cores, so the GPC obviously lets many more cores be used for a job. However, unlike KeyShot, the performance depends on what the job demands; it doesn't scale linearly.
So some jobs perform better at low core count and high frequency, like the desktop workstations, and some perform better at high core count and lower frequency in the GPC; the user has to figure out which is the right way to go. One fun thing is that we're actually doing MPI over TCP/IP Ethernet. That's not ideal; it just doesn't perform as well as InfiniBand would. But because the software runs on a Windows head node, it couldn't go onto the Discover supercomputer, which is all Linux.

And this is possibly my favorite case study: the M-dwarf planets. It's a Python simulation to determine the exposure time and other parameters required to observe exoplanets with the James Webb Space Telescope or other telescopes like Hubble, and it uses data from the PSG project, which is also on GPC. In this case study, the user had to run 17,496 simulations, which they estimated would take six weeks to complete on the developer's laptop with eight threads. We worked with them and were able to complete all the simulations required for their research on the GPC in 21 hours, and their results are pending submission to the American Astronomical Society journals.

So what we did (this is another one of those software-developers-working-with-scientists things): I built them a Python wrapper that uses a MySQL database to keep track of the jobs and report the results back in. I then built a systemd service around that Python wrapper so that it starts when the VM boots, and gave the user a giant comment: put your code right here. And it worked. We spawned 26 VMs for 104 total threads across both the Galaxy and Nebula prototype Mitaka cloud environments; we maxed out the capacity of both environments, because we have other users. Something the Discover team is presenting at Supercomputing is cloud bursting: moving workloads from a private cloud, or in their case the supercomputer, up into a public cloud. This is a kind of tongue-in-cheek joke, poor man's cloud bursting, because I was able to spawn workers in a different environment and just point them back at the database in the original environment, giving the users more capacity when we were already maxed out. The database was extremely simple; this is probably boring to a developer who already knows this pattern and does it all the time, but it's something the science users hadn't seen before and hadn't had to do, so it was an improvement for them. We just tracked the job ID, the parameters, and the results, and kept track of when each job started and finished.

Okay. So the software we used for that: Ubuntu 18.04 with MySQL, Python, and NumPy. The average simulation took 256 seconds, the longest took 95 minutes, and the minimum was 41 seconds, with 104 concurrent threads across two OpenStack clouds. It took 21 hours to run the 17,496 jobs, and the users were able to start on their paper the next day instead of waiting six weeks, which is what they would have done without this support.

Some potential future improvements. We'd like to make this a reusable job engine, because it's a common pattern that developers already know but science users don't, and if we can bundle it up, make it nicer, and not have a giant 'put your code here' situation, it will be easier for other users in the future. We'd like to add better error handling: mark the job failed and move on, instead of just stopping right there, which is what it was doing.
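To give a feel for the pattern, here's a minimal sketch of that claim-run-record loop, including the mark-it-failed-and-move-on behavior I just described. It's a reconstruction, not the actual GPC wrapper: the table layout follows what I described (job ID, parameters, results, start and finish times), but the exact schema, PyMySQL as the driver, and the connection settings are all assumptions.

```python
# Hedged sketch of the job-engine pattern: claim a pending job from MySQL,
# run the science code, record the result. Not the actual GPC wrapper;
# the schema, the driver (PyMySQL), and connection details are assumptions.
import json
import pymysql

# Assumed table, following the description above:
#   CREATE TABLE jobs (
#       job_id      INT PRIMARY KEY,
#       params      TEXT,                           -- simulation parameters
#       status      VARCHAR(16) DEFAULT 'pending',  -- pending/running/done/failed
#       started_at  DATETIME,
#       finished_at DATETIME,
#       result      TEXT);

def run_simulation(params):
    # >>> put your code right here <<<   (the giant comment from the talk)
    raise NotImplementedError

def claim_job(conn):
    """Atomically grab one pending job; return None when the queue is empty."""
    with conn.cursor() as cur:
        cur.execute("SELECT job_id, params FROM jobs "
                    "WHERE status = 'pending' LIMIT 1 FOR UPDATE")
        row = cur.fetchone()
        if row:
            cur.execute("UPDATE jobs SET status = 'running', started_at = NOW() "
                        "WHERE job_id = %s", (row[0],))
    conn.commit()  # releases the row lock
    return row

def main():
    conn = pymysql.connect(host='jobs-db.gpc.example',   # hypothetical DB host
                           user='worker', password='...', database='jobs')
    while True:
        job = claim_job(conn)
        if job is None:
            break  # queue drained; a future version could self-terminate the VM here
        job_id, params = job
        try:
            result, status = json.dumps(run_simulation(json.loads(params))), 'done'
        except Exception as exc:
            result, status = str(exc), 'failed'  # mark it failed and move on
        with conn.cursor() as cur:
            cur.execute("UPDATE jobs SET status = %s, finished_at = NOW(), "
                        "result = %s WHERE job_id = %s", (status, result, job_id))
        conn.commit()

if __name__ == '__main__':
    main()
```

The point is how little scaffolding sits around the science code: workers in any environment, including a second cloud, just need network reach to the one database.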
And we'd like the worker instances to self-terminate once there are no jobs left, so that users can spawn workers, finish their work, and then not get billed for idle VMs. This is also something that would containerize very well if we had container orchestration, because we'd shed the overhead of repeating the full Ubuntu OS multiple times. But we have to solve the container problem first, and I'm going to be talking with people and looking at what we can do for containers in the future. Are there any questions?

Thank you for your presentation. Could you explain in more detail why you decided to go with NetApp storage instead of Ceph?

Instead of Ceph? If I understood you right, why we didn't roll out Ceph? We considered Ceph, but we did not use it because we don't have any experience with it, and in order to add extra hard drives to each of the compute nodes, we would have to put the upper shelf on each server, adding an extra U to each one. So adding Ceph would have been a new technology for us, and it would also have reduced our compute capacity by half. We chose to use the NetApp instead. Thanks.

Thank you. Nice talk. I was wondering, for the last case study, it looks like you rebuilt a batch scheduling system, more or less. What was the advantage of using a cloud framework instead of going with Slurm or some other batch scheduling system for the simulations?

So in that case, I've never used a job scheduling system myself, so it was easier for me to quickly build something for them, a just-put-your-code-here situation. A job scheduling system is possibly one of the features we would add, but usually users who need job scheduling go onto the Discover supercomputer instead of the cloud.

And a second, unrelated question: do you have oversubscription on the hypervisors in your cloud, or is it one VM per physical core?

I'm having a little trouble hearing; do we oversubscribe? No, we're not oversubscribing the CPUs. The scheduler is allocating CPUs based on what the host actually has; we're not oversubscribing.

Yeah, I have a question. Do you use any special tool for certificate handling, regarding the TLS termination of the APIs?

Any special tool for the certificates? It's a self-signed certificate; we're not using any special tools right now. We've looked at Barbican, but our focus was on taking the prototype environment and building the production environment with the same features. Now that we've got that completed, we'll be adding features such as Barbican, Trove, or load balancers, whatever pops to the top of the queue that we have time to focus on.

Thanks for the presentation. Do you use any kind of monitoring or auditing tools, given the security requirements for the system?

Yes. For monitoring, we're using Nagios at the infrastructure layer, with the Check_MK plugin, if you're familiar with that, driving Nagios, and it's really impressive. We're able to monitor all the nodes, and since our switches are running Cumulus, we're able to monitor all the switches too; if we have a loose cable, we get a notification. It's really robust for us, and fast. The great thing about the way Check_MK does this is that it makes an SSH call, runs a very small job to gather data, and then processes it back on the Check_MK server, so the node is just sending stats.
It's not doing computation on the node the way an agent in some other monitoring tools would, so the load on the nodes is lower. Going once. Going twice. All right, sold. Thank you.