Nope. Oh, is your clicker...? Where's the clicker? Okay, let's see if that works. Nope. Yeah, I just need to get to the next page. There we go. I'll just do this. Is it working now? And to go back, I go up. Okay.

So what's the MOC? It's a public-private group that got together to say: we want to create a public cloud. We have about 400 users, and some of those users are serving over 10,000 other users, so they're running things on the cluster that support lots of other researchers. It's hosted at the Massachusetts Green High Performance Computing Center, a two-acre data center in Holyoke, Massachusetts. It's LEED Platinum certified, and one of the cool things about it is that it has high-speed fiber connecting the entire northeast region.

Our planned growth over the next six months or so: the Northeast Storage Exchange is going to be standing up a 20-petabyte data lake at the MGHPCC, and it's going to be integrated with the MOC, so you'll be able to do compute against really large data sets. IBM is putting in eleven POWER9 servers; each is going to have a terabyte of memory and four GPUs. There are about 5,000 cores coming over from Harvard over the next six or eight months. And, as we just discussed in the past couple of days, we're really excited that the Red Hat data hub is going to be standing up on the MOC.

You guys probably already knew this: most of today's clouds are run by one company, and they basically don't share their data. That makes it really hard for us as researchers within the university setting, but it also has an impact on folks who want to do optimizations, for example all of us working on open source. So four or five years ago, Peter and Orran Krieger got together with a couple of other researchers and thought there should be a different model possible. The idea of an open cloud exchange is that you have a mechanism that lets you take compute from one provider, GPUs from another provider, and storage from yet a different provider, and plug them all together so that they plug and play. The ideas behind this also lead to the concept of a level playing field, so that a big company and a small company, or you or me, could create our own services and offer them to anybody.

So as we thought about that, we asked: what would be necessary to create a minimal viable product? We need production cloud services. We need a method for single sign-on that allows us to use resources across multiple providers. We need a billing model. We need elastic hardware. And we need research federation. The MOC is a production cloud service; we offer single sign-on through Keystone; we have a billing model; we have elastic hardware; and we're working on research federation.
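To make the single sign-on piece a little more concrete, here's a minimal sketch of what authenticating against a Keystone endpoint looks like using the standard keystoneauth1 library. The endpoint, user, and project names are all made up for illustration; this is not the MOC's actual configuration.

```python
# A minimal sketch: get a Keystone-scoped session that any OpenStack
# client can reuse. All names and URLs here are hypothetical.
from keystoneauth1.identity import v3
from keystoneauth1 import session

auth = v3.Password(
    auth_url="https://keystone.example.edu:5000/v3",  # hypothetical endpoint
    username="alice",
    password="s3cret",
    user_domain_name="Default",
    project_name="research-project",
    project_domain_name="Default",
)
sess = session.Session(auth=auth)

# One session, one token, reused across services: that reuse is the
# practical heart of single sign-on across providers.
print("Got a scoped token:", sess.get_token()[:8], "...")
```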
So why do we actually care about this, as essentially a group of research universities and private industry? We care because you can't do open-source development, and you can't do research, if you don't have access to real data, real users, and real scale. And we all know the story of the cathedral and the bazaar: if we open things up, we end up with improved efficiency and more competition. It's a better outcome for everybody. That's why we care.

Why should the community care? Well, there are a lot of reasons, but one of the big ones we keep running into as we keep breaking the MOC is that there's no place to run continuous integration and deployment at scale. So there are lots of issues that don't get found until they show up at a customer site, or, in our case, at the MOC. Some of those are interface changes that get missed between different projects, or projects getting out of sync. Some only begin to show up at scale. Some are because the implementers were solving a different set of problems than what we need in running a commercial cloud. And some of it is just that, as you get to large numbers of users, you start to see, as I mentioned before, really different scales and types of problems.

So we're going to look at four ways that we've broken the cloud. I'm sure there are more, but these seemed like good illustrations. We've had major storage failures. We've run into problems trying to stand up telemetry for our cloud. We've run into problems trying to do chargeback and showback. And I think we're among the early folks doing a lot with OpenShift on OpenStack, and we've run into problems there too.

So I want to tell you about the great MOC Ceph upgrade of August 2017. It was a really simple plan: do it in August, when there aren't a lot of users, because we're a research cloud, right? We were trying to essentially triple the size of our Ceph cluster. Nothing could go wrong here, could it? Unfortunately, or should I say fortunately, nobody was eaten by a velociraptor. But seriously, it was a disaster in every way possible. Well, I don't know about every way possible. Nobody died, nobody got eaten. But let's get this out of the way up front: we lost all the data. What's the one thing you want your cloud to not do? You don't want it to lose your data. And yes, Peter's pointing out that the 170-terabyte cluster was three-quarters full.

There's some good news. We did a really good job of telling people, make sure your data is backed up, and pretty much all the users were okay. That's really good. But there's a really embarrassing thing I need to share with you. We believe in eating our own dog food, so our web page was running on our own VMs, backed by Ceph, and not properly backed up. So we lost it. Yeah. Might as well get it out of the way: we were the people primarily affected.

So how did this actually happen? Remember, this was our plan: triple the scale, triple the size. Here's some really important background. Our cluster started out as a Fujitsu appliance, and it had a lot of placement groups, something like ten times as many as would be recommended for something that wasn't a Fujitsu appliance. It's also important to note that we weren't changing versions. We were just adding storage, and there's a pretty well-known, clear mechanism that takes advantage of Ceph's failure recovery: you add an OSD, it remaps everything and spreads the data out. What we didn't know is that that version of Ceph also stored every bit of history for every change you'd ever made, and no one had ever tried that with so many placement groups before. And there was one more thing: if you're adding a whole bunch of OSDs, you want to minimize the number of times you drive an hour and a half out to Holyoke to put drives in, so we put a bunch of drives in all at the same time. And we got a livelock: the cluster would come up, it would crash, it would come up, it would crash.
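For what it's worth, the lesson we took away is the standard advice: bring new OSDs in a few at a time, at low CRUSH weight, and let the cluster settle between steps, rather than adding everything at once. Here's a minimal sketch of that pattern, assuming the standard ceph CLI and that osd_crush_initial_weight is set to 0 so new OSDs start empty. The OSD IDs and weights are made up, and this is not the script we actually ran.

```python
# A sketch of throttled cluster expansion: raise each new OSD's CRUSH
# weight in small steps and wait for all PGs to go active+clean before
# moving more data. Assumes new OSDs were created with weight 0.
import json
import subprocess
import time

NEW_OSDS = [24, 25, 26]   # hypothetical newly added OSDs
TARGET_WEIGHT = 1.8       # hypothetical final CRUSH weight per drive
STEP = 0.2                # increment per round

def ceph(*args):
    return subprocess.run(["ceph", *args], check=True,
                          capture_output=True, text=True).stdout

def cluster_settled():
    status = json.loads(ceph("status", "--format", "json"))
    states = status["pgmap"]["pgs_by_state"]
    return all(s["state_name"] == "active+clean" for s in states)

for osd in NEW_OSDS:
    weight = 0.0
    while weight < TARGET_WEIGHT:
        weight = min(weight + STEP, TARGET_WEIGHT)
        ceph("osd", "crush", "reweight", f"osd.{osd}", str(weight))
        while not cluster_settled():
            time.sleep(60)  # let backfill finish before the next step
```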
What made it particularly painful is that we got RAM, and we drove out, well, not we: these guys. Peter drove out and upgraded the RAM. But it turned out, as we learned later, there probably wasn't enough RAM in the universe to help us in this situation.

So how did it end? We reached out to Red Hat and the community for help, and the community and Red Hat said, "Let us help you," which was awesome. I'm pretty sure folks didn't say it to us, but said to themselves, "Oh my god, why weren't they using supported bits? What the hell were they thinking?" But we were using supported bits at that point. Okay: as I said, I'm the storyteller here. An unsupported config, unsupported bits. See, I'm learning new parts to this story.

So how did we fix it? I don't actually know how long this went on. Was it weeks? Ah, yes, Rado, who's our infrastructure engineer. This was planned so it wouldn't impact his vacation; it was like a month before he was leaving, right? Three weeks? A month? We've learned a really important lesson: don't schedule anything even remotely near when Rado is going on vacation, because every time he goes on vacation, bad things happen.

The first try involved a script, because there was no way to recover from this other than manually. So there was a script we tried to create, with a lot of help, and unfortunately there was a bug in it. The way it was described to me was that the Ceph cluster became Swiss cheese: four-megabyte holes got punched throughout, in the master boot records and all over the place. So we rebuilt the Ceph cluster from scratch. Now, this is a really interesting thing, and there's something really magical about the open-source community, because you can actually reach out to the guy who essentially wrote Ceph and say, "Hey, are we doing this right this time?"
It's really kind of amazing. So thank you, Sage.

Then there was the smaller Ceph outage. This one will be quicker. Data centers are never supposed to lose power, right? Well, Peter's going to tell us how this one actually happened, because he has more details on it than I do.

[Peter] Okay, so basically you need some type of fire suppression in a data center, and there really isn't anything better, especially if you're trying to meet code, than sprinklers. At the same time, if you have 400-volt AC distribution to all the racks, you really don't want to sprinkle 400 volts. And you also don't want water sitting in the sprinkler pipes, because over time some sprinkler heads will fail and you'll sprinkle expensive equipment and 400 volts. So they have this fancy setup with a high-tech smoke alarm that sucks air from above the racks into a central detector, and the sprinkler pipes are normally empty. In the first phase, when the smoke alarm fires, it fills the sprinkler pipes, and then the sprinkler heads work like normal, heat-activated. However, once water enters the sprinkler pipes (again, we don't want to sprinkle 400 volts, especially if I'm in one of the racks) it triggers an emergency power off. No generator, no nothing: emergency power off.

There's also a feature to test the valves that trigger the emergency power off, which was used when they were building the building. There are some little valves, I think they look like the ones for your garden hose, that let water in past each of the main valves so you can test that the power-off trips. Which they did, before there were any computers there. Now, one of those had been cross-threaded, and it started leaking a little bit. So some maintenance guy saw a leaking faucet. Mind you, this is on the computer floor; it's like a hundred decibels, two acres of racks of computers. He cranked the valve harder, trying to get it to stop dripping. Evidently that did the opposite: the water went through the valve, and it triggered the emergency power off. So he must have been standing there when the whole floor went quiet. I can't repeat whatever it is he said, I'm sure. It's not a situation I would want to be in. But it wasn't his fault.

Anyway, that's one of the more fascinating outages I've heard a root cause for, and it took a while to figure out why it happened. There was basically an open conference call from the moment it happened until the moment it was resolved. When the power came back on, this was the summary of the bug: power goes out at your data center, and the VMs come up without Ceph access. It turns out there was a configuration problem we didn't realize we'd made: the newest VMs didn't have permission to break the old locks. It was a known issue, and by the time we figured out that it was a known issue, we'd basically manually unlocked all the images. So the cluster in this case was unavailable roughly overnight.
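The manual cleanup looked roughly like the sketch below: walk the images, list each one's stale locks, and remove them so the rebooted VMs can reattach. This is a sketch, not our actual runbook; it assumes the standard rbd CLI, and the pool name is made up. (As we understand the known issue, the longer-term fix was making sure the client keys have permission to blacklist the dead clients that still hold the locks.)

```python
# Break stale RBD locks left behind by clients that died in the outage.
# Pool name is hypothetical; run against real data with great care.
import json
import subprocess

POOL = "volumes"  # hypothetical pool backing the VM disks

def rbd(*args):
    return subprocess.run(["rbd", *args], check=True,
                          capture_output=True, text=True).stdout

for image in json.loads(rbd("ls", POOL, "--format", "json")):
    spec = f"{POOL}/{image}"
    locks = json.loads(rbd("lock", "ls", spec, "--format", "json"))
    # Newer rbd releases emit a JSON list of {"id", "locker", "address"};
    # older ones emit a dict keyed by lock id, so check your version.
    for lock in locks:
        rbd("lock", "rm", spec, lock["id"], lock["locker"])
        print(f"removed stale lock {lock['id']} on {spec}")
```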
These are some of the folks who helped us. Thank you, Michael Kidd. Thank you, Emma and Janatha. And thank you, Lars, who works for Hugh and spends a huge amount of time helping us make the MOC operationally more and more rigorous.

So that was our major storage failure, or rather, our storage failures. Next: telemetry data. We care a lot about telemetry data, and we care about it because we want to do research. A lot of the interesting research right now is coming not so much out of academia as out of private companies, because they've got the data. We want that information to be available to researchers, and we want it to be available to the community, so you guys can do optimizations on things. So we figured we should do something about this. In late 2015, early 2016, we stood up Ceilometer, and it was very slow. It led to network slowdowns, and it led to situations where jobs wouldn't finish because the network was so slow. So we investigated Monasca, and Monasca looked pretty good. We were using it pretty reasonably.

But the thing is, remember we talked about wanting to do chargeback and showback. Showback is when you go to the partner universities and you say: look how much compute and storage you're using; we'd really like you to help fund it, support it, all those sorts of discussions. Chargeback and showback seem pretty straightforward. AWS, when I was there, sent me a bill every month. I couldn't always understand it, but they sent it. We didn't realize that we were on the bleeding edge, and it wasn't even as if what we wanted to do was that hard, from our perspective.

So this is sort of the report we'd like to generate; I'm going to pull out the parts that matter for the MVP. We wanted a report per project, with some pretty standard things: the project name, the sponsoring organization, per-VM compute usage, and per-VM memory usage. I think I left off storage, like block storage. Pretty simple. We wanted a second report, projects by institution, so that we could go to Harvard, to MIT, to BU, and say: this is what's going on, and it really makes a lot of sense for you to put more money into this. For that, again, the MVP was pretty simple: project name, project lead name, compute usage, memory usage (I think I also left off storage usage there), literally one line per institution with each of those things. Seemed pretty simple.

Originally we were going to pull the data together ourselves and generate the reports; remember, at this point it was probably late 2017. Then we heard about CloudForms, and it looked really cool. There was just one thing: it needed Ceilometer. So we went back and looked at Ceilometer again, and it wasn't easy, but we became convinced that it was actually really performant, and in fact I think it's fair to say it has been. We started running it on a test system, and it seemed to work pretty well. But then we tried to generate reports, and it was hard, because our data was living in different places. Some of it was coming from RabbitMQ, around CPU usage. Some of it was coming right out of Ceilometer. Some of it was coming from yet another database file. And they weren't all accessible via CloudForms.
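Just to make the MVP concrete, here's roughly the shape of the per-project report, sketched in plain Python, assuming the raw usage samples from Ceilometer, RabbitMQ, and the rest have already been pulled into one place. The sample format and all the numbers are made up for illustration.

```python
# Aggregate per-VM usage samples into the per-project MVP report:
# project name, sponsoring organization, compute and memory usage.
from collections import defaultdict

# (project, sponsoring org, vm, CPU-hours, memory GB-hours): hypothetical
samples = [
    ("genomics", "Harvard", "vm-01", 120.0, 480.0),
    ("genomics", "Harvard", "vm-02",  75.5, 302.0),
    ("oceans",   "MIT",     "vm-07",  40.0, 160.0),
]

projects = defaultdict(lambda: {"org": None, "cpu_h": 0.0, "mem_gbh": 0.0})
for project, org, vm, cpu_h, mem_gbh in samples:
    row = projects[project]
    row["org"] = org
    row["cpu_h"] += cpu_h
    row["mem_gbh"] += mem_gbh

print(f"{'project':<12}{'org':<10}{'CPU-hours':>12}{'GB-hours':>12}")
for name, row in sorted(projects.items()):
    print(f"{name:<12}{row['org']:<10}"
          f"{row['cpu_h']:>12.1f}{row['mem_gbh']:>12.1f}")
```

The projects-by-institution report is the same aggregation keyed by organization instead of project. The hard part was never the arithmetic; it was getting all the samples into one queryable place.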
We were talking to some of the guys who were really deeply involved with CloudForms, and they would say, "Oh, that's a really cool use case. We should figure out how to do that. But it's not going to be in the next version." So we went back and said, okay, we're going to generate our own reports for now.

And there was another thing that was frustrating. CloudForms did something kind of cool: it updated itself. But there was something messed up with our Ruby gems configuration, and it took us several days to get it working again.

This is really critical, both for research and for doing an open cloud exchange, and I want to take a second to dwell on it; that's kind of why I made it bold. If you want to offer services to everybody, and you want to be able to cover your costs, there has to be a basic billing model. There has to be a mechanism that allows us to charge for usage, and it's just really hard to do right now. All the folks who are doing this are writing custom things. So let me make a plug: if there's anybody in the room working in this space, please raise your hand and come talk to us, because we would really like to collaborate on it. It's worth noting that we found a couple of groups within Red Hat and elsewhere that are working on this, and we're going to be reaching out to them to figure out whether we can work more closely together. What we have now works for an MVP, but it isn't going to work when each of us wants to offer a service and get some remuneration for it. So we've got to fix that.

Here are some of the folks who helped us. Again, thank you, Lars. Thank you, Adam Young. Thank you, Jason Rittenauer, and thank you, Sumayn Chen. These guys were really, really patient and really, really helpful at pointing us in the right directions.

So we've talked about major storage failures and telemetry. I want to come back to telemetry for a second, though. One of the things that's been fascinating is that the research we've been doing on the MOC has, to a great extent, begun moving off of things like Monasca and Ceilometer and more towards tracing, things like OpenTracing. What's really interesting, talking to Mania, who spoke about this yesterday, and talking to Raja, who is, is it accurate, your PhD advisor? No, Peter's your PhD advisor. Raja's doing a lot of work on tracing. The folks working on the OpenTracing stuff are already thinking about the fact that, hey, with OpenTracing we're going to be able to do a really, really detailed level of chargeback and showback and billing. So this is an area where we think there's a lot of interesting overlap between the research and the larger community, and we'd really like to figure out ways to work on that more.
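The idea there, roughly, is that if every request is traced, and each span carries tags for who made it and what resources it touched, then chargeback falls out of the trace data at whatever granularity you like. Here's a minimal sketch using the no-op tracer from the opentracing Python package; the operation and tag names are made up, not any standard.

```python
# Spans tagged with project and resource usage: fine-grained billing
# data as a side effect of tracing. A real backend (e.g. Jaeger) would
# replace the no-op Tracer below.
import opentracing

tracer = opentracing.Tracer()  # no-op tracer; records nothing

def handle_read(project_id, gb_read):
    with tracer.start_span("object-store.read") as span:
        span.set_tag("moc.project_id", project_id)   # who to bill
        span.set_tag("moc.bytes_read_gb", gb_read)   # what to bill for
        # ... serve the actual request here ...

handle_read("genomics", 1.5)
```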
So: OpenShift on OpenStack. I don't know that we're the only people doing this. In fact, I know we're not, but it feels like it a lot of the time. And this is just a fun one: shift on stack on split stack, because we're going to be running split stack, which is a TripleO deployment model. So what are some of the things we've run into that have broken OpenShift for us?

One of them: we went to update from 3.7 to 3.9 in an OpenShift-on-OpenStack environment, and at least for us it didn't work, and as far as I can tell there are real problems with it. It's not just us, in other words. Here are some others, and I should call out Rob, our subject matter expert on this, who isn't here; if you want deeper details, I'll get contact information and we'll have a discussion about it.

We ran into some fun stuff with Ansible. We started out by just using IP addresses to get the information over to Kubernetes, but some of the different parts didn't like that, so it turned out we really needed to have DNS, just for OpenShift. I think you probably already know this, but DNS name resolution between OpenShift and OpenStack: there's stuff going on there, and it's really hard.

Certificates have been another really huge area of pain for us. Some of that's probably because we were using unusual certificates, I don't know, but one of the things we ran into: it turns out that if the first system, the one with etcd, gets corrupted or goes down, then all of your certs fail. Probably not where we want to be, and probably not where we'll be in three months, but these are the things we've run into trying to run a production cloud.
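Given how often certificates bit us, the kind of check we wished we'd had in place is simple to sketch: watch when the certs presented by each master and etcd host expire, before an expiry or a dead first master takes everything down. The hostnames and ports below are made up, and this assumes the cryptography package is installed alongside the standard library's ssl module.

```python
# Report the expiry date of the TLS cert presented by each cluster host.
import ssl
from cryptography import x509

HOSTS = [("master-0.example.edu", 8443),   # hypothetical API endpoint
         ("etcd-0.example.edu", 2379)]     # hypothetical etcd endpoint

for host, port in HOSTS:
    # Fetch the presented certificate without verifying it; these are
    # often cluster-internal CAs, so normal verification would fail.
    pem = ssl.get_server_certificate((host, port))
    cert = x509.load_pem_x509_certificate(pem.encode())
    print(f"{host}:{port} expires {cert.not_valid_after:%Y-%m-%d}")
```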
So again, really, really critical: I thank all the people who've been amazingly helpful about this, starting with Dan. I don't know if Dan is in the room, but if you haven't met him, he must be one of the most patient and smart guys I've ever come across. Raffiella, Savetta, Sam Padgett, Aaron Woodacamp, and Daniel Jeffery: some of them, just in the past week or so, have become involved to help us solve some of these cert problems, and we're really grateful for that.

So, some things we've learned. This is a really important one that I wish we didn't do as often as we do: any time we think a project or a feature will work, we ask, are any current customers using it this way? It turns out that often they are, but often they aren't, and usually we just ignore the answer to our own question, which is on us. Five minutes? Okay. If Rado's out of town, bad things will happen; we learned that too. This one is really important for us, because we're very grateful for the close relationship we have with Red Hat: when we open a ticket, if it's critical, we ask somebody at Red Hat to help us nudge it along. Lars, I don't know if you guys know Lars, but Lars is this wonderful developer who's helped us a huge amount, and any time I start to write an email, I literally hear him in the back of my mind going, "Why is this not a Trello card?" That's a really powerful thing, and it's really important.

I'm pretty sure I forgot to say thank you to some of the many people who helped us, both within the community and within Red Hat, so please let this thank-you go to the people I may not have known about or forgotten. And he wasn't supposed to be here, so this next one is also really important to point out. Please, you, sitting right there: if you see him, say thank you. But don't tell him why. It'll be pretty amusing if he doesn't know.

Any questions? Anybody who wants to work on some of these cool problems we've been talking about? Nobody's raising their hands. It's okay, we'll hunt you down and find you. Okay.