So, as the last folks are coming in, I'll go ahead and get started and introduce myself. I'll keep it brief, but I think my background is pretty relevant here; it even says so in the title. My name is Cornelia Davis and I work for Pivotal. I came to Pivotal from EMC; as you know, Pivotal is a spinoff from EMC and VMware. I worked in the corporate CTO office at EMC for about seven years, essentially doing emerging tech. I spent a lot of time on RESTful web services and service-oriented architectures, and I worked with a lot of product groups across EMC.

About six months before the Pivotal spinoff, my boss Tom McGuire said to me, "Hey, I want you to start looking at this new PaaS thing." And I said, "Oh great, I always want to learn something new; let's take a look at Cloud Foundry." It was VMware, after all, so it was part of the EMC family, if you will. So I started learning about Cloud Foundry. I first did some research, reading about this platform-as-a-service idea, and everything I read said platform as a service is for the developer. It was all about developer agility, about making developers' lives easier, and I thought, wow, that's really cool; this is all about me, all about my life. So I started working with Cloud Foundry, together with another group within EMC, Gary Frankl's group, the content management group. We played with Cloud Foundry and Documentum together and learned the platform. Then we had the Pivotal spinoff; I worked on another project for a while, and then I joined the Cloud Foundry team. So it's been more than two and a half years that I've been playing with Cloud Foundry, and about two years ago I joined the Cloud Foundry group, in a particular role.
I'm the Director of Platform Engineering there. Platform engineers sit in the product team: we're engineers, we all cut code, but we're field-facing, mostly in a post-sales capacity, sometimes pre-sales. It doesn't really matter which; the point is that we go deep with customers and with partners.

So I started going out, talking to customers, and learning what their challenges were. I talked with them about developer experience, platform as a service, developer agility, all of that, and in the back of my mind I kept thinking: really? They're projecting this huge market size for this area, for platform as a service. Are people really going to drop that kind of cash just to make my life easier? It just didn't compute for me. Well, I spent about a month out there talking with all of you, talking with customers, and I realized that operations was really the hard part, and that the platform in fact had a tremendous amount of value from an operational perspective. So after 25 years in development, I thought, oh my gosh, I'm working on an operations product.

That said, I still carry those 25 years of development with me, and earlier this year, from mid-January to mid-February, I spent a month doing ops. I signed up for it.
I reached out to Tony Hansman, the guy who runs our cloud ops group, and said, "Reporting for duty, sir. I'd really like to spend some time doing operations." And it's as a result of that month that I proposed this topic here at CF Summit. So I'm going to spend the next 25 minutes or so telling you about that month. I'm going to tell you how our operations team works: the tools we use, the processes we have in place, the systems we have deployed, the practices we follow, and so on. Monitoring, of course, is hugely key, so I'll talk a little about what we use for monitoring and how we keep up to speed on it. And throughout all of that, beyond the "let me tell you how we do things" part, I'm going to share a handful of stories, because the stories are where we really have these aha moments, these big insights. And of course, we're going to talk about the platform as it is.

So first of all, what is it that we're operating? What I'm going to tell you about here today, and what I spent one month on the operations team for, is Pivotal Web Services, which is what you find at run.pivotal.io. How many people here have a Pivotal Web Services account? Okay, a handful of you. Go to run.pivotal.io; you can get a free account, start pushing applications, and do all that developer stuff. The team I worked on keeps PWS, or "P-dubs" as we refer to it, up and running 24/7 with zero downtime.

So first let me tell you what that deployment looks like. I'll give you the deployment topology, and I'm going to start from the left, from the perspective of how we stood this thing up: which components do we use to stand up other components within the system?
It all starts with a jump box. In this case, run.pivotal.io is running on Amazon Web Services. As you know, Cloud Foundry runs over a number of different IaaS layers: we do AWS, vSphere, vCloud Air, OpenStack; we have experimental support for other infrastructures as a service, and of course Microsoft just announced that they're going to provide support for Azure. So it's really great. In this particular case, we're on AWS.

We start with the jump box, a virtual machine that we've provisioned through the AWS console. That jump box gives us access to the other boxes in the entire system, and we can use it to lock down access to some of those other boxes. We allow only SSH access into the jump box, using keys, and we manage all the keys: every individual in the operations organization has their own key, registered with the jump box. So I can SSH into the jump box and then run the rest of Cloud Foundry from there.

From the jump box, I use the BOSH CLI to access MicroBOSH. MicroBOSH is all of BOSH in a single virtual machine, and that virtual machine gives me the ability to deploy other clusters. How many people here are familiar with BOSH?
Okay, good, about half of you. BOSH is the subsystem of Cloud Foundry that you use to manage the Elastic Runtime, which is where you deploy your apps, and to manage all of your other clusters: your RabbitMQ cluster, your MySQL cluster, your homegrown time-series database cluster, all of those things. It's the thing that manages virtual machines. That's what MicroBOSH is: that whole system in one single virtual machine.

Now, you'll notice that MicroBOSH is connected to RDS; that's what we use for our database, for persistence. It's connected to S3 as well. So we're externalizing those two data stores, and we leverage the resilience and the SLAs that are baked into RDS and S3 in this deployment. That's something you need to think about when you're deploying and running your own Cloud Foundry instance, whether it's on-prem or in some cloud offering: you need resilient storage.

In the case of AWS, we then use MicroBOSH to deploy what we call full BOSH. The BOSH system I'm talking about is very sophisticated and has many different components: a Director, a Health Monitor, a message bus, and so on. In MicroBOSH, they're all running as processes on a single virtual machine, but you can deploy BOSH across a cluster of virtual machines, for scale, for resilience, all of those things. Well, it turns out that BOSH can be used to deploy BOSH, which is really quite cool. Does anybody know what BOSH stands for? It actually stands for "BOSH outer shell." That's what we do: we're engineers, and we like geeky, self-referential things.
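The chain just described, personal keys on the jump box, the BOSH CLI pointed at MicroBOSH, and MicroBOSH used to deploy full BOSH and then Cloud Foundry, can be sketched with BOSH v1-era CLI commands. This is only a sketch: the host names, IP addresses, and manifest file names here are hypothetical, not the actual PWS values.

```shell
# Sketch of the access and deploy chain (BOSH v1 CLI; hosts, IPs, and file names are made up)

jump() {
  # Only SSH is allowed in, with a personal key registered on the jump box
  ssh -i ~/.ssh/ops_key ops@jumpbox.example.com
}

deploy_full_bosh() {
  # Run from the jump box: point the CLI at MicroBOSH, then use it to deploy full BOSH
  bosh target https://10.0.16.5:25555
  bosh deployment bosh-full.yml   # this manifest externalizes persistence to RDS and S3
  bosh deploy
}

deploy_cf() {
  # Then point the CLI at full BOSH and deploy the Elastic Runtime (run.pivotal.io)
  bosh target https://10.0.16.10:25555
  bosh deployment cf-prod.yml
  bosh deploy
}
```

Each hop in the chain is just the same `bosh deployment` / `bosh deploy` pattern pointed at a different Director, which is what makes the "BOSH deploys BOSH" trick so natural.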
So, BOSH outer shell: we use MicroBOSH to deploy full BOSH. Once full BOSH is deployed, I can use the BOSH CLI to connect to it, and notice that full BOSH is also connected to an RDS and an S3. Finally, I use full BOSH to deploy run.pivotal.io. So right there you can see that Pivotal Web Services is deployed using these BOSH systems on the left-hand side. Whether you're doing it on-prem or in the cloud, you will use BOSH to deploy Cloud Foundry. And if you don't, you're insane, because BOSH is so freaking cool that it keeps all these things up and running for you. If you haven't seen it, look back a year: I did a five-minute lightning talk on the four levels of HA, two of which come from BOSH, and they're really, really cool. So have a look at that. Again, you'll notice that that deployment also connects to RDS and S3.

The other things we use from Amazon, which you would need even on-prem, are DNS configuration, for which we use Route 53, and some type of SSL termination point, something that handles SSL and serves up certs. We use Elastic Load Balancers, and we stand up one ELB per domain, one per cert. You can see that we have one for cfapps.io, one for run.pivotal.io, and so on; we have about a dozen ELBs handling the different domains in the system.

Okay, I see a lot of people taking pictures. By all means keep doing that, but I promise I will put these slides up on SlideShare this afternoon. You can find me as cdavisafc; I use that handle everywhere: Twitter, SlideShare, everywhere.

Okay, so then, what are some of the principles? The first thing I'll tell you is that we do deployments during regular working hours. How many people here are in ops and have to do deployments from midnight until four in the morning?
When you have BOSH, you don't have to do that anymore. We intentionally deploy during regular business hours, because BOSH has so many safeguards built in that if something goes wrong, you can roll back to a safe state. We've also put a lot of processes in place, and I'll talk about a few of those. The other advantage of deploying during regular business hours is that the developers of the system itself are on hand, so if something goes wrong, we can actually go over to the people who are building the runtime code and say, "Hey, come take a look at this log with me," and get it fixed. You don't have to be on your own at two in the morning doing a deployment all by yourself. That's what DevOps is all about: let's work together on this.

The other thing I'll tell you is that we categorize our deployments into a number of different types. For example, there's a new release. If you're going from v204 to v205, that means some of the Cloud Foundry components, the Health Manager, the Cloud Controller, the Loggregator, are going to get revved; some or all of those components change. That's one type of deploy: we know we're revving components. Another type might happen after something like Heartbleed comes along: we're not revving any components, we just want to switch out the operating system underneath, and BOSH allows you to do that. So we have a deployment type for that. And then we have what's called a manifest-only deploy: I'm not changing the OS, I'm not changing any components, I'm just changing my topology. I need a bigger cluster or a smaller cluster. Those are manifest-only deploys.

Now, I make this distinction to point out two things. Generally, new releases and stemcell upgrades take a little longer, so we start those in the morning.
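The three deploy types map onto slightly different BOSH v1 CLI flows. A hedged sketch, with release, stemcell, and manifest file names invented for illustration:

```shell
# The three deploy types as BOSH v1 CLI flows (file names are hypothetical)

new_release_deploy() {
  # Revs some or all CF components (Cloud Controller, Health Manager, Loggregator, ...)
  bosh upload release cf-205.tgz
  bosh deployment cf-prod.yml
  bosh deploy
}

stemcell_only_deploy() {
  # e.g. after Heartbleed: swap the OS image underneath, components unchanged
  bosh upload stemcell bosh-stemcell-aws.tgz
  bosh deploy
}

manifest_only_deploy() {
  # Topology change only: edit instance counts in cf-prod.yml, then redeploy
  bosh deployment cf-prod.yml
  bosh deploy
}
```

The first two flows upload a new artifact before deploying, which is part of why they take hours; the manifest-only flow just reconciles the running deployment against the edited manifest, which is why it finishes in minutes.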
We don't start them at three o'clock in the afternoon, because they take longer and we want to get through them during regular working hours. Manifest-only deploys, where I'm going to add some DEAs, say, or add or remove some other component, generally take on the order of minutes.

So let me tell you the first story. This was great: it happened on the very first day I started on the cloud ops team. I showed up for stand-up, and right after stand-up we had an incident on run.pivotal.io. The Sundance Film Festival is one of our customers on run.pivotal.io, and this was two days before the festival was going to open. They were opening up a block of tickets and expecting a surge in traffic. A couple of days before, they had started planning for this: they had scaled out the number of instances of their application in anticipation of the spike. The spike in traffic came as expected, and you know what? The app worked flawlessly. No problems whatsoever.

However, we had trouble behind the scenes: we were dropping log messages. We have the Loggregator component that aggregates logs, and we were dropping log messages because with the added capacity and the added traffic came added log messages. All of a sudden we were like, oh gosh, we didn't think about scaling out the Loggregators. So what we did was pair ops: the pair that I was part of, another pair, and somebody from the Loggregator team got on a hangout.
We started looking at things, and we did a manifest-only deploy to scale our Loggregators. Let's see if I've got the right slide: you can see here that we have Loggregators across two different availability zones, which is what z1 and z2 are, and we scaled from 10 instances per zone to 20 instances per zone. All of that took less than an hour. In less than an hour we were able to respond to an incident like that and recover from dropping logs on the ground, and by the way, the customer never even knew. I'll talk a little more about that as we go along as well.

Okay, so that's where we are. The other thing I'll point out is that those new-release deployments can take hours. How many people have done a deployment and watched the package compilation take a while? Okay, so compiling the packages takes a while. We've arranged our pipelines, and I'll show you a picture in just a moment, so that when we're doing a deploy into production, we are not doing compiles anymore; the packages have been precompiled by stages earlier in the process. In fact, this is the slide here. On the left-hand side it talks a little about the dependency between the CF runtime and the services, the runtime team and the services team. But what's really key is that we have a number of different Cloud Foundry instances that drive our pipeline. There's some material here about how one is for the runtime team's development and one is for the services team, but the key is right here: we have a non-prod system, a staging system, and that's where we do deploys before we go to prod, obviously. But here's the kicker: we have a shared package cache, so that when we do the deploys on the non-prod system, the package compilation happens and gets stored in the shared package cache. Then, when we do the deployment into prod, we draw from that shared
package cache, and we save ourselves all of that time in package compilation. That's a very pragmatic technique that you should be using in your own environments to speed up prod deployments. It makes a huge difference.

Okay. Oh boy, I'm so far behind already. So, very quickly, the other thing I'll mention is that we have checklists for each of those different types of deploys. When we do a deploy, we go into GitHub. Everything is in GitHub: the checklists are in GitHub, infrastructure as code is in GitHub, everything. In those checklists we have a number of pre-deployment steps, which include things like generating final releases and double-checking that I've got the latest out of GitHub. Then I have deployment steps, where I start by logging into the jump box, pull down from Git again (I'm using Git across the board), log into BOSH, upload releases, and all of that. And then I have my post-deployment steps, which are things like publishing the final releases (those of you who are open-source users are probably leveraging the final releases you find with v204, v205, the YAML files; we generate and publish those) and updating the checklist with anything we've learned that maybe wasn't documented before.

All right, so let's talk about monitoring. Now I've got Pivotal Web Services up and running. What am I using for monitoring? How do I know this thing is still working, and how do I know when something's gone wrong? Well, there are a couple of things. First of all, all of the components in Pivotal Web Services, in the Elastic Runtime, are configured with syslog endpoints.
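As an illustration, pointing a component's syslog drain at an aggregator is just a manifest property. A hypothetical fragment, with the property names following cf-release conventions and the endpoint made up:

```yaml
properties:
  syslog_daemon_config:
    address: logs.example.com   # hypothetical Logstash endpoint
    port: 5514
```

Because it is only a property, different jobs in the deployment can be pointed at different aggregation accounts, which is exactly the trick described next.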
Here's the first trick: those syslog endpoints don't all need to point to the same endpoint or the same account. We in fact have one account for Logstash, where almost everything goes, probably everything, and we have another one for the LAMB team; the LAMB team is the Loggregator team: logging, monitoring, metrics, all of that. So we have syslog messages going out to Logstash. On the other side, we have the Collector, which is an internal component of Cloud Foundry, and it sends metrics over to Datadog. We use Datadog, but you can use a number of different things; if you've got JMX dashboards, you can use those with Ops Metrics as well. So that's what we've set up.

Then how do we use it? Well, here's what a dashboard looks like. You can see all sorts of things: DEA status, Diego status, router status, and so on. And the thing I'm here to tell you is that if you want this dashboard, you can have it. It's all open source. A couple of months ago we open-sourced all of the Datadog configurations for Cloud Foundry, so whether you use Datadog or not, you can go look at the configuration of this dashboard and at the metrics we use to keep everything up and running in operations.

Okay, story number two. This one's kind of interesting, in that it happened on a day when I was part of a pair updating that dashboard. What was cool was that Datadog has a WYSIWYG editor, and we get to work in staging. Oh, and just like everything else, this is another principle: we never do anything directly in prod. We do everything in staging first. But we were going to compare what's in staging to what's in prod, so we checked out something that was in prod.
We deployed it to staging, and it broke. And the reason it broke was that there was a bug. Let me tell you where the bug was: in Datadog. Datadog is not really designed for continuous integration. It's not designed with the principle of "I want things to move through a lifecycle; I'm going to build my dashboard for staging and then deploy that same dashboard into production." It's not designed for that, so we built that on top. You can see it in that open-source repository: we take the dashboard we create in staging, and then, watch this, we do a little transformation and deploy it into prod. We had a bug in that transformation. We had to build that ourselves, layering continuous integration on top of Datadog, and we had a bug there and we fixed it.

I just want to contrast that with our platform, the Cloud Foundry platform, and how it is designed for continuous integration. When you're working with Cloud Foundry, we expect you to set up a number of different spaces, maybe even multiple Cloud Foundry deployments, and you want to be able to move the same artifact all the way through those different stages. We take care of the abstractions; they're in the environment (env) configuration and in the services abstractions. So that was a really good little lesson on continuous integration around the ops process.

All right, so we have those things in place. How do we use them? Well, Datadog allows you to define alerts, and those alerts are tied to PagerDuty. PagerDuty, of course, is connected to a person, so a person gets paged, and that person starts doing things like looking at the Datadog dashboards and at the logs that have been aggregated. They start their troubleshooting.
They might send out a And there's a picture of log log stash and I don't have time to go over the details But this is the tool that they're using to do that troubleshooting And then they might send something out to a status page. We might if you get a text message It says something's wrong with pdubs. They might put something out on the status page and then finally One thing that's really important is that we have a set of smoke tests that are constantly running in prod We are constantly every 10 minutes. We deploy an app. We tie it to services. We scale that app We access the app and so on we have a set of tests that we're constantly running that we make sure are running and if those tests fail Then that goes into datadog and into pager duty and somebody responds to it And then finally we also use pingdom to make sure that the console app is up and various apps are up on pivotal web services So this gives you kind of a landscape. Oh, and by the way, there's one other thing We also have bots That will put things from the the uh alerting mechanism into slack So we have slack bots as well So this kind of gives you a topology of the entire monitoring system that we use to keep things up and running Now what I want to do here for a moment is I want to pause and point out That platform what we've been talking about so far is platform operations and you might ask well, what about application operations? So pivotal network runs on run dot pivotal dot i o The console the app manager runs on run dot pivotal dot i o All of those things are running on run dot pivotal dot i o Do we as an ops team handle that? 
Actually, we don't. We handle platform operations. We're the bottom half that keeps Pivotal Web Services up and running; the other teams, the console app team and the Pivotal Network team, keep their applications up and running. We've really broken that out, and that's what the platform enables for you. When you see the slides (I won't go through it in detail here), you can see the different roles and responsibilities that the different types of developers and the two different types of operators have.

All right, I'm coming down to the end here, and I have one more story to share with you. This one is my favorite. Toward the end of my month, we were doing a full deployment, a new-release deployment, and at morning stand-up I said, "I want to be on that team." So I sat down with my colleague Kai and said to him, "God, I hope something goes wrong." And he said, "What? What are you saying? You can't possibly mean that." And I said, "No, really, because I want to learn." Jim's here, another of my colleagues from the cloud ops team; I wanted to learn, and you learn a lot more when something goes wrong.

So we started our deployment, and I can tell you that a couple of hours into it I was thinking, this is so boring. We're looking at things, cleaning up little things here and there, doing administrative stuff, but mainly we're just watching it, and nothing's happening. And that's pretty typical for Cloud Foundry: you do a deployment and it's pretty dull. Then we started updating the DEAs, the runners. We got a long way into updating the runners, and all of a sudden, one of the runners failed. And that should never happen. Thank you.
That should never happen. If one of them works, that's the whole point of canary-style upgrades: if one of them works, if 10 of them work, the 11th one should work as well. So what on earth went wrong?

We started looking at things. We SSH'd into one of the runners that had worked fine, runner number 94, let's say, and we SSH'd into runner number 96, and we started comparing them, looking at what was different. Because, by the way, we had looked in the logs, and the logs were telling us that we had a port conflict. So we logged into those boxes and started looking at what the ports were bound to, and this is what we found. On the healthy DEA, the BOSH agent was bound to port 15560, and the directory server was bound to port 34567 (very creative: 3-4-5-6-7). On the unhealthy DEA, the BOSH agent was bound to 34567, and the port binding failed for the directory server. Obviously, you can't have two different things bound to the same port. So we were like, what is going on here?

To explain: the BOSH agent is the first thing that starts. It's always the first thing that starts, and it starts listening right away. So the BOSH agent had already grabbed port 34567. And so we're looking at this.
We're saying, why is it bound to 15560 the first time and 34567 the next time? Once we found the port conflict, we went to our best source: we went to talk to Dmitri. Dmitri is BOSH. We got halfway into our sentence, and he said, "Ah, I know what the problem is." (If I could do a Russian accent, I would, but I can't.) Dmitri always knows what the problem is, and he pointed us to something called ephemeral ports. There's a Wikipedia article. Ephemeral ports are basically a range of ports reserved for dynamic assignment: when a port is being dynamically assigned, it can be drawn from that range safely. Punchline: you should never statically assign a port in that range. Well, that statically assigned port, 34567, is in that range. But look how big the range is: it's something like 30,000 ports wide. We had never hit this bug before because the range is so huge. So that's the lesson on ephemeral ports, and I got my wish: something went wrong, and I learned something in the process.

So, just wrapping up in the last 30 seconds here: BOSH is awesome, and the experience was awesome. There are all sorts of positives here, immutable infrastructure, all of these great themes. And the final thing I'll leave you with is that I have been blogging about the experience. All three of the stories I shared with you this morning are written up, so you can find them at blog.pivotal.io and read about them at a slower pace and in a bit more detail. I thank you all for your attention, and I'll be around for both days of the conference, so please seek me out if you have any questions.
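A footnote to the ephemeral-port story: on Linux you can read the kernel's dynamic port range directly and check whether a statically assigned port, like the directory server's 34567, falls inside it. A minimal sketch; the fallback value shown is the common Linux default range, used only if the proc file isn't available.

```shell
# Read the kernel's ephemeral (dynamic) port range; fall back to the common Linux default
range=$(cat /proc/sys/net/ipv4/ip_local_port_range 2>/dev/null || echo "32768 61000")
low=$(echo "$range" | awk '{print $1}')
high=$(echo "$range" | awk '{print $2}')

port=34567   # the directory server's statically assigned port from the story
if [ "$port" -ge "$low" ] && [ "$port" -le "$high" ]; then
  echo "port $port is inside the ephemeral range $low-$high: unsafe to assign statically"
fi
```

Running a check like this before picking static ports avoids exactly the kind of rare, hard-to-reproduce collision the story describes: the agent was handed 34567 dynamically, and the directory server's static bind then failed.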