All right, well, I guess we'll get started. Big crowd. Just so everyone knows, this isn't a talk about artisanal internets. This is not actually a talk about how great it is to work for Craig, either. We're going to talk about running Cloud Foundry. If everyone's here for Cloud Foundry, that's what we're going to talk about. If you want to talk about artisanal internets, go see Dr. Nick.

My name is Tim and this is Neville, and we work for Comcast. What we wanted to talk to you about today is what our experience has been running Cloud Foundry for the last couple of years. If you were here with us last year, we had another talk about kind of the same thing: what it was like running Cloud Foundry back then. We didn't really have too much going on. We had a single foundation, a few orgs, a few apps, not too much traffic. Over the past year it's definitely grown. We've seen rapid growth in adoption; people love the platform. We have six foundations now across the country, we have a whole ton of orgs, and we're running about 900 apps. Some of those are critical applications for us; they're right in the critical path of our customer interaction.

So it's been a really big year. Last year, when we were talking to folks, the biggest question we got was what it was like to run Cloud Foundry. To use an analogy similar to what Greg was talking about in the keynote, it is like flying an airplane. Back then, when we went on our journey, we were taking that leap of faith and jumping off the airplane. Now we're flying the airplane. We have this highly modern, fast airplane, which is Cloud Foundry, that gets our developers to where they're going faster and more conveniently than ever before, and we get to be the pilots. So it's a good opportunity for us, right?
Yeah, absolutely. I mean, we feel fantastic. Driving this, we have the best seats in the house. We've been running it for a couple of years, and every time I walk in there to do my job I feel like Iceman. I don't mean to offend the Maverick and Goose fans, either, but frankly speaking, this is what our customers expect from us. In all reality, though, at the end of the day, this is exactly how we feel: the cabin is hot, we're sweating like Ted Striker, and we are ready to jump out of that airplane. On days like these you can actually find Tim walking around without his shirt. Why, Tim?

Well, not exactly without my shirt, but basically because we're faced with a cockpit that can be a little overwhelming at first. You have metrics upon metrics, metrics here, metrics there. You have a lot of different management interfaces that control Cloud Foundry. We use Pivotal Cloud Foundry, so we're constantly going between BOSH and Ops Manager trying to figure out the best way of doing what. We have backups to worry about, buildpack maintenance, load balancing. And we're a full-service shop, so we manage the infrastructure underneath: we're VMware engineers as well, and we manage the underlying physical hardware. So it can be a lot sometimes.

Also, we can't forget about who's riding the plane with us. Anybody want to take a guess on who these guys might be?

Well, they are actually our developers. I have a feeling that we have a lot of developers here, Tim. We don't have anything against developers. They keep us on our toes, and they're big proponents of the environment. But every time there is slight turbulence in the air, they start looking for life vests.
You know, they start kicking and screaming and they go into crash positions. So Tim, if I may: what kind of oxygen masks have we given them to make them comfortable in the environment?

Right. It's important not only for our developers not to go into crash position; we're sort of in crash position too. So we want to talk to you about some of the tools and automation we've leveraged to make this an easier experience for our developers, and easier for us to support as well.

Absolutely.

The first thing we want to talk to you about is monitoring. I think there have been some other talks about monitoring, and we have our own implementation as well, but it was the first problem we had to solve. We needed visibility into Cloud Foundry. You can't keep running bosh vms all the time to see how things are operating, and you can't keep going into Ops Manager. Our monitoring solution has Nagios at the center. Nagios has a JMX plug-in that pulls metrics out of Ops Metrics. Ops Metrics publishes all sorts of metrics: low-level ones at the operating system level for CPU, memory, disk, all those kinds of things, and also higher-level ones that have to do with Cloud Foundry, like how many requests per second you're getting into your router layer and how many stages you have available on your DEA layer. We pull all those metrics into Nagios and set some thresholds. When those thresholds are met, we can send alerts out to our operations team, but we can also pull performance data off those metrics and forward it to our dashboards, which use InfluxDB.
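As a rough illustration of the threshold-and-forward flow just described, here is a minimal Python sketch. It is not the team's actual plug-in: the metric name, thresholds, host tag, and measurement name are all hypothetical. It only shows how one sample could be classified against thresholds, rendered as a Nagios plugin line with performance data, and rendered again as an InfluxDB line-protocol record.

```python
from dataclasses import dataclass

# Nagios plugin exit codes for the three states used here.
OK, WARNING, CRITICAL = 0, 1, 2
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


@dataclass
class Check:
    name: str     # hypothetical metric name, e.g. requests per second
    value: float  # the sample that was pulled from the metrics source
    warn: float   # warning threshold (assumed: higher is worse)
    crit: float   # critical threshold (assumed: higher is worse)


def evaluate(check):
    """Return the Nagios state for a metric where higher values are worse."""
    if check.value >= check.crit:
        return CRITICAL
    if check.value >= check.warn:
        return WARNING
    return OK


def plugin_output(check):
    """Format one plugin line: status text, a '|', then label=value;warn;crit."""
    state = evaluate(check)
    perfdata = f"{check.name}={check.value};{check.warn};{check.crit}"
    return f"{STATE_NAMES[state]} - {check.name} is {check.value} | {perfdata}"


def to_influx_line(measurement, host, check, timestamp_ns):
    """Render the same sample as an InfluxDB line-protocol record."""
    return (f"{measurement},host={host},metric={check.name} "
            f"value={check.value} {timestamp_ns}")


if __name__ == "__main__":
    sample = Check("router_requests_per_second", 1200.0, 1000.0, 2000.0)
    print(plugin_output(sample))
    print(to_influx_line("cf_ops", "router-0", sample, 1))
```

The alerting decision and the dashboard record come from the same sample, which is the point of the design described in the talk: one collection path feeds both the operations alerts and the graphs.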
Influx is a tool built as a data store for time-series metrics, and it's really good at pulling in thousands and thousands of metrics and letting you query them very quickly. We use Grafana to pull those metrics out and create dashboards. What we end up with is some pretty cool-looking Grafana dashboards, and this is the one that we look at every day. (Even they want to see it.) It gives us a really good, quick visual indicator of what's going wrong with the system.

I'm just going to go through a couple of these real quick. At the top right we have the requests coming in per second into our router layer. This metric is available through Ops Metrics and we can publish it through Grafana. Here you can get an indication of how busy your system is. If you see rapid drops in this metric, that's a quick visual indicator that maybe something's wrong with your load balancer layer. If you have a rapid spike, you may have issues with response time; you can tell that something is very busy or something is running away. So it's an easy, quick indicator.

On the top left we have what we consider our user experience graph. This is response time for all of the endpoints that our customers see: console response time, API response time, application response time. So if you start seeing things like this, you call Neville.

Sometimes. No, he usually thinks it's me who did it, but yeah.

Yeah, exactly. So this is telling us that our dev console is actually spiking in response time over the last few days, so we definitely have to take action to figure out what's going on and bring the dev console back for our end users.

If you go further down, you can see more metrics having to do with HAProxy. You have some system-level metrics for our
router layer, and further down we have metrics having to do with our DEAs, which are important. Looking at DEAs, sometimes you'll see a single DEA spiking, and then you'll get an application developer coming back to us saying, "I'm having terrible performance," and you can say, "Well, it looks like your application is running a single instance, because it's only spiking on a single DEA," or Diego cell, or whatever.

You're right, those are great conversations to have. Over a period of time, what we have seen is that it gets easier to discuss that with your application owners, because they get more familiar with the environment and understand how to use it.

Right. So this is not only providing visibility for us. We can send these off to our application developers and have them see for themselves. If their application is not performing adequately, they can look at our graphs and see that everything looks good on our side, so it allows them to troubleshoot in the right direction. So it helps us do online troubleshooting. But what happens when you do a bad push, or Neville does a bad push, and somehow you delete several jobs, or something goes wrong?

Yeah, absolutely, and that's why you need backups. When we had our first few deployments of Cloud Foundry, we started experimenting with the environment and kept bringing it down.
So we decided, let's do backups. We use Pivotal Cloud Foundry, and God bless the soul who wrote that six-page document about how to perform backups. We had to sit in front of a screen, fingers on the keyboard, try to execute all these statements, interpret what they mean, and hours later you would have a successful backup. During this whole time your application owners are sitting on their hands because they cannot deploy new releases: the Cloud Controller goes into a read-only state while backups happen.

Luckily for us, we found cfops. cfops is an open-source tool that helps automate the backup of a Cloud Foundry deployment. We decided to take it to the next level. Today, when a backup kicks off in our environment, we update our environment status pages and we update our channels to notify users that a backup has been kicked off. We also take metrics out of that: how long did it take for a backup to complete in each environment? That data helps us as well as the developers. It helps us because if, for the last ten weeks, a backup was completing in an hour and today it took five hours, that's an indicator that we need to go and check what's wrong. And it helps our developers because now they have visibility into when backups are happening, so they can make sure their deployment cycles do not fall during the backup period.

We have spoken about monitoring and about backups, but in between the two there's also the need to prove the resiliency of the environment. How do you know Cloud Foundry is working like it's supposed to work?
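The backup-timing comparison mentioned above (an hour for ten weeks, then five hours today) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the use of a median, and the factor of 2 are hypothetical choices, not something the talk specifies, and the backup itself is represented by any callable you pass in.

```python
import time
from statistics import median


def timed_minutes(run_backup):
    """Run a backup callable and return how long it took, in minutes."""
    start = time.monotonic()
    run_backup()
    return (time.monotonic() - start) / 60.0


def backup_looks_slow(history_minutes, latest_minutes, factor=2.0):
    """Flag a backup that took more than `factor` times the recent median.

    `history_minutes` holds durations of earlier backups for one environment.
    With no history there is nothing to compare against, so nothing is flagged.
    """
    if not history_minutes:
        return False
    return latest_minutes > factor * median(history_minutes)


if __name__ == "__main__":
    last_ten_weeks = [60, 58, 61, 63, 59, 60, 62, 57, 60, 61]
    print(backup_looks_slow(last_ten_weeks, 300))  # five hours today
    print(backup_looks_slow(last_ten_weeks, 65))   # an ordinary run
```

A median is used here rather than a mean so that one earlier slow run does not hide the next one; any baseline that reflects a normal backup would serve the same purpose.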
You don't want to wait until the power goes off to find out that the generator is not working. You want that generator tested every day, to make sure it's ready when the time comes and you lose power. We try to do something similar in our environment for Cloud Foundry. We have Chaos Lemur. Chaos Lemur is also an open-source tool, and it helps us destroy components within Cloud Foundry. Destruction is not really a good thing, but what you would expect Cloud Foundry to do at that point is have the VM Resurrector bring the components that have gone down back up and bring the environment into a stable state.

It also helps us with enforcing cloud-ready architectures. When a component within Cloud Foundry gets destroyed, and your application owners, your developers, are breathing down your neck saying, "What is going on? You just lost something. My application is totally unresponsive," that is the time to have a great conversation with them. You want to talk to them about why their application is so dependent on the infrastructure. Can we build more resiliency into that application layer? Can you be multi-site? What other architectural decisions did you make in a previous legacy system that you ported over and that simply don't work in an environment where instances go down? These are great conversations. Just get them to stop screaming first. And like I said before, it becomes easier over a period of time to have these conversations, because they are new to the architecture and it takes a while before they understand how best to operate their applications in this environment.

What we have also seen is that this destruction helps in cleansing. We have seen memory leaks in certain components. We have also seen some funky stuff happening within the environment.
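To make the idea of scheduled destruction concrete, here is a small Python sketch of the selection step only. It is not Chaos Lemur's code or configuration: the VM names, the per-VM probability, and the protected set are hypothetical, and nothing here talks to a real deployment. The second helper shows the check you would repeat after a run to confirm that everything destroyed has come back.

```python
import random


def pick_victims(vms, probability, protected=(), rng=None):
    """Choose which VMs to destroy in one chaos run.

    Each VM that is not in `protected` is selected independently with the
    given probability. Pass a seeded random.Random for repeatable runs.
    """
    if rng is None:
        rng = random.Random()
    return [vm for vm in vms
            if vm not in protected and rng.random() < probability]


def still_missing(expected, running):
    """Return the VMs that should exist but are not currently running."""
    return sorted(set(expected) - set(running))


if __name__ == "__main__":
    fleet = ["router-0", "router-1", "dea-0", "dea-1", "dea-2"]  # hypothetical
    victims = pick_victims(fleet, probability=0.2, protected={"router-0"})
    print("would destroy:", victims)
    survivors = [vm for vm in fleet if vm not in victims]
    print("not yet recovered:", still_missing(fleet, survivors))
```

In a real run the second list should shrink to empty as the resurrector recreates the destroyed VMs; if it does not, that is the generator failing its daily test.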
I usually blame Tim for it. But this process of destroying and recreating all these VMs, or certain components, brings all of them back to a pristine state, and it happens often enough that we see these funky states less and less.

In addition to all of this, it's great to have a Jedi master amongst ourselves, and for us that is Sergei. He's right here with us. Sergei, hands up, right? Sergei is a great addition to our team. He brings a wealth of knowledge, not only on the infrastructure side but also on the application side. He's a great liaison between the application teams who are just coming on board and the infrastructure side, where he's provided some great services that he has written himself. Tim, what have we and Sergei together done to help our lives and our application owners' lives?

Yeah, so it's good to have that visibility, that understanding into the lives of the developers. Sergei's helped us develop some custom tools that we'll talk to you about. The first thing we had to figure out is this: when your developers are moving onto a platform like this, they lose a lot of visibility.
They lose a lot of freedom to log on, do some traceroutes, do some connectivity testing, and so on. Without that capability, Neville and I had to be on phone calls all the time doing netcat, doing traceroutes, making sure that connectivity outbound from Cloud Foundry was there. I don't miss those calls, by the way. In the middle of the night people will wake you up to do traceroutes and things on a conference call.

So Sergei developed some connectivity testing tools. We call it "is connected." Users can go onto Cloud Foundry, put their endpoints into the URL, and it tests connectivity outbound from Cloud Foundry. It's a really quick and easy way for them to see whether they have connectivity to GitHub or any external resources their application needs, prior to going live or as part of their development cycle. It saves us from having to run stuff manually and allows the platform to be more self-sustaining. TCP outbound, traceroute outbound: it really helps out.

Another useful tool is the app metrics dashboards. If you're just getting started with Cloud Foundry and you just want quick time-series data to show how your application is performing, we have quick dashboards, similar to the ones you saw before, where developers can go in, look at app metrics, and get a good indication before having to invest time in building their own tool or using other tools that require a little more bootstrap time.

There's another thing that Sergei has definitely helped us out with. And if you want to talk to him about any of these things, he'll be available for questions afterwards, right?
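The outbound TCP test at the heart of a tool like the one described above can be sketched with Python's standard library alone. This is not the team's implementation, which the talk does not show; the function name and timeout are assumptions, and the traceroute part is left out because it needs raw sockets or an external command.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    DNS failures, refused connections, and timeouts are all subclasses of
    OSError, so each of them is reported as "not connected".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Hypothetical endpoint an application might depend on before going live.
    print(can_connect("github.com", 443))
```

Run from inside the platform, for example as a small app that takes the host and port from the request, this answers the same question the late-night netcat calls did, without needing an operator.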
We didn't talk about that beforehand, but he's good. So we have custom service brokers as well. He has a really good system that leverages Docker to create service brokers on the fly that are pretty much single-tenant. For example, you have a Docker container that contains an entire ELK stack for you, so when you instantiate a service instance, it goes off and creates that instance for you using some pretty cool technology behind the scenes. This has given us a really good method for creating custom service brokers that allow people to do all sorts of things, like create ELK stacks, create idea, or create proxy servers that give them outbound access to the internet from a protected network. So there are all sorts of ways, and it's very extensible. I encourage you to ask questions if you have them after the talk.

Yeah, these are great tools. What you will find is that, over a period of time, once people get used to having everything within the PaaS environment, they don't want to go out and build their own infrastructure as a service anymore. So the more services you can build into the platform and keep them there, the happier they are to get away from that infrastructure-as-a-service layer that they've written before.

Yep, definitely. And outside of the developer side, for our part, we've done a lot of scripting to help automate some of the things that we and our operations team do every day. If you're familiar with Cloud Foundry, with Pivotal Cloud Foundry, it can sometimes be a challenge to do user intake, because there's a dance of the user logging on the first time, then being added to an org, and then logging on again before they have the access they need. So we have scripts that do user intake: they pre-create users and pre-add them to organizations.
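A user-intake script of the kind just described might boil down to a short list of cf CLI calls per user. The Python sketch below only builds and prints those commands; it does not run them, and it is an assumption about how such a script could look, not the team's script. The user, org, and space names are hypothetical, as is the choice of OrgAuditor and SpaceDeveloper as the default roles. Real intake would also need a safer way to handle the initial password than a command-line argument.

```python
import shlex


def intake_commands(username, password, org, spaces,
                    org_role="OrgAuditor", space_role="SpaceDeveloper"):
    """Build the cf CLI calls that pre-create a user and add them to an org.

    Returns a list of argument lists, one per command, in the order they
    should run: create the user, grant an org role, then one role per space.
    """
    commands = [
        ["cf", "create-user", username, password],
        ["cf", "set-org-role", username, org, org_role],
    ]
    for space in spaces:
        commands.append(
            ["cf", "set-space-role", username, org, space, space_role])
    return commands


def render(commands):
    """Turn argument lists into shell-safe strings for review or logging."""
    return [shlex.join(cmd) for cmd in commands]


if __name__ == "__main__":
    for line in render(intake_commands("jdoe", "changeme", "video", ["dev", "prod"])):
        print(line)
```

Keeping the commands as data makes it easy to review a batch before running it, and the same list can be replayed against a second foundation.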
So the first time they log on, their environment's ready for them. And as you could see, we have a lot of sites, so a lot of our users need to clone their environment from one to another. We have some scripts that allow that cloning to happen, so their environment and their spaces all look the same between two different sites, which has been really cool for us.

But Tim, this is all tech talk, right? I mean, we are engineers and we love technology, and we could talk about technology all day. But there's this thing they call social interaction that we are not really good at. Even on Facebook, I mean, I see your selfies, man. You need some improvement.

So in our Cloud Foundry environment we have worked on the social interaction with our customers. That's where transparency comes in. All the metrics that we saw at the start of this talk are exposed to our customers. When they have an application problem, they not only look at their application data, they also have access to all the platform metrics that we see. We are very transparent. We don't have anything to hide, and when there actually is a problem, we try to address it as openly as possible.

I would like to give an example. It was about six months ago, and it was a perfect storm. Two things happened. One, we decided we were going to help our customers by providing more buildpacks in the environment. Two, we had decided, for whatever reason, that monitoring ephemeral disks was not a big deal. Right? When was the last time ephemeral disks brought down the environment? I don't even remember that. But the new buildpacks started caching on the DEAs, and we pretty much ran out of disk space on all our DEAs.
We have a lot of DEAs, just to give you the context. So applications came pretty much to a standstill in that particular Cloud Foundry environment, and application owners had to move some of their traffic out of that foundation into the other foundations we have across different data centers. It was not a really good day for us, I would say. But what we did as an RCA was hold an open forum with our customers and tell them: this is exactly what happened, this is what we failed to do. We also told them the steps we have taken to make sure this doesn't happen again. And, knock on wood, it hasn't happened since, and we're very glad. What it also gave us is the ability to go back and look at the other assumptions we had made. We thought ephemeral disks didn't need to be monitored. What were the other things we had made assumptions on? So we went through that exercise, and we're very happy we did. It was an eye-opener.

When incidents happen, we have status pages where we inform our user base about what is happening within the environment. And we have Slack channels where we keep really close contact with our customers. We have about 275 to 300 people on the Slack channel. What we find over time is that we have all these new customers who are interested in the environment. Through the grapevine they've heard about Cloud Foundry, and they come and ask questions. Before we start replying to them, our existing user community is replying: "Oh, we ran into this problem before, here's the solution." They want to help us out. So today,
So today As we stand here, we are not supporting the environment our slack channels are being manned by our user base You know our developers are helping each other out and there are very few questions that we answer today Yeah, having that kind of community brings together Sort of like this self-sustaining ecosystem of users That can share information and help each other out and we we see that all the time So it's been very important that's crowdsourced, you know, and I think that's that's one of the very very big keys to our success in Comcast Well, I guess that brings us to keys to success When we started making this presentation, I told Tim, you know People shouldn't be going through all these presentations to see, you know, how to succeed in the environment Let's just give them an abbreviation, right? We all we all love abbreviations, right? lol brb lmf a o not not the rap artist, right? RTFM RTFM, okay So I told him I'll make an abbreviation and after days and days of searching. I finally got it from word of the day And so today you guys have a claw What does it mean? It means to achieve brilliant success in something So how do how can you replicate, you know, some of the success that we have had in our environment? So we start with e energize your base Me and Tim have had the the benefit of just walking into a room and starting to talk about platform as a service and Don't have people stare at you like you just told them a claw right, so you want to talk and communicate to your base about what clown foundry is what other benefits Talk talk about it up and down your stack, you know, the people upper management to the to the developers The more educated they are the more interest they will have in How the technology can help their application their kpi indicators if they have apps applications going down because of legacy technology How can they scale quickly? How can they take more load, you know, with very little notice? 
C: change the way you think. When I started this, I was an infrastructure-as-a-service engineer. Greg put me on a two-day POC with all the developers, and I was sitting there thinking, where is the infrastructure piece? It took about one hour to explain how to set up the infrastructure. Today we talk with our developers all the time. We speak about buildpacks, logging, load balancing, how to do connectivity across different zones. So you definitely have to change the way you think if you are coming from an infrastructure engineer role or even from legacy roles.

L: live it. Some people might call it eating your own dog food. Today we have our own cloud portals running within Cloud Foundry. Showing this kind of belief in our environment, running very important portals that customers consume within the environment, gives them the feeling that yes, we can invest in this environment. It gives them the ability to believe that their applications would work very well in this environment too.

A: automation. Like Tim said, there are going to be tons of people interested once you start ramping up, and there is a ton of work you will have to do. There's going to be user intake, there's going to be org creation, quotas. There are going to be people who want new services, people who want to know how to subscribe to services. So the most important thing is to automate as much as possible, so that you don't have to be stuck in the BAU cycle all the time. Then you can concentrate and put that time into creating new services and trying to keep people within the platform.

T: transparency. We just spoke about it.
It's very important to listen to your customers. I think it's one of the biggest keys to our success. Like I said, the more transparent you are, the more they will do the work for you, just like how we are standing right here while they are doing the work for us in our channels today.

Well, I guess that's ECLAT for you. That's a new abbreviation. Tim, what do you think? Have I overachieved?

Yeah, not too bad. Hashtag ECLAT.

So if you guys are interested in what we're doing: obviously, thank you for coming and listening to our talk. If you're interested, we have openings, so come to us or go visit that site. A quick shout-out to Nick, who has a talk today as well, at 2:10 p.m. in this room. So please come see how our journey has continued into our developers' world and how they've changed the way they do business.

So with that, if anyone has any questions, it's time for you. Yeah, there's a mic right there in case you guys have any questions.

Okay, absolutely. Okay, all right. Well, thank you. Thank you, guys.