All right, so it's 2:30, so we'll go ahead and get started. I've got the dubious honor of following lunch on the last day of the conference, so hopefully no one falls asleep. Bear with me, and thank you all for coming. We're gonna talk about extreme ops, kind of a play on extreme programming: how we do platform operations at the Home Depot. I probably should have worn an American Gladiators costume or something like that. Probably would have worked out better. Actually, I forgot to put my married name on this. I'm Kevin Nigarashi Ball. I'm a staff software engineer for the Home Depot. I work with the App Platforms team. We're primarily focused on delivering platforms to our software engineers throughout the enterprise, one of those being Cloud Foundry. You'll also notice that the title does say software engineer. That's kind of important as we progress through this talk. We see ourselves as a software team that happens to deliver infrastructure, versus an infrastructure team that just happens to write software. Other titles for this presentation could have been "You can do it, we can help," "More developer enabled, more saving, more doing," or "Let's do this." By the way, I'm Dan from Bringing More Home Rebuckets Home by my husband. So this is the closest I could get with one. So, a quick glance at where we are as a platform team right now. We are running six foundations and growing. Currently we have about 1,200 engineers. That depends on where you wanna look for that number; it could be higher or lower, so I tend to go off of conversations. In our largest foundation, we're running about 1,500 applications with 4,000-plus instances. If you look at non-production, double or triple that number; we don't really count those. So let's go back. Our journey with Cloud Foundry started about two years ago, and as a team we've had a couple of different application platforms.
Adopting Cloud Foundry, and not to go off of the buzzword of the day, kind of coincided with the start of our digital transformation. As part of that, we started to look at what's really impacting our software engineers. What is keeping them from delivering value? A couple of common themes came out. A lot of it centered on matrixed people: having to reach out to someone else to do something for you, and not being empowered to do your own thing. Around this came a couple of different pain points. One of them was tickets. You may have seen the tagline: no queues, no tickets. We really took that to heart. We said, okay, you're gonna come to Cloud Foundry and you're not gonna have to put in a single ticket to interact with us. You may be asking yourself, how is this possible? How in the world can you not have a ticketing queue system to do work? Isn't that how I monitor things, how I know what my volume is, whether I'm getting better or worse? But tickets also kind of detached us from our guys. Anyone remember Tom from Office Space? That's what tickets really do. Tom works really well: "I take the specs from the customers and I give them to the engineers." Engineers, do we like to talk to people? Sometimes. Sometimes I love just coming in, putting my headphones in, going for the day, and being fine with it. But Tom's big caveat was that he has people skills. And really, tickets took away our ability to have people skills. How many of you have had to fill out a ticket and then have someone call you to figure out what your details were? Yeah. Anyone ever had a ticket that was self-service the entire way through, where you never had to talk to the person on the other end? Yeah, it never works. You always have to do that.
So trust me, I came from a help desk background: it doesn't work. You have to talk to someone. So why should I have you fill information into a ticket when I can just pick up the phone and call you, or you can call me? We'll get to where that kind of breaks down. So we took on a thing called conversational support. Really: talk to me and we'll figure out how to help you. This is not just picking up the phone. We're heavy users of Slack. Some of this is email. A lot of this was also road shows: going to the different floors of the development teams and saying, "Hey, let me come to your stand-up and see what your pain point is right now," or, "Hey, this is something really cool we're introducing in Cloud Foundry; here's how it might help you." We also found this worked its way into some workshops. When you're new to Cloud Foundry, how do you cf push? How do you look for your logs? Things like that. And we found out from those workshops that not only did teams not know Cloud Foundry, they didn't know Git, and they didn't know Slack. If we hadn't been talking with these people, we wouldn't have known those were issues we needed to help cover. So that workshop morphed into a kind of modern development workshop that includes, hey, here's how you engage with people on Slack: be kind, and please don't @channel everyone in a massive channel. To look at where we are right now: our main Cloud Foundry channel, where all our interactions are and where our community lives, has about 1,200 developers. That's where the number came from. It goes up and down. But as anyone familiar with running Cloud Foundry knows, it's not just, hey, here's the foundation, that's what I'm running. There really is a whole team, especially in an enterprise, that you're gonna interact with.
You're gonna deal with your networking guys, your infrastructure guys, your security guys, your monitoring people. They're all there, and they need a place to talk. The big main channel is a little too much for talking to these people, so we spun off another channel called PCF Operators. This is where the nitty-gritty happens: hey, let's talk about SNATs and NATs and all the other fun stuff. You'll notice it's not a private channel. It's a completely open channel for anyone to join. We have engineers sitting there watching us talk about the deep intricacies of Cloud Foundry. Conversational support really comes from extreme programming: trying to do face-to-face conversation, trying to shorten that feedback loop as much as we can. We'll get to some other ways we do that so we know things are working. But with that, anyone ever been overwhelmed by messages in Slack? That happens a lot. I quickly became the PCF dude, because I'm the one who's been working with the foundation this long. Getting 20 or 30 DMs in a day happened easily, and it's like, please go ask in the channel and I'll answer you there so someone else can search for it. Really, the whole team could spend its time just talking, just doing support items. So we introduced the concept of an interrupt support pair. We already pair program, so why don't we just pair on support? It's maybe not the most glamorous thing in the world, but it does work, especially as you're growing a team. As you're growing a team of operators, you want that anchor bringing a junior person from the team along and actually working through support with them: hey, here's how we go through BOSH and troubleshoot these things, or here's where you find logs and where they're stored on the Diego cells.
Here's where they're stored and how you look for those. And oh yeah, you can blow that file away without hurting anything, but don't touch that one. So at the top of our channel we put, hey, here's who's on support this week. Go talk to them, bug them, and leave the rest of the team alone to iterate on deploying new features in Cloud Foundry. Or, as I said, we're a software team: we build things that go around Cloud Foundry. One of our big things is building what other people don't want to build, like getting identity information. It's kind of nice to have a centralized team that can work on those things. And being the support pair isn't a horrible thing, because you're not supposed to have anything coming through. You're not on the roadmap, you're not in Tracker, you're not the one trying to get planned work done. So while you're sitting there watching Slack, you can work on some side stuff that's not as important, or on some of those out-there, what-if-we-had-time projects. It gives us a little flexibility that we normally wouldn't have. So you have all this conversation, and you have someone handling it. But what you really see is that a lot of the questions aren't really about Cloud Foundry. It's just, I'm running on Cloud Foundry, so I'm going to ask the Cloud Foundry guys because they'll know about it. A lot of it is, hey, how do I do this in Spring on Cloud Foundry, where Cloud Foundry is just that little piece bolted on the end. That's where we started to take it from this little support pair to building our community. We have a huge community; it's all 1,200 developers. They range from people who understand Ruby to hardcore Spring guys. There are some awesome monitoring guys in there who are like, "Oh, this is how you can find these things if you go to this report."
I wish Slack gave me a way to show a graph of how much our team doesn't talk in there anymore and how much the community is contributing. It doesn't. But really, that's where the value is. I can answer your question, but if you post it there, we have a conversation about what you're running into, and two days later, when someone else joins the channel with the same question, it's there in search. Trust me, that's usually my first question: have you searched Slack? Have you gone into the Cloud Foundry channel and searched? Because usually that question's already been asked. And like I said, there are a lot of cases where people ask, "Has anyone done this in Cloud Foundry?" and other people can help. That's where we started shaping the community around the question: what's the difference between help, support, guidance, and advice? Our job as operators is to operate the platform. That's where our SLA comes from. Our goal is to ensure that the platform's up and running and serving traffic, and that your apps can push. But like I said, we're software engineers as well. I may not be able to support you, because I'm not the one sitting there with your application, but as a fellow engineer I can help you a little bit. I can point you to some articles that I found really beneficial. We have other people who post guidance, like, maybe you shouldn't try that, because horizontal scaling's better than vertical scaling, and let me tell you why that doesn't work well when you try to place massive VMs. That gets into what we call community management. Now that we've built this massive community, how do I get people who aren't on my team to answer questions for my team? And how do we surface the questions that really do matter?
Like, "Hey, I'm seeing this huge error. What does this actually mean?" But you can't do it entirely by talking. You can't sit and talk to people in Slack and say, oh, that's great, it works, I can solve the world's problems. As everyone should know, Cloud Foundry is a big hairy beast sometimes when you go looking for where an issue is. So how do you find it, especially when you're the support pair getting pinged all the time with "what's the problem?" We lean heavily into automation for everyone. This gets back to extreme programming again: fast failures. One of our big questions is, how do you know the platform is working? When someone says, "I can't push an app," how do you prove that? How do you prove that it's you, not me? It usually works the other way around. The way we prove it is we do it ourselves, constantly. We run tests every five to ten minutes that exercise everything on the platform. So if the router stops responding, we're gonna get alerts on that. If you can't push code because a buildpack is offline, we know about it. And I would rather get woken up by that page saying, hey, this is what's not working, than by someone saying, "Yeah, I can't push an app. I don't know why. Please join this call." That's a no. The other thing we want to start doing is automating upgrades. I like to say I have artisanally handcrafted foundations right now. They're unique. They're snowflakes, they're pretty. Love them. They're not really that unique, but it takes a bit of time to do upgrades. Wouldn't I love to just have a pipeline that said, hey, there's a new version of this out, let's send it to sandbox, take those smoke tests we had, and run them on there for like five hours. See what happens.
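A minimal sketch of the self-test idea described above, in Python. This is an illustration, not the team's actual suite (later in the talk that is described as a modified CF acceptance suite). The names `run_checks` and `alert_message` and the check names in the usage note are hypothetical, and running the checks concurrently is an assumption made here. Each check is just a function that returns True when the thing it exercises works.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# A check returns True when the platform behaviour it exercises works.
Check = Callable[[], bool]


def run_checks(checks: Dict[str, Check]) -> List[str]:
    """Run every check concurrently and return the sorted names of the failures."""

    def safe(check: Check) -> bool:
        try:
            return bool(check())
        except Exception:
            # A check that raises counts as a failure; it must not crash the runner.
            return False

    names = list(checks)
    with ThreadPoolExecutor(max_workers=max(1, len(names))) as pool:
        results = list(pool.map(lambda name: safe(checks[name]), names))
    return sorted(name for name, ok in zip(names, results) if not ok)


def alert_message(failures: List[str]) -> str:
    """Build the page text so it says what is broken, not just that something is."""
    if not failures:
        return "all checks passing"
    return "failing checks: " + ", ".join(failures)
```

A scheduler (cron, or a CI pipeline on a timer) would call `run_checks` every five to ten minutes with real checks such as "router responds", "buildpack available", or "app push succeeds", and page someone only when `alert_message` reports failures.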
And then if everything looks good, then yeah, maybe we send it to non-prod. That's my pipe dream for the next couple of months, but we gotta get there. Because, as I said, we're growing. When you keep adding foundations, it gets really hard to keep producing that handcrafted artisanal product, and you have to start stamping out consistent iterations. The other thing we lean into heavily is self-service. This comes in many forms. I wouldn't say from day one; on day one it was, let's apply traditional platform ops and lock everything down. You can't get in, and you have to ask these people to be granted access. We did that for a little bit, and it works. People aren't in love with it. So for our non-production environment, we said, hey, what if all you had to do was authenticate, and you could create an org on your own? You don't have to talk to anyone. We'll give you a space, we'll give you two gig. Here, go play. And we saw adoption skyrocket as soon as we did that. There are good things and bad things about the resources being used, but obviously it got more people on the platform. Once we had self-service, we started asking, how do you do some of this tricky networking stuff that you used to have to talk to someone to do? Again, as a team that writes tools for people who don't usually want to write tools, let's write something that deals with that networking layer and automates it for you, instead of you filling out this Excel spreadsheet. We all love Excel for some reason. I don't; I've done horrible things in Excel, trust me. Really, really bad things, like generating 35-page web pages for a morning report. Yeah, not pretty. But let's automate these things.
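The upgrade pipeline wished for above (new version goes to sandbox, smoke tests soak on it for hours, and only then does it move to non-prod) can be sketched as a simple promotion gate. This is a hedged illustration in Python, not an existing tool: `STAGES`, `roll_out`, and the `deploy` and `smoke` callables are hypothetical names, and `rounds` stands in for the soak duration (a five-hour soak at one run every ten minutes would be 30 rounds, with a sleep between them).

```python
from typing import Callable, List

# Environments in promotion order. The names follow the talk; stopping at
# non-prod is an assumption based on what the talk describes.
STAGES: List[str] = ["sandbox", "non-prod"]


def roll_out(version: str,
             deploy: Callable[[str, str], None],
             smoke: Callable[[str], bool],
             rounds: int = 3) -> List[str]:
    """Deploy `version` stage by stage and stop at the first failed soak.

    Returns the list of stages whose soak passed.
    """
    passed: List[str] = []
    for stage in STAGES:
        deploy(stage, version)
        # Soak: every one of the repeated smoke runs must pass.
        # all() stops at the first failing run.
        if not all(smoke(stage) for _ in range(rounds)):
            break
        passed.append(stage)
    return passed
```

The point of the gate is that a release never reaches the next foundation unless the previous one survived the full soak, which is what lets upgrades be stamped out consistently instead of handcrafted.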
When you give the developers that access, say, "Hey, go," and get out of the way, Cloud Foundry generally keeps things running. Now you're not asking me for help, so I can iterate on those hard-to-solve problems faster, or get those upgrades, which take a while right now because they're increasingly handcrafted, out into the world a lot faster. A lot of this comes down to, how do we know it's working? How do I know that this is actually working, beyond anecdotes? Because I've now removed my whole ticketing system. With a traditional help desk background, I can take tickets and plot them. It's easy to see, and I know whether something's working. A lot of it does come down to anecdotal evidence. It's going and talking to people. Again, going back to conversational support, and we're kind of applying our own stuff here, you ask them: does this work? Does it work for you? The overwhelming experience has been very positive. It's been a big plus. We like it because we don't have to deal with a ticketing system, and most developers like it because they don't have to deal with one either. Instead of asking for their initial space, they ask for quota increases. They ask, "Hey, wouldn't it be great if Cloud Foundry would do this?" and it's like, it's coming in this release, awesome. You get out of the small problems of "let me do this for you" and into "how do I do these new, cool things?" And we've really seen huge adoption, where people talk more positively about the platform instead of, hey, this old legacy thing that required this financial code and this other thing to get working. It works really well. So that's all I have, in the spirit of being conversational support.
I would love to take questions and talk about what I can: how we've done things, and wherever you might have questions. So I'm happy to open that up to the floor. I have a microphone. Yeah, so the question was, how do we run a health check on the foundation? We have a couple of different things. We run a modified version of the CF acceptance suite that runs smoke tests. What we've done is break that out into a pipeline that we can run concurrently. So we test: are the routers working? Is authentication working, is UAA responding? Can I auth? Then we do a full end-to-end run: app deployment, scaling, are logs coming out? From that, we can get alerts and see whether it's actually working. We also test some of the administrative stuff: can I create orgs and spaces? Can I remove them? Can I change quotas? Things like that. We also have a set of status bots that look for latency on the network. If a component is getting overloaded, we'll get a latency trigger that says, hey, this Cloud Controller is overloaded right now because you decided to move 400 containers at once, or you're draining 20 or 30 cells. Crap. That's what we use. We use a combination of a lot of different things, and we're looking to expand. Anything that would require a human to look at, we want to automate as much as we can and run as part of the smoke testing suite as often as possible. Yeah. How large is your team? If you look at the core team that supports the platform, it's about six or seven people. That being said, there's a larger team of probably about 20 if you add in networking, infrastructure, things like that, but those aren't part of our team. They're just very close partners. I would say start by just opening up a public channel.
If you're heavy Slack users, it's easy to go to Slack. Even giving them an email address works, but that still has that asynchronousness. A lot of it is just talking in the open, and that can be anything public; it's just being able to talk to people. One thing that was always kind of frustrating with a ticket is that you just submit it, and then you have no idea about the two or three hours on the back end, where there are probably four or five different discussions going on before the response gets back to you. If all that happens in the open where everyone can see it, then even if it does take four hours to solve something, at least they're seeing the actual progress being made. And you're introducing people: I'm just the front for core Cloud Foundry, and this may actually be something in the networking layer, so here's the networking guy. If you have these kinds of things in the future, you may not need to talk to me; just go talk to the networking people. Just get everyone to talk more in the open. So, one second. Yes? How do you keep- How do you, for what? Do real work. Doing real work. We're kind of fortunate that we're separated off a little bit now, but when we weren't, you'd have the support pair sitting right here, like, here they are. If someone walked up to the rest of us, it was, go talk to them, I'm in the middle of doing something. It's still a little- The walk-up side. So the interrupt pair would do both. They would do walk-ups or, mainly, like I said, since we're heavy users of Slack, Slack support. If they're chatting with someone in Slack and someone walks up, it's pretty much, you're gonna have to hold on a minute, I'm talking to this other person. It's rarely a case where they're so completely heads-down that they can't give them at least five minutes.
But there are some cases where it's like, we're really into a weird issue, so can you come back in five or ten minutes? Or, where do you sit? Let me come find you and talk. Because usually they're gonna rope in a couple of other people as well. I always love the walk-ups where four or five people show up at the same time. It's like, I'm outnumbered. Help. So, yes. Hi, you guys, especially with developers having a little more control of what they can do with platforms, like networking or spaces, things like that, how does it become a little bit better? Yeah, I can just sort of agree. So as far as what's open and available to them: non-prod, which is our development environment, is open to everyone to do work. The production environments are locked down. You can't just log in to production and push code. There are change controls around that, and you do have to have documentation in order to go to production. We prefer that to go through some kind of CI/CD pipeline. Get it automated; don't manually ship it. We don't necessarily restrict that. We try to allow as much as we can, but for compliance and regulatory reasons we have to be able to say, no, you can't touch prod. But on our legacy system, even going to non-prod required another team. So being able to go directly there and push my own code from my local workstation is something a lot of people have really benefited from. Go for it. Yes. So then people can see the data? Yes. We have this one Slack channel where usually all this happens. We have spawned some channels out and folded them back in. If it becomes a question that we get often, we have a knowledge base that we post those things to.
We also use pinning a lot in Slack, but ideally you don't want your pins to grow, so if something is constantly referenced, we'll drop it into the knowledge base and remove the pin. So there are two places you can search for that information. So if we're not able to solve it, it's kind of fun, but no: if it goes to the support pair and they're not able to solve it, or they need to pull in someone more senior, they can pull us off our work to help them. If that's not enough, we'll reach out for help from the community or from different people. It just varies case by case. A lot of it is low level, like, hey, I need this kind of thing, or how do you do this? So. Non-production, non-functional testing environments and all that stuff, so how do you organize all that? Yeah, so we do full separation of a production and a non-production foundation. As far as orgs and spaces, we don't necessarily dictate how teams do that. We prefer to have self-organizing teams. If they want to organize by business unit, they can, or if they want to name the org after the application, that's completely up to them, and they can define spaces inside it as they wish. We try not to take an opinionated stance on that. We run PCF everywhere, so we run Pivotal Cloud Foundry. Do you find it difficult managing all the different? Not necessarily. Like I said, right now we don't have a lot of automation for upgrading, but on the administrative side, we've got it tied back into some of our identity providers, so we don't have to worry about, hey, this is how you authenticate, and some of those things. Really, our biggest question from opening it up is, hey, where are you asking this question from? Like, what foundation?
So we can target the right place to look. Same thing from a ticketing perspective: hey, can you tell me where this is? It's okay, I probably can't say. Where are we running Cloud Foundry? What are we running it on? So we run on-premise and private cloud. We run in-house, in our own data centers, on private cloud. So you're good with that? Sorry, Tony's my manager, so I'm getting the green light from PR on whether I can talk about that. So we run on vSphere on-premise. We are looking at public cloud as well. But for us, the goal in delivering to developers is that Cloud Foundry is Cloud Foundry. You just need to tell it which foundation you're talking to. We take care of everything under the covers. Yes. It's something that we're definitely considering. It depends on our auditing and regulatory controls: is it good enough, and are they comfortable enough with it? But it's definitely something that we're considering. Yes. That's kind of an easy way of saying our capacity is shrinking or our needs are growing and we need to add that counter to the team that's overseeing our understaff or something like that. When you've lost that signal and now you're in fully conversational support, number of Slack messages doesn't indicate complexity. There's no time to resolution unless you're doing something on paper or something else. And you've found ways in order to try and figure out how to balance that meeting to staff up to the right level and what's the signal that you use up that just aren't really going to work? Yeah. So the question is, since you've lost that metric from ticketing, how do you figure out whether the team is healthy, and things like that? Really, for us, you start monitoring the team itself. Is the team happy? It becomes kind of that. And that's kind of a weird DevOps emotional kind of thing.
But it really is: you can tell when the workload's getting to be too much, because the team changes slightly. You can tell when it's, okay, I'm done with being on support, or when the team's getting bogged down on other development tasks and the support pair isn't rotating, so you just keep telling the same people, you're gonna do support right now. Then you can start to see, okay, we're understaffed. Things are underperforming, and we need to start rotating again because we're holding anchors on other projects. It's one of those emotional things; you can kind of tell as the team gets overloaded. Yes? Those are lots of devs and lots of apps here. We are, so at the moment, we have not. That is definitely something we are looking at more, definitely from a showback perspective: here's how much you're costing, hoping that helps people realize that if we wrote things differently, maybe it'd be a little cheaper. But our biggest thing was driving adoption, getting people to migrate to Cloud Foundry, because we saw that as a real strategic advantage. Yes? Support pair, so I'm assuming that means two? Usually two. How many developers and how many app teams are you supporting with that? That's about 1,000 to 2,000 devs. And it's not always a pair. Sometimes they fly solo, depending on team strength, but there's usually someone dedicated to support. We do run some data services concurrently with it, but primarily we're focused on runtime. Our team is a subcomponent of a developer enablement group, which manages the other side of the house with DevTools, so people can go to both groups. As far as excellence goes, we really try to have people talk to the other people who are doing this. And usually, from our channels, we know who's the Spring expert or who's really vocal in Spring.
Or, hey, these are the IaaS people who are really good at running it, or here's where your Oracle people are. We have a lot of Slack channels for people to do that. You're about to try to put some stuff that I designed that as an application. Do we have anything like that? So our team supports Cloud Foundry. We don't operate the application layer. That's up to the application teams, to either do their own operations or have their own ops team. Our goal is to say, hey, Cloud Foundry is running and your apps are being served. So we don't necessarily specify, hey, this is the foundation you should be on. Sorry, not foundation: the framework or the language you should be using, or whatever. It's what makes sense for your team. Yeah, we just run it. We make sure it's up and running, and then you bring your apps to us. We'll host them. So I hate to cut it off, but we're past time. Thank you so much.