All right. So, I'm Chisa, and I am the research and development technical program manager, based in London, for Pivotal. Today we are running a panel, so I'm actually going to let the panellists introduce themselves, and I'm also going to let them describe exactly what we're talking about. We'll start with introductions, and then we'll go through and discuss what the goals of this conversation are.

Howdy. Okay, so my name is Deborah Wood. I am the product manager for the team that runs platforms in Pivotal; one of the platforms that we run is for Pivotal Tracker in production. The conversation today is around the platform that we ran for the Sport Relief campaign. We ran the platform, Armakuni wrote the software to collect donations, and we're going to talk about some of the things that we did to mitigate fear, given that we were distributed across separate companies. The Comic Relief thing is a pretty high-pressure night of TV: the donations collect millions for charity, and you don't want things to go up kaboom. So we had a couple of pragmatic things that we did that we think might be helpful for you.

Hi, I'm Xenon. I work for Armakuni; we're a cloud-native software consultancy. I've been involved with this particular project for much longer than that, for about the last eight years, doing the donations platform for Comic Relief. It's been quite a joyful journey. And yeah, I'm the COO of Armakuni, but we're quite a small company, so that generally means I just do all the crap. So that's me.

My name is James Win. I'm a staff engineer for Pivotal Cloud Ops in Dublin. This was the second year we did the Comic Relief platform out of the Dublin office, and we had to do a few different things this time.

Great. So to start, and we've already talked about the fact that we are effectively going to look at incident handling, I'd like to start with Xenon. Tell us a little bit about the problem that you saw, and the opportunity. Can you just tell us what happened?

Okay. So, for those that don't know, Comic Relief is a UK charity, and once a year they have a TV show that runs for about seven hours, and they collect a lot of the money during that TV show. About seven or eight years ago we had to rebuild the platform. It was an old Java platform that had 12 different organisations coming together to help deliver it, about 35 to 40 people. Tin would arrive on the 1st of January, you'd get everyone together, some of the big players, you know, Oracle, Cisco, IBM, etc. They'd all come together and patch the old platform together and just hope like hell that it would still work. In 2011 it reached the peak where it almost broke because of the level of traffic. So we realised we needed to rewrite it. Armakuni pitched to rewrite it, and we tried a few of the platform-as-a-service providers out there and settled on Cloud Foundry.
So for the first few years we ran it on open-source Cloud Foundry, across a private vSphere environment and across AWS, and then a couple of years ago we partnered with Pivotal and we now run it across PCF. That worked really well. But whereas before it was a single team that was doing all of the development and supporting the platform as well, a proper single DevOps team, that changed when Pivotal came in, because Pivotal were running the platform and our team were doing the development. So we had to change some of the practices and the things that we did.

So, Deborah, can you expand upon that, and be a little bit more explicit around the role that Pivotal played in this particular instance?

Yeah. So Pivotal was providing the PCF foundations. This year we changed it up a bit: we had a multi-foundation setup, three PCF installations. One was serving as a canary environment that we could just push product updates to, and then we had two PCF installations that served as two different regions for the actual production receiving of donations.

What we did a little differently this year was we wanted to experiment with allowing internal product teams to liaise directly with the application developers, to understand the use cases of how they were running the software that we write. So this year we had PKS, which was our Pivotal Container Service; we were getting ready to go GA on that, and the team were interested in participating just to get firsthand empathy for what it is to run software on PKS. We also had Healthwatch, which is the tool that we have to observe platform behaviour, so that team was also participating actively; they were personally pushing their software to the environment. We also had Redis. So what we were doing, from the product side of Pivotal, is we saw this as an opportunity to have internal Pivotal teams working directly with the Armakuni dev teams, to experience the highs and lows and interesting moments of getting ready for a very high-profile, high-sensitivity TV telethon. I think there's no better way to learn how to do things right. So that's what we were doing differently this year, but with the same objective.

And then, James, can you start us off, and anyone can also add in, in terms of how exactly that played out, step by step?

I'm not sure it played out step by step, but one of the things that was very different for us this year, as Debbie mentioned, was PKS. We also had a MongoDB, and we were running it on PKS. This was probably the first production load ever to be run on PKS, so people were quite anxious, to say the least, about this. So what we did was put together a process to try to alleviate people's fear. We started off with simple drills where people receiving a page didn't have to do anything; it was just: hey, I now know what a page looks like, and I now get some sense of the feeling of being called in the middle of the night.
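A drill like that can start as small as a script that fires a deliberately harmless test alert. Here is a minimal sketch, assuming a PagerDuty-style Events API; the routing key and wording are hypothetical, and this is an illustration rather than the panel's actual tooling:

```python
# fire_drill_page.py - send a harmless test page so on-call folks learn
# what being paged looks like. Assumes a PagerDuty-style Events API;
# the routing key and summary text below are hypothetical.
import json
import urllib.request

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "YOUR-ROUTING-KEY"  # hypothetical per-service integration key

def send_test_page(summary: str) -> None:
    event = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": "fire-drill",
            "severity": "info",  # clearly a drill, not a real incident
        },
    }
    req = urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print("Page sent, status:", resp.status)

if __name__ == "__main__":
    send_test_page("[DRILL] No action needed - this is what a page looks like")
```

The point of step one is only that the recipient sees the page and does nothing; the severity marker and the "[DRILL]" prefix keep it unambiguous.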
Then we increased the complexity each time, testing out things like access, making sure people had the right permissions and so on, all the way up until we built an exercise whereby someone had to go in and actually try to debug a fake Mongo issue. Through this we were able to build up an awful lot of confidence, because at the start nobody actually wanted to support the Mongo instance. That was a big problem for us.

That was probably the interesting part of this exercise. In the previous years we were running Mongo on a particular tile, and we'd had to tweak that tile, but that was how it had been done last year, and there were clear understandings of who owns what and who handles incidents in this particular component. This year, because we wanted to use the opportunity to run on PKS, it was more complicated for the Armakuni team, because while they were the experts in MongoDB and troubleshooting it, they were not, and could not be expected to be, experts on PKS. So they were like: grand, I know how to troubleshoot a MongoDB issue; I don't necessarily know how to get there, and I certainly don't know how to get there under pressure at two in the morning if something goes bananas, so I can't reasonably be expected to hold operational responsibility for this. In the past, yes; but now I can't. And we were in a bit of a stalemate, because we couldn't use the tile from last year, as it had gone out of support. We could use the latest, greatest, patched update of MongoDB if we ran it on PKS, but then the knowledge required to troubleshoot an incident was spread across three teams. We had the platform team, which is my team; we had the PKS team, because my team aren't experts in PKS and the PKS team are; and then the Armakuni team were the experts in MongoDB. With the night of TV coming up very quickly, it took us almost six weeks just to decide: worst-case scenario, something happens on MongoDB, it is a bit more complicated this year, and everyone's very afraid of this hot potato. No one can be expected to have all of the knowledge should something go down.

It becomes an interesting challenge, because of that kind of separation of duty. For the last few years that we've run it with Pivotal, the platform team haven't really known anything about the app. It's 26 different microservices; it's eventually consistent, with things going to queues; and the MongoDB could actually go away and it wouldn't be a problem: we'd still take the money (there's a sketch of that flow below). It would be harder to report on how much money we had, but we'd take the money, and that's the primary goal of the platform. But then you move into a situation like we had now, where there was a lack of clarity, not so much about who's responsible for what, but about what the process actually is if something goes wrong. How do you get the information? How do you get to the place you need to get to, to find out what's going on, and where does that responsibility lie?
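A minimal sketch of the "still take the money" shape Xenon describes: accept the donation, enqueue it, keep a Redis-style running total, and let a worker persist to MongoDB whenever it is healthy. Every name here is hypothetical; this illustrates the pattern, not Armakuni's actual code:

```python
# Sketch of the eventually-consistent intake the panel describes: accept
# the donation, enqueue it, acknowledge immediately; a worker persists to
# MongoDB later. If MongoDB is down, money is still taken and a Redis-style
# counter still tracks the running total. All names are hypothetical.
import json
import queue

donation_queue = queue.Queue()   # stand-in for a real message broker
redis_total = 0                  # stand-in for a Redis INCRBY counter

def accept_donation(amount_pence: int, card_token: str) -> dict:
    """Charge the card, enqueue the record, ack the donor. No MongoDB here."""
    global redis_total
    # charge_card(card_token, amount_pence)  # payment provider call, omitted
    donation_queue.put(json.dumps({"amount": amount_pence, "token": card_token}))
    redis_total += amount_pence  # reporting survives even if Mongo is gone
    return {"status": "accepted"}

def persistence_worker() -> None:
    """Drains the queue into MongoDB. If Mongo is down, records simply wait."""
    while not donation_queue.empty():
        record = json.loads(donation_queue.get())
        # mongo.donations.insert_one(record)  # retried until Mongo is healthy
        print("persisted:", record)

accept_donation(500, "tok_example")
persistence_worker()
print("running total (pence):", redis_total)
```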
That lack of clarity led to some really interesting challenges. And I'd like to call out something you said on stage earlier this morning, when you were talking about communication and about teams working together. A really big plus point that came out of all the exercises we were running was developing trust: developing the feeling that even though we were, in a way, three separate teams with three separate responsibilities, we were all working together to achieve that goal. Once you build that trust and psychological safety between the teams, and that confidence in each other's abilities, that's really important, because under pressure, if you don't have those things, it can very quickly go wrong.

So one of the challenges that you've all mentioned is that there was this fear of responsibility amongst the humans, and a kind of lack of confidence. I'm wondering, are there any other challenges that stood out to you in this experience?

This goes back to the fear thing again a little bit. One of the other things is that Armakuni know their platform and have been doing this for many years, as Xenon just mentioned. Our team is an operations team, but the PKS team was a brand-new development team, and they didn't necessarily have a concept of support; they didn't have a concept of having to dial in somewhere, and so on. So it was really important to get their confidence up, because even though they were the experts, they were not necessarily the experts while they were in the trenches, and it was important to build up that confidence with them.

And earlier we were talking, and one thing that you mentioned was having access to the right things. Can you expand upon that a little bit?

Interestingly, our director, David Lang, has a rich history in operations, and his gem of understanding is that usually, in incident management, the biggest issue is getting access to the right thing. Once you've done that, you're in your comfortable place: you understand the technology and you can get going. For us, the setup was quite intriguing. We had the VPN to protect access to the two production regions, so you need the software and the credentials to log into the VPN. Then you need to know which region you're talking about: region A or region B of production? Then, what are the credentials to get into PKS? What are the credentials to get to MongoDB, and which one of the MongoDBs? So it's not so much looking at the software that's failing; it's just getting to it. And you don't really want to be learning that under pressure, when everybody's looking at your software on TV on the BBC, and it's two o'clock in the morning, and you don't know who to ping for a password.

Probably the thing that built the most confidence in the various teams was the fire drills. Generally what we did is we had a fire drill, with a toy example, and we said: okay, Armakuni, you send us an email to this specific reserved email address; it's going to page us, the operations team; we are going to do a mock triage: okay, this looks like it's a Redis issue, or this looks like a blah issue; we're going to call the people who are the subject-matter experts for that and bring them into this Zoom call where we're going to chat. But ultimately, we're going to practice.
So I'm going to say: Mr PKS person, can you get me the last log entry on cluster X in region Y? And I'll take my hands off; I'm not going to help you. The point of the exercise is: do you know how to get there? And if not, let's make sure that the software is installed on the machine that you're going to be using that night. Basically, I want to make it routine, so that if anything does happen, you've done this so many times that you know exactly what to get and who to ask, and you have the passwords. Some of the things that surfaced in these fire drills were as silly as having the wrong version of a CLI, which could have blocked you, or not realising that you didn't have the VPN set up on your personal machine at home, which could have blocked you. That's not really difficult to fix, but under pressure it's a lot of stress that can be avoided. So the fire drills were basically just housekeeping: can you get there?

The other thing with the fire drills is that there was a general progression. You start off with your own workstation that you use every day, and you think, oh, that's kind of pointless, because I do this every day. But then, how about doing it from your laptop? Oh, hang on, I don't actually have the right version of Tunnelblick. Okay, how about doing it from your laptop on wireless in Starbucks? Oh, hang on, we need a firewall rule now. And so on. This progression meant we were able to head off these niggly little problems that, as we keep stressing, you don't want to find out about when the world's on fire.
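Those housekeeping checks, the right CLI present, the VPN reachable, credentials on the machine, are exactly what a preflight script can verify before the night. A hedged sketch; the hostnames, paths, and CLI choices below are assumptions, not the panel's real setup:

```python
# preflight.py - "can you get there?" housekeeping from the fire drills:
# CLIs installed, VPN endpoint reachable, credentials on disk.
# Hostnames, paths, and CLI names below are all hypothetical.
import shutil
import socket
from pathlib import Path

def check_cli(name: str) -> bool:
    """Is the CLI installed at all? (Version pinning left as an exercise.)"""
    path = shutil.which(name)
    print(f"{name}: {'found at ' + path if path else 'MISSING'}")
    return path is not None

def check_vpn(host: str, port: int = 443) -> bool:
    """Can we reach the VPN endpoint from this machine, on this network?"""
    try:
        with socket.create_connection((host, port), timeout=5):
            print(f"vpn endpoint {host}:{port}: reachable")
            return True
    except OSError:
        print(f"vpn endpoint {host}:{port}: UNREACHABLE")
        return False

def check_credentials(path: str) -> bool:
    """Are the credentials for this region actually on this laptop?"""
    ok = Path(path).expanduser().exists()
    print(f"credentials {path}: {'present' if ok else 'MISSING'}")
    return ok

if __name__ == "__main__":
    results = [
        check_cli("kubectl"),                      # needed for the PKS clusters
        check_cli("cf"),                           # needed for the PCF foundations
        check_vpn("vpn.region-a.example.org"),     # hypothetical region A endpoint
        check_credentials("~/.secrets/region-a"),  # hypothetical credential path
    ]
    print("PREFLIGHT", "PASSED" if all(results) else "FAILED")
```

Running something like this from the workstation, then the laptop, then the laptop on café wifi mirrors the drill progression James describes.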
For me, what was really enlightening: we're great believers in the concept of observability within our software, being able to know what's going on from the outside simply by looking at the outputs it's producing. So what was really interesting for us was being able to go through a process. We run fire drills, kind of game days, quite a lot, and we have days where the role of one person is to go around the platform doing a bit of chaos engineering: taking things out of the loop, knocking out a Redis, seeing what happens, and getting the understanding, under fake pressure rather than real pressure, of: okay, when we spot this thing in the app, when we spot that we're getting loads of 403s from one of the payment service providers, it's showing this, and it's appearing like this for the platform team, and this is what it means (there's a sketch of that kind of check after this exchange). Going through those scenarios, you get a shared understanding of what a problem looks like, what a possible problem will look like for us as an app development team, for the platform team, and for the PKS team. You really practice that, and then you practice the loop between us all of how we communicate things. Rather than sending the platform team a snippet of a log that might mean nothing to them at all, it's: what information is of use to you, to help you identify what the issue is? So really, the term that comes to mind is a kind of human glue, gluing everything together. Not that I want to melt people down, but you know.

And I think you've called out the third of the challenges that you mentioned earlier, which is effectively getting rid of the silos between people. So it sounds like the three things were: making sure that there weren't silos; making sure that people are actually ready to walk out the door when we say we're leaving; and the third one, making sure that we're mitigating the fear that people have. So then, out of that, what do you think were ultimately the biggest learnings that you took from this?

I think, for me: no one team can be expected to be the experts, confidently able to troubleshoot these very diverse areas of software. My team, because of the amount of time we spend on PCF, are experts in troubleshooting PCF; we know that. The PKS team, having built that entire toolset, are experts in where to start looking and what the symptoms are. Armakuni know exactly what MongoDB is being used for, what queries they're running, all that jazz. No one team could be credibly expected to hold that hot potato. So we had to collaborate, and make it explicit, and practice it: the three teams together, with experts from all three, would be available on call. We were actually in the room with Armakuni in London, but everybody was available and two seconds away; we were all going to be on hand. And just the fact that everybody knew that someone who is an expert in that particular area can get me to where I need to be under pressure; so if there's a PKS thing somewhere, and on it is a MongoDB, and I've got to get help there, then there's a MongoDB person in the room and there's a platform person in the room. Just the fact that it was shared was a massive thing: shared, and practiced.
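Earlier, Xenon mentioned game days where the tell-tale was loads of 403s from a payment service provider. A minimal sketch of the kind of sliding-window check that turns that shared understanding into an alert; the window size, threshold, and wiring are hypothetical:

```python
# Sketch of a simple error-rate check: count recent payment-service-provider
# responses in a sliding window and flag when 403s exceed a threshold.
# Window size and threshold are hypothetical.
from collections import deque
import time

WINDOW_SECONDS = 60
THRESHOLD = 0.10  # alert if more than 10% of recent PSP responses are 403s

responses = deque()  # (timestamp, status_code) pairs

def record_response(status_code: int) -> None:
    now = time.time()
    responses.append((now, status_code))
    # drop anything older than the window
    while responses and responses[0][0] < now - WINDOW_SECONDS:
        responses.popleft()

def forbidden_rate() -> float:
    if not responses:
        return 0.0
    forbidden = sum(1 for _, code in responses if code == 403)
    return forbidden / len(responses)

def check_and_alert() -> None:
    rate = forbidden_rate()
    if rate > THRESHOLD:
        # in the real loop this would page, carrying the context the
        # platform team actually needs, not a bare log snippet
        print(f"ALERT: {rate:.0%} of PSP responses are 403s in the last minute")

# simulate a game day where someone breaks the PSP credentials
for code in [200, 200, 403, 403, 403, 403, 200]:
    record_response(code)
check_and_alert()
```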
Yeah, and I'd call out in particular that there's something here that sounds like it's against the whole principle of DevOps and shared ownership. It's the idea that, within that culture of shared ownership and collaboration and working together to achieve the goal, you have a really clear, explicit separation of duty, so that everyone is really sure about what their little space is. Not "their little space" in the sense of sticking their elbows out to protect it and not letting anyone know what's going on inside it, but more: I know what I have to do; I know what I'm responsible for; and when something moves outside of that space, I know who I need to talk to, and specifically how they want to be talked to, under pressure, in this moment: what information they need from me in order for them to be able to do their part of the magic circle, if you want to call it that. And then just practice, practice, practice. You can't beat it.

For me, as Xenon mentioned, it was the building of trust. The teams only worked together once a year, for a few months, with various changes in between. But going through this process, we all learned about each other and we built up that trust. So we weren't going, oh, I hope the Armakuni guys really have their stuff together tonight; we had worked with them, and we had that trust. But the other thing is that individuals also learned to trust themselves: this wasn't an overwhelming thing; I've done this 20 times, under lots of different circumstances, so I can trust that I will be able to competently do my job if something bad happens under this high pressure. And that means a really positive outcome.

I think it's amazing: we know that we want to have confidence in our technical solutions, but we often don't reflect on the fact that we need to have confidence in our human solutions, and you need both to have a fully comprehensive solution. So that's a really interesting point to make. And I think that's part of the XP manifesto as well: the person over the process, and focusing on that. Yeah, absolutely.

So, I do want to make sure that we have time for questions, and we have a few minutes left. Are there any questions in the audience for the panel? Anyone?

What actually happened on the night? Did anything go wrong? Did you have to use any of these processes that you put in place?

It was really boring.

Yeah. It's what we like. We like boring.

The first year that I ever did this was with the old platform, and it was singularly the most exciting evening of my life from a work perspective, but it's not one I'd ever want to relive. Lots of things went wrong; lots of things just weren't tried; lots of things under pressure that we had to resolve then and there with a new solution, which is a place I'd never want to be again. So I learned my lesson that night, and with the new platform it's become increasingly boring.

So the context of this is a short event of very high importance. I'm curious as to how that influences the team: the difference between dealing with millions of pounds in a few hours and the more continuous nature of normal operations. Because it sounds like there was also a lot of build-up and prep
that matched, not the urgency exactly, but the care and importance and relevance of this short event.

So, within our Cloud Ops team, we do the same sort of thing nearly the whole time, especially when we're onboarding new people. We start them off with getting a page to do something really silly or trivial, like target a particular BOSH deployment (there's a sketch of that sequence below). We run these drills and people get pretty confident, and then we give them a problem which is unknown to them, and they use all these things. They always talk about muscle memory: they've developed so much muscle memory that when they see this unknown, they're automatically setting up the VPN, targeting the BOSH deployment, checking the VM health, and then they suddenly realise: oh, hang on, I've almost got this problem solved, and I still don't know what it is. And that's pretty important.

And we do use it for training. In Dublin, at Pivotal, at the moment, for some of the products that we run or are developing, be that CFCR, which is open-source Kubernetes running on BOSH, or PKS, there's an understanding that members of those teams will at some point need to go onto the pager; they will have to be the expert team in this for our customers. And these are engineers that have never really been in any kind of operational activity, and it's very frightening for them. Yep, and legitimately so. So we used what we used with Armakuni to train those teams not to be scared of the pager: just practice. The fear of the unknown is probably the biggest thing here, so if I can just get you past that, then you'll feel more comfortable. And I think in a microservices, DevOps world, that's probably going to become more routine.

Yeah, and for me: I'm sure Debbie and James won't want to be sat in a room with me, watching like it's that night, every day, and I don't think anyone could possibly have a team just sit there looking at graphs and tailing logs to see when the errors are coming up. But you can automate that aspect of it. And the more that you practice your response as a team, to getting that page, to getting that email about an outage or an issue, the better and more slick your team is going to become, and again, your confidence will increase. And then, once you start getting that sort of stuff as routine, I think you can start pushing out as an operations team, saying: what else can I do to ensure the reliability and resilience of the platforms that are under my control? Can I start moving into a bit of chaos engineering? Can I start actively looking to bring my system down? And then you start going at a really good pace and really thinking about resilience engineering.

One thing, just to that point: in our team, we prioritise making changes to the production platform during office hours, so that everybody's awake, caffeinated, and at their desk. And the beauty of it is that when you are in your comfort zone, you have the support of the people around you.
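The muscle-memory sequence James describes, target the BOSH environment, then check VM health, is short enough to drill from a script. A sketch using standard BOSH CLI verbs; the environment alias and deployment name are hypothetical:

```python
# drill.py - walk the "muscle memory" steps: target the BOSH environment,
# list deployments, check VM health. The environment alias and deployment
# name are hypothetical; the bosh CLI verbs themselves are standard.
import subprocess

BOSH_ENV = "prod-region-a"   # hypothetical pre-configured environment alias
DEPLOYMENT = "donations"     # hypothetical deployment name

def run(*args: str) -> None:
    cmd = ["bosh", "-e", BOSH_ENV, *args]
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run("deployments")                          # step 1: what is running here?
    run("-d", DEPLOYMENT, "vms", "--vitals")    # step 2: VM health at a glance
    run("-d", DEPLOYMENT, "instances", "--ps")  # step 3: per-process state
```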
Figuring out an incident like that is actually quite thrilling. It's like, I have no idea what's going on; it's this Inspector Gadget kind of moment, and it's fun. There's actually an adrenaline bump in it when you sort it out. So if you can do it during office hours, it's actually quite interesting. Fire drills like that take the boring tailing-logs thing and turn it into a pretty cool exercise to play with. We've got a sign up in our office; I think it's something along the lines of "don't panic": if you only panic when the failure happens, you won't enjoy it. And that's something that we try to push.

How did you plan capacity, and what was plan B if there was too much traffic or data?

So we built the platform as individual shards, as we call them. Each instance, each foundation with the platform running on it, could take 500 donations a second, and we've got two of those, so together they could actually take a thousand. I read a figure that at any point, the peak of all credit-card transactions across the whole of the UK is about 450 a second. So if that had happened, our finance director would have been over the moon, thrilled. But essentially, the platform is designed to cope with failure at every level. You can lose front ends; it's entirely stateless; it's eventually consistent. You can lose the whole MongoDB, and it just makes reporting a little bit more difficult, because you have to pull the numbers straight out of Redis, but you can do that. You can lose an individual Redis; you can lose multiple Redises; you can lose a whole foundation, and the other foundation will still work and cope. So you're designing for failure. But ultimately, if everything failed and it all went down, there's also a backup system that another tech partner provides, just in case. And we did do a lot of load testing and stuff like that beforehand as well.

Yeah. There's an automated overnight performance test, not the full thing, which runs every night and lets you know if there's been any degradation in service. And then, in the run-up, probably three or four times a week, we were running load tests against it.

So we are at time, but before we leave, the last question I have is: what's the one thing that you want people to leave with? Twitter-worthy, so keep it short.

Okay, keep it short.

So what's the one thing that each of you would want the audience to leave with today?

Failure will happen.

That is epic. It's like 180 characters, I'm sure.

Practice builds trust.

Practice builds trust. Yeah.

Remember, Twitter...

Okay, Twitter takes more than 180 characters now. Identify your fears, and then overcome them with practice. So just don't dread this thing. I'm over 180. Don't dread it: identify it, acknowledge it, and be conscious of the people on the other side of the request that you're making.

Okay, well, thank you everyone, and thank you, panel. I should have been clapping myself.