Hello, welcome everybody. Thank you for coming. I know we're the last talk on the last day and people are kind of tired, so we're going to have a lot of fun in this talk. We're going to have some energy. I'm not going to dance or sing; neither of those things is going to happen. I'll do some quick introductions, starting with myself. My name is Paula Kennedy. I am the director of the Pivotal Cloud Foundry Solutions Team at Pivotal, based in London. There are lots of my team here, which is nice to see. I'll let the panelists introduce themselves, starting with Deborah.

Howdy. I am Deborah Wood, the product manager for the cloud operations team in Dublin. We run the platform for Pivotal Tracker in production, and for the Sport Relief campaign we supplied the platform for the donations web application that receives all the donations.

Hi, everyone. I'm Shane Houston, engineering manager with Pivotal, also on the CloudOps team in Dublin.

Cool. I'm Xenon Hannick. I work for Armakuni, where I offer consultancy specializing in cloud native transformation. I'm also the COO, so I do all of the crap.

And I'm Caroline Rennie. I'm the product lead at Comic Relief, so we're sort of the client of all these great people.

Okay, thank you, everybody. I'm going to start with a question for Caroline. You mentioned you work for Comic Relief. Could you tell us a little bit about what Comic Relief is, and give us some detail on the problem you were trying to solve?

Yeah. Comic Relief is a major UK charity. We raise money and then distribute it around the world to different projects, primarily in sub-Saharan Africa and the UK, but in other places too. We share the learnings from the different projects that we fund, because a girl in a gang in London has very similar experiences to people in a similar situation in South Africa, so we're able to spread the learnings we get from giving this money out across the world. But first we have to give them the money. So we have a major fundraising event every year, either Sport Relief or Red Nose Day, broadcast on BBC One. It's seven hours long and it raises about 90% of our public donations. So we have quite a low risk appetite at Comic Relief, because even if you've got a provider who's promising you five nines, five minutes of outage after a really effective appeal film loses us millions of pounds. That was our problem: we need a really highly resilient platform capable of taking huge amounts of traffic. We build capacity for up to about 500 donations per second, and it needs to work within that seven-hour period; if it doesn't, we will not be able to fund fantastic projects around the world. So obviously, as a good product and development team, we outsourced it for a while. That's where Armakuni came in.

Yeah, so we originally rewrote the platform seven years ago. It's a microservices-based architecture: 26 different microservices.
It uses a stateless, eventually consistent pattern, so we can take the money even if certain elements of the platform fail, and we've been doing it for a number of years. Originally we built it on open source Cloud Foundry and ran it across a private vSphere environment and AWS, then ran it across just AWS for a couple of years, and then partnered with Pivotal over the last couple of years, running it across PCF, and made some changes this year.

You mentioned you made some changes. Shane, do you want to talk a little bit about what the actual implementation looked like this year?

Sure. As Xenon said, last year the applications were hosted across a mixture of AWS and GCP, and this year we decided to focus entirely on GCP for the night of TV. This was partially because we'd built up a lot of experience between last year's Comic Relief event and migrating Pivotal Tracker onto GCP the year before that. For this year we had three foundations running across two regions on GCP. One of those foundations was a pre-prod environment where we could test a few things out before pushing to production, and then we had two production environments.

Okay, and what else was different this year? Because you've done it a few times, right?

We have, yeah. We used it as an opportunity; it was a bit of an experiment for us as operations. We wanted to allow internal Pivotal teams to directly interact with the users of their products. Typically, feedback going into Pivotal products like Redis, or in this instance PKS, which was one of our newly released products, doesn't come through a direct relationship. We wanted our product teams to talk directly to the app developers who are the users, to build empathy and to understand the use cases. So we allowed the Armakuni engineers direct access to those teams, and they were also involved in fire drills to practice incident management. That was different.

Okay, so you mentioned PKS, which is new for Pivotal and something I want to hear a bit more about. Shane, do you want to tell us a little more about that?

Sure. I'm sure by now everyone is probably familiar with the concept of PKS, but it's the Pivotal Container Service, which provides Kubernetes clusters managed by BOSH. At the time of the Sport Relief event, PKS was a fairly new product; as far as I'm aware, we were the first place using it in production. We were using it for this event to host instances of MongoDB. These instances had existed previously (they were used last year as well), but there was a difference in the setup between last year and this year. We had been using a BOSH deployment for Mongo previously, but we'd needed to tweak it in a number of ways in order to use it across multiple AZs, and when we came back to it this year we found that the deployment hadn't been updated in a very long time and we weren't confident using it. So PKS provided us a very good use case for hosting this workload.
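The panel doesn't show code, but the intake pattern Xenon describes (stateless web tier, queue in front of the datastore, eventual consistency) might look roughly like this minimal Python sketch, assuming Redis as the queue; all names here are hypothetical:

```python
# Sketch of the queue-backed intake pattern described above (hypothetical
# names throughout): the web tier only validates and enqueues, so donations
# are captured even when downstream services (e.g. MongoDB) are unavailable.
import json
import time
import uuid

import redis  # redis-py

queue = redis.Redis(host="localhost", port=6379)

def accept_donation(amount_pence: int, payment_token: str) -> str:
    """Validate, enqueue, and acknowledge; no synchronous DB write."""
    donation = {
        "id": str(uuid.uuid4()),
        "amount_pence": amount_pence,
        "payment_token": payment_token,
        "received_at": time.time(),
    }
    # LPUSH is atomic, so a pool of stateless app instances can all enqueue.
    queue.lpush("donations:incoming", json.dumps(donation))
    return donation["id"]  # caller gets an ID immediately (eventual consistency)

def drain_worker() -> None:
    """Separate process: moves donations from the queue into durable storage."""
    while True:
        _, raw = queue.brpop("donations:incoming")
        donation = json.loads(raw)
        persist(donation)  # e.g. insert into MongoDB; retried if it fails

def persist(donation: dict) -> None:
    ...  # placeholder for the MongoDB write
```

If the datastore is unavailable, donations simply accumulate in the queue, which is exactly the failure mode the panel describes later.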
Okay, and so Caroline, you're obviously a customer of all of this development effort. How did it go?

Yeah, we got all the money, which is great. It's also that we had the confidence. We don't just receive online donations; we also get text donations and people still calling up. And we had the confidence that if something changed in the system, we knew the teams had worked really closely together to make sure the monitoring was coming back. You know, if there's a two-minute gap, that's a problem. And we knew the communications between Armakuni and Pivotal were great, so that if there was a problem it would be fixed super quick. There wasn't a problem, but we were very, very comfortable. I know Xenon keeps using the term that he was really bored, but as it's an entertainment show, he was very entertained.

"Not having to think about work" is how I'd phrase it. I actually got to watch the TV show, which you don't usually get to do in a high-profile event like this. So from our perspective, it just worked. It's quite interesting, because originally we had a single team that developed the app and also supported Cloud Foundry, so they supported the platform and the app, and that worked really well for us. Then, when we partnered with Pivotal, that changed: Pivotal became the platform support team and we were the development team for the apps themselves, which created a kind of separation of duties. And this year, with the CloudOps team in Dublin supporting the platform but also the PKS team involved, it created a kind of triangle that we needed to cover in terms of communications and fire drills, essentially building up trust between the teams.

It worked fine on the night. We believe greatly in observability for all of our platforms and all of our apps, and by working really hard before the night we made sure we all knew what was going on in the platform. We all had real clarity about what issues look like when they arise, what they might mean, and what actions we need to take off the back of that. So it was a very successful evening.
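Caroline's "two minute gap" criterion is easy to mechanize as a heartbeat alarm; a short sketch, with a hypothetical threshold and a hypothetical paging hook:

```python
# Sketch of a "gap" alarm: page a human if no successful donation has been
# recorded for two minutes during the broadcast window. Names are hypothetical.
import time

GAP_THRESHOLD_SECONDS = 120  # "a two minute gap is a problem"

last_success = time.time()

def record_success() -> None:
    global last_success
    last_success = time.time()

def check_gap() -> None:
    """Run periodically; pages only when the gap threshold is exceeded."""
    gap = time.time() - last_success
    if gap > GAP_THRESHOLD_SECONDS:
        page_on_call(f"No successful donation for {gap:.0f}s")

def page_on_call(message: str) -> None:
    ...  # hypothetical hand-off to PagerDuty, Slack, etc.
```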
So, digging a little more into the support that the CloudOps team provided and that platform: Shane described having PKS and PAS combined together in that platform. As the apps team, did you notice any difference? Could you see that there were multiple pieces involved in the platform?

No, there was no difference, really. There's an endpoint that we call for our Mongo calls, and they just go through. In the original setup there was some small configuration that needed to be done on the GCP load balancer to enable traffic to flow through at a higher rate, but that was picked up within the first load test that we did together, the change was made, and after that it was completely transparent to us. There was nothing there that we needed to know. And for me, that's been the promise of Cloud Foundry: it has stayed the same for us. We've made one change in the seven years that we've used Cloud Foundry; I think one year, when we went to Cloud Foundry 2.0, one of the namespaces changed or something. Aside from that, it's remained entirely consistent, and you can just trust the platform to do what it's meant to do. What was quite interesting for us is that we expected much more effort on our side. We genuinely expected: right, it's Mongo, but it's running on PKS, there's going to be something we need to do differently, we're going to have to route something differently, we're going to have to mess around with some configuration on our side. But we didn't; it just worked. So it was very interesting for us, as the development and engineering team, how low friction it was to support that transition to PKS.

And Shane, how was the experience of using PKS? Because you mentioned you were one of the early people to use it in production.

It was very straightforward. We did manage to pair with the PKS team in order to get some more experience, because on CloudOps we had never used Kubernetes before and we'd never used PKS before, so we did get some information from those folks. But it was very straightforward. And to touch back on Xenon's point, it's very much the intention of PKS that these things will be perfectly transparent to the end user, and for us and the PKS team it was very validating to actually test that out in a production scenario and see that the workload was working as expected.
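The transcript doesn't say how the apps resolve that Mongo endpoint, but on Cloud Foundry the conventional route is a service binding surfaced through the VCAP_SERVICES environment variable, which is what would make a move from BOSH VMs to PKS invisible to the app. A sketch, assuming a user-provided service with a hypothetical name:

```python
# Sketch: a CF app resolves its MongoDB endpoint from the VCAP_SERVICES
# environment variable, so *where* Mongo actually runs (BOSH VM, PKS /
# Kubernetes) is invisible to the application. The binding is hypothetical.
import json
import os

from pymongo import MongoClient

def mongo_client() -> MongoClient:
    services = json.loads(os.environ["VCAP_SERVICES"])
    # e.g. a user-provided service whose credentials carry the connection URI
    binding = services["user-provided"][0]
    uri = binding["credentials"]["uri"]
    return MongoClient(uri)
```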
Okay. So Caroline and Xenon said it went okay on the night. How was it on the night for you two?

Delightfully boring. We had a war room set up, with both of us monitoring our platform's behavior directly. We used one of our products called Healthwatch, partly out of curiosity value; it's got all sorts of interesting graphs, and it was a newish product at the time. And we had Armakuni's monitoring, which showed the number of concurrent users at any particular time, and you could see the number go up and down based on the videos being shown. So it was quite fun to look at Healthwatch and say: yes, the requests have gone up, but everything's still green, so I'm happy. It was actually very interesting to see how the two teams were monitoring everything, but there was nothing dramatic, no panic situations.

Okay, that sounds good, but it also sounds like a lot of sitting staring at monitors. Is that usual for your team?

It's definitely an anti-pattern, I would say. We were doing it on the night more out of curiosity, especially to see those peaks in usage of the apps and to see that things were being handled correctly. But in general, our approach on CloudOps is to have alerting based around this monitoring, so that we can find out there's a problem without having to stare at the screens.

Exactly. And it's often very difficult to fight the temptation of watching something like Healthwatch, seeing the graph and wanting to dive in straight away, because there's some interesting thing happening and you need to know what it was. But actually, if it doesn't affect the running apps, it really isn't that important.

I think also, for the sustainability of operations teams, you don't want to be burning people out, so it's very important to understand what the most important workflows are for your users, in terms of SLIs and SLOs: service level indicators, which are your probes, and service level objectives, which are your required performance. You don't do alerting on a CPU bump; you do alerting on "we could not serve the donations". You alert on the thing that needs a human to interact with it, to address and to fix. One of the things we care about quite deeply in my team is the human ops aspect of operations. You don't want someone up in the morning having to stare at a dashboard, stare at Healthwatch. If you don't need my attention, I'd rather be sleeping; if you do need my attention, you can page me. That's how we try to manage the attention and the energy of our staff.

Okay, interesting. I'd just like to add that Comic Relief did supply snacks, so the people who had to stare at their screens for seven hours didn't go crazy. But yeah, there was a point where it was like, you can probably stop looking now. We were staring at the BBC screen as well, entertained by that.
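Deborah's SLI/SLO distinction translates almost directly into code: the indicator is a measured success ratio, the objective is the threshold, and only an objective breach pages a human. A minimal sketch, with a hypothetical objective:

```python
# Sketch: alert on the user-facing objective ("we could not serve donations"),
# not on machine-level signals like CPU. Numbers here are hypothetical.
from dataclasses import dataclass

@dataclass
class Window:
    attempted: int   # donation attempts in the measurement window
    succeeded: int   # donations accepted in the measurement window

SLO = 0.999  # objective: 99.9% of donation attempts succeed

def sli(window: Window) -> float:
    """Service level indicator: the observed success ratio."""
    if window.attempted == 0:
        return 1.0
    return window.succeeded / window.attempted

def needs_human(window: Window) -> bool:
    """Page only when the objective is breached; a CPU bump never pages."""
    return sli(window) < SLO
```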
Okay, so I want to open it up to the audience if there are any questions.

You talked about wanting to be highly available, but you also said you'd previously worked across two cloud providers and then moved to just one. I'm wondering what your motivation was behind that, and how you ensured you kept that high availability with just the one provider.

Do you want me to take that one? I think it was partly a logistical decision around cost. While high availability is obviously super important to Comic Relief, after five years we were doing multi-region as well, so realistically we had multi-region and multi-cloud, and you start to think: at this point, if this platform has gone down, it's because the entire internet is broken. Maintaining a platform across two providers was making the platform itself a bit too complicated for what were becoming our emerging needs. And it was also about finding great tech partners to work with and being able to say: yep, we're happy with all the eggs being in this basket.

Yeah, I think it comes down to a risk assessment each year. Originally, one of the main aims of the platform was not to have a dependency on a particular technology provider. The old platform, eight-plus years ago, was built with twelve or thirteen different technology partners; I could probably reel off the names and you'd know them all, the big players. They came together, got kit from HP, stuck it in a data center, got about 35 or 40 people together, built this snowflake of an app, and hoped that it worked. And then every year we had to make an assessment of what the risks were. When we first did it, six or seven years ago, AWS was well established, but there weren't that many regions; we had to use us-east and us-west for the first time. As time went on we ran it across AWS and GCP, and we found that GCP was really solid, and every year there was exactly that extra cost of standing up a different environment with a different cloud provider, having to load test it, monitor it, make sure it's okay. So we made the decision that, practically, the risk was such that we could work with a single provider.

I think as well, from Pivotal's perspective, we'd had a year's experience of running Tracker, Pivotal Tracker, on top of GCP, and we were very confident in that setup. For Tracker in particular we only have a single foundation on GCP, and it has been working very well and as expected. So from our perspective, having these three foundations was already adding a lot of redundancy to the setup for the night of TV. And on top of that, as I mentioned, there was one pre-prod environment and two production environments, but we had the pre-prod environment set up exactly like one of the production environments, so if there was any kind of outage in one of those, or some sudden peak in demand, we could very easily redirect traffic to the pre-prod setup as well.
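A sketch of the failover option Shane describes: because the pre-prod foundation mirrored production, traffic can be redirected to it on a health-check failure. The URLs, health endpoint, and selection logic here are all hypothetical:

```python
# Sketch: route donations to the first healthy foundation. Because pre-prod
# mirrored production, it can absorb traffic during an outage or a demand
# spike. URLs and endpoint paths are hypothetical.
import requests

FOUNDATIONS = [
    "https://donate-prod-1.example.org",
    "https://donate-prod-2.example.org",
    "https://donate-preprod.example.org",  # warm standby, identical setup
]

def healthy(base_url: str) -> bool:
    try:
        return requests.get(f"{base_url}/health", timeout=2).ok
    except requests.RequestException:
        return False

def pick_foundation() -> str:
    for url in FOUNDATIONS:
        if healthy(url):
            return url
    raise RuntimeError("no healthy foundation; page everyone")
```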
Hello. I wanted to ask, and you may have explained this, but it went by a bit fast: the Mongo running in PKS, was that still connecting to the applications through a standard service broker interface? I just want to make sure there wasn't anything unusual there. Is that a fair description?

Yeah, it just connected normally, as it did before. It was the same interface as it would be if it was on PCF.

So really not a lot of change.

Yeah, no change from our side. Zero change to the application.

Zero change to the application. Just checking.

Hello. Howdy. Hey, Josh. I had a question, again about Mongo. You mentioned that the BOSH-deployed Mongo hadn't been updated, so maybe there's an implicit suggestion that the Kubernetes Mongo was more up to date; it would be interesting to know if that's true. But also, what else changed for monitoring your Mongo? Did you have to change anything around how you made sure that service was running when it was running on different infrastructure?

In terms of the Kubernetes side, I believe the MongoDB Helm charts were more up to date than the BOSH deployment, partially because there's so much attention focused on Kubernetes right now, and it is a good solution for these third-party data services that aren't necessarily cloud native by design. We didn't do any additional monitoring from the CloudOps side.

It was the same monitoring as the year before. We had the same kind of Grafana graphs reading the amount of throughput going in, the queues, all that kind of stuff. But like I said, from a risk perspective, MongoDB can go away and we'd still be collecting all of the money. It would just fall into the Redis queues, the queues would build up, and then what we can do is rerun the whole night again and repopulate MongoDB, so you can get back the state at any point in time. We've never had to do that, thankfully, but it is something that we have in our back pocket. It makes reporting much harder, though, because the reporting comes off MongoDB. Every minute we're calling a dashboard, and the people watching the TV show and the finance team are monitoring that dashboard, which tells them how much money has come in. When we're giving out totals on the BBC TV show, they have to be validated and checked for compliance by our finance team, so we have numerous checks that we're not giving the wrong number to the BBC, because that would cause them a massive compliance problem.

I think there are certain levels of assumed failure rates as well. While all of our systems are saying, yes, this money definitely exists, you don't want somebody to turn around in two weeks' time and say "credit card fraud", or "I only meant to put one zero" when they put three. So there are a few other things just to make sure, because we've got very tight compliance rules with the BBC, which is great fun to work with.
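Xenon's "rerun the whole night" option is a replay-from-queue pattern: MongoDB is treated as a rebuildable read model over a retained event log. A sketch, reusing the hypothetical names from the intake sketch above and assuming the intake side also kept an append-only copy of every event:

```python
# Sketch: rebuild MongoDB's state by replaying retained donation events.
# Assumes an append-only copy of every event was kept (here a Redis list
# named "donations:log"; the name and connection details are hypothetical).
import json

import redis
from pymongo import MongoClient

queue = redis.Redis(host="localhost", port=6379)
db = MongoClient("mongodb://localhost:27017")["relief"]

def replay_all() -> int:
    """Repopulate the donations collection from the event log."""
    replayed = 0
    for raw in queue.lrange("donations:log", 0, -1):
        event = json.loads(raw)
        # idempotent upsert keyed on the donation ID, so replay is safe to rerun
        db.donations.replace_one({"_id": event["id"]}, event, upsert=True)
        replayed += 1
    return replayed
```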
I have another question, about when you were doing the risk assessment and looking at the solution for this year. Can I ask why PKS? Was it just because it was shiny and new and you wanted to try it out? Was it to solve a particular problem? What was the motivation for adding PKS into the mix?

I think from Pivotal's perspective we were certainly interested in testing it out. It had been worked on for quite some time, and this was a kind of perfect scenario: a workload that fits very nicely into the picture for PKS.

For us, a big part of the journey Comic Relief has made is to become a technology destination. You want people to say: oh yeah, they've got a great team, they're really making a difference in the world, they're doing some really cool stuff, some really interesting stuff in serverless, all that kind of thing. So part of it for us is always looking at what's out there and what we could potentially use. And it came down to a conversation with Pivotal, with people saying: this is GA now, we've got great confidence in it, we'd like to use it; then assessing what the risks were and trialing it by getting it up early and working really closely with Pivotal, so we could load test it and see if it worked. If it didn't work, we could go back to spinning it up in open source CF, or in PCF, and working through the problems of that. Early, fast feedback to validate the approach really gives you assurance, and that's something the project has had all the way through. When we originally did it, two weeks after we agreed contracts with Comic Relief that we were going to build the platform, we presented the first journey. It was just three pages and didn't look very nice, but it was actually money flowing into Comic Relief's bank account. The directors at the time, the board, had taken a massive risk, so for them to be able to see progress within two weeks, to be able to see money flowing in, was very reassuring. That's an ethos we've carried all the way through: testing and validating as quickly as possible. And that just fits in with all the Pivotal practices, XP and so on. It's worked really well.

How did you do scaling? Did you plan for the maximum expected load and scale statically, or was there auto-scaling involved?

We can't auto-scale, and that's been true even from the first solution. The spikes are just so spiky. We have the seven-hour TV show, we show films that show the impact of the money, and literally from the second the film finishes and they ask people to donate, it just spikes. You can't really rely on auto-scaling, because the first instance says, oh, I'm running out of capacity, let me get another one, and that one says, oh, I'm running out, let me get another one, and suddenly you're waiting and waiting and waiting. So we have to pre-scale and pre-warm things.

Yeah, that's another nice thing about working with partners who are bought into the cause, because obviously it's more expensive to do that, but it's worth it in terms of risk and reward. It's also very difficult for us to say you're going to have this much traffic, because it's live broadcast: you don't know how many people are watching, you don't know how the films will land. So we very much err on the cautious side, because it's mission critical for us.
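Since auto-scaling reacts too slowly for spikes that hit the moment an appeal film ends, capacity is computed and applied before the show. A sketch of that pre-warm arithmetic: the 500/second figure comes from the talk, while the per-instance throughput, headroom factor, and app name are hypothetical (`cf scale APP -i N` is the standard Cloud Foundry command for setting instance counts):

```python
# Sketch: pre-warm to the worst case instead of auto-scaling. The 500/sec
# target is from the talk; per-instance throughput and headroom are
# hypothetical assumptions.
import math
import subprocess

PEAK_DONATIONS_PER_SEC = 500   # stated capacity target for the night
PER_INSTANCE_PER_SEC = 25      # hypothetical measured app throughput
HEADROOM = 2.0                 # "very much on the cautious side"

def instances_needed() -> int:
    return math.ceil(PEAK_DONATIONS_PER_SEC * HEADROOM / PER_INSTANCE_PER_SEC)

def prewarm(app: str) -> None:
    # `cf scale APP -i N` sets the instance count ahead of the broadcast
    subprocess.run(["cf", "scale", app, "-i", str(instances_needed())], check=True)

prewarm("donations-api")  # hypothetical app name
```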
Hi. It seems it was a pretty successful night last time round. What's going to change for the next time round?

We are actually taking the donation platform in house, and we are building it using AWS Lambda, serverless. So this is our breakup with Armakuni, on stage. I'm sorry; it's not you, it's us.

From my perspective, I think it's really amazing, because ultimately what we try to do, and I think what Pivotal tries to do, as responsible technical partners, is to engage with your client and get them up to speed as quickly as possible. A lot of the patterns that we have in the application, the queuing, the eventual consistency, the stateless nature of each of the calls, they've taken that learning on and now they're building it in house. It's just the nature of technology that things move on and things change. We recently went to Comic Relief and spent a couple of days with them to help validate the new application, and it was great being part of that and seeing a team really taking it on.

Yeah, and because of the annual cycle, where it's once a year, we'd been working in a way where for four to six months of the year you're working on it and then you sort of retire it, and things can get out of date. That's probably how the Mongo problem existed, she says confidently. The difference now is that we're actually changing our business model as a whole to do more year-round activity, and for us, being able to have a platform which can scale when it needs to scale and shrink back down when it needs to shrink back down, very technical terms, is perfect. And it's just been a great experience for our development team to take all these learnings, seven years' worth of learnings, go and do this chaos testing day with you guys and hear: yeah, you've covered off most things. Our team, I think, hasn't had the horror experience that Xenon had in 2011; we've never seen the platform fail, so we think, well, the platform's always fine, we can do this. So we're very much now making sure that ours doesn't fail. I'm sure anyone who's been an engineer has some scar tissue that reminds them of lessons they need to keep in mind.

I've got one question then. It sounds like you get to do a lot of experimenting and pushing the boundaries on the technology front. Have you had a chance to give back to the community, as far as lessons learned or tips and tricks? Or maybe there are other agencies out there that could use the same technology. Is that something you think about?

So we use Drupal for our main CMS, and we give back to that community. There are other platforms that we have out in the open: for example, our Gift Aid claims are also on AWS Lambda. When you text to donate, you need to click a link and give us details so we can claim Gift Aid on it. We thought, that's only useful for charities, so let's share it out. It's accessible, it's out there: github.com/comicrelief. You can go and look at our work, and you can contribute to our work if you really want to. So we try to give back to the community a lot. And in the time the platform has been in development, Comic Relief has been represented at probably every Cloud Foundry summit, so there's a lot of feeding back into the community.

Definitely. And I'd say part of our learnings was that there was understandable reluctance, just with the distributed nature of what we were doing. We had three very specialized teams: Armakuni, specialized in using MongoDB; my team, cloud operations, specialized in running a PCF stably, which is our domain and we know it inside out; and the PKS team, with their specialty of running Kubernetes. One of the earlier talks today touched on this: given that there were three very specialist teams working together to provide this platform, there's reluctance for any one of those teams to hold the hot potato of support. There was reluctance in terms of who could adequately take operational responsibility for MongoDB on the night of TV, understandably, because while Armakuni was on top of their MongoDB game, the fact that we were running it on PKS meant that while they knew how to fix MongoDB, they might not be able to get there; they might not be able to find it. So one of the lessons we learned, which was actually very helpful, was to share operational responsibility across the three teams, and we practiced that. To get aligned on who handles an incident on the night of TV, we set up fire drills. We practiced: let's get everybody on the same Zoom call, everybody on the same Slack channel, access region one of production, get the last line of the MongoDB log file, give me the number of records in this DB, all that kind of stuff, just to build confidence between the three teams.
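The drill tasks Deborah lists are deliberately simple; the value is that every team can execute them under pressure. A sketch of the record-count task, with hypothetical connection details:

```python
# Sketch of a fire-drill task: "give me the number of records in this DB".
# Trivial on purpose; the drill practices access paths, not cleverness.
from pymongo import MongoClient

def drill_count(uri: str, db_name: str, collection: str) -> int:
    client = MongoClient(uri, serverSelectionTimeoutMS=5000)
    return client[db_name][collection].count_documents({})

# e.g. run against region one of production during the drill
# (hostname, database, and collection names are hypothetical):
print(drill_count("mongodb://prod-region-1.example:27017", "relief", "donations"))
```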
And that was a very simple but very pragmatic way to make teams feel less afraid of something as high profile as a night of TV. In the IT world that we live in, with so many very different specialist niche areas, in a microservices world where you have to collaborate with teams you might not know, practicing incident management through things like fire drills can help alleviate the fear factor in operations teams. We learned that, and we've been trying to spread the gospel of fire drills within Pivotal, and to anyone who will listen. It worked very well, and it wasn't a super technical solution; it was just letting people practice logging into the right thing and getting a simple thing done, so they're not terrified.

Awesome. We're just about out of time, so I'd like to thank all of our panelists. Thank you very much. Thank you. Thank you.