Welcome to "This Is Not a Technology Problem". My name is Colin. I work for EngineerBetter. We're a consultancy in London, England. We do a lot of work in Europe, but have also done work in the States and Canada, and I was in Singapore last year. We do general consultancy around Cloud Foundry, Concourse, digital transformation, a bit of Kubernetes now, and training delivery. It's really a mixed bag.

Personally, I spend a lot of my time doing deployment, management, automation, and operations of Cloud Foundry, and helping customers do that. I've been working with Cloud Foundry for about four years now. I was originally an application developer, and then I moved into the ops side. So I mostly do delivery of dojos, health checks, and staff augmentation, and we deliver training as well.

In working with lots of different customers, I've found that the questions we get asked often fall into similar categories. Every customer has a diverse set of problems, but they all kind of distill down to similar things. So an alternative title for this talk is "Things Customers Keep Asking Me on Engagements".

In general, the broad problems that customers seem to face fall into three categories: "Cloud Foundry is difficult to operate", "Our platform is unreliable", and "Why hasn't Cloud Foundry, or any technology we've bought, made us more agile?"
My response to most of these starts off with "this is not a technology problem". So I'm going to go through common symptoms of the problems, then the causes I've seen in my experience, and then some solutions that have seemed to work at companies I've worked with.

So, without further ado, launching into it: "Cloud Foundry is difficult to operate". There is some truth to this. Cloud Foundry is fairly complicated, but it's not quite as difficult to operate as a lot of people seem to think. A lot of the time this manifests as the platform team not having enough time, working lots of overtime, falling behind on updates, and having difficulty keeping up with platform feature requests. So you've got updates to Cloud Foundry itself, and you've got updates that the people using Cloud Foundry want.

So what causes this? Well, most of the time it's a lack of automation. It could also be a lack of understanding of the existing automation, or problems with its implementation, but every customer I've worked with who has complained about the platform team working too many hours, too much overtime, and not having enough time to do things has had some kind of problem with their automation setup.

Another cause is having no dedicated platform team. It's really hard to hire people with Cloud Foundry knowledge, or pure DevOps or cloud technology knowledge, because you need such a wide range of skills. So a lot of the time I see companies pull people from different parts of the business, which is a pretty good strategy: someone from the databases team, someone from the networking team. But most of the time they still have responsibilities to whatever teams they were seconded from. I worked with one customer where the platform team was effectively 75% of one person and 50% of another, and they lived in different countries in different time zones. It didn't work quite as well as they were hoping.

Most of the time, this problem leads to the platform team being cf push as
a service. At the same company as the previous example, the development teams weren't trusted with any access to production. They weren't given cf push access, but they also weren't given access to get logs out, to look at events, to see how their apps were running, or to scale their apps. So the platform team, which was already heavily taxed, ended up getting ticket after ticket that went along the lines of: deploy this app; if it breaks, fix it; when you don't know how to fix it, because you didn't write the app, join a Webex call, funnel logs through to the developers, and then type into the console what they tell you. What does this have to do with running a reliable platform? Well, nothing. Cloud Foundry needs to be developed in its own right. It isn't something that just exists and then works forever.

The final cause is cutting corners on training. I see this a lot at the start of dojos. You have the mindset of: we really want to get going now. We've bought Cloud Foundry. We want to get going. We'll train up as we go. How hard could it be? Well, there's a thing called BOSH that you need to control it, and it has a fairly legendary learning curve. Cloud Foundry is a complex distributed system.
So you have complex tooling to manage it. Trying to learn as you go, rather than putting in the time at the start, doesn't always pay off.

But there are some things you can do to set yourself up for success. The primary one is: if you're still deploying Cloud Foundry manually, you are objectively doing it wrong, and please stop. Come talk to me afterwards, or talk to any of the other companies over in the Foundry, about how they're managing their Cloud Foundries and how they're pushing updates out. Yesterday in the keynotes, Chip mentioned that across all Cloud Foundry Foundation projects there are 137 releases a month on average. If you expect your platform team to keep up with 137 updates a month across all your foundations manually, it's not going to happen.

So effectively you need to change the focus of your team from a team that operates a platform to a team that builds tooling to operate a platform. On a lot of the teams I've been on, we're just building a Concourse pipeline that manages Cloud Foundry, and the foundation itself is a nice byproduct of the thing that the team is working on.

Another good approach is treating the platform as a software product. Consider a case where you have updates and you have production. In the world of application development, is it acceptable for your developers to take those updates and push them directly into production without testing them? Of course not. No one would ever sign off on that. But then what about making changes to the platform that the apps are running on?
What about "we need this buildpack in production now", or "we need the Diego cells to scale", or something like that? For some reason it's a lot more acceptable to push that directly into production without testing it and without running it through earlier environments. But it really shouldn't be. Applications and the platform they run on should be treated the same.

The first step to treating the platform as a software product is to have a dedicated team looking after the software product that is the platform. In the old world, you have application developers who make the apps, and you have platform operators or sysadmins who run the platform. They're separated by an organizational divide: they're in different departments or different offices, and if one side wants the other to do something, it's bureaucracy and it takes a lot of time.

Cloud Foundry allows for a high degree of self-service within development teams, where they can manage the full life cycle of their applications. I used to work for a major retailer in the UK. I was on the team that made the mobile website, and we were pushing our application to Cloud Foundry. There were five people on our team, and we took 60% of the traffic to the website. We wrote the app ourselves, we pushed the app ourselves, and we were the on-call support for the app ourselves. It was beneficial for both sides of the company. It was really empowering for us, and the company didn't need a separate team to deploy our stuff. And if you're the one who gets called at 3 in the morning because you pushed a sketchy update at 5 p.m. on Friday, you're not going to push sketchy updates at 5 p.m. on Friday. I think we had five minutes of production downtime in a year and a half.

But on the other side, you've got platform operations, and you've also got platform development. You shouldn't expect the platform to work out of the box, whether it be cf-deployment or CF...
Yeah, cf-deployment used to be cf-release. And any vendor-provided Cloud Foundry won't solve all your problems out of the box either. You need to look at what problems you're trying to solve and mold the platform to do what you want. I hear a lot of customers thinking that Cloud Foundry will reach a done state: you install it, then it's done, and you don't need to develop it anymore. That doesn't really work.

So what you want to do is talk to all the people using your platform. Figure out what matters to them and what value they're trying to gain, then take all of that and put it in an ordered backlog. Sort it by what's actually important to the people who are going to use your platform, put the most important things at the top, and have your dedicated team work on those. That way you're always working on what's most important to the people that matter.

And while your team is developing and making these things a reality, who's going to be talking to the stakeholders and making sure it's the right stuff to be working on? That's a job for a product manager. This is not a project manager who sits on the periphery and manages the day-to-day BAU stuff on the team. This is someone who sits with the team, goes to all the meetings with the team, and is part of the team full-time; who bridges the gap between the technical side and the stakeholders and customers, and makes sure that what's being developed is what actually delivers value, because that's what all of us are here for, at least hopefully. So iterate quickly, and check that you're implementing features that people actually want rather than spending lots of time on things that don't really deliver value.

Finally, it's not enough to just have a team.
The team actually has to work together. I worked with a large government agency in Europe that had created a team by pulling people from different parts of the business, but it was geographically separated. They were in three different cities across the country, and they had one half-hour meeting a week where they would talk to each other about what they were doing on the platform. The result was that one person had pretty good knowledge of BOSH and pretty good knowledge of how the platform worked, and the rest of the team didn't. That's all well and good while that person's there. But what if they go on holiday, leave the company, or get hit by a bus? You often hear this phrased as the bus factor. Well, then the platform team can't manage the platform, and all the learning that person has done needs to be done again.

You can solve this through pairing and mobbing. Pairing is two people at one computer working on one story. Before you discount it and say that won't work in our industry: I've seen this work with financial companies, large banks, energy companies, and security companies. Give it a shot. It works really well, and it means that even though someone on your team came from the network team and is your subject matter expert, if they pair with everyone else on your team, over time everyone becomes knowledgeable about networking. The same goes for everything else. Mobbing also works: three to five people on one machine. It works well for introducing new concepts. If you want to know more about mobbing, look up the video of my talk at CF Summit Europe in October.

So, going on to the next misunderstanding, because this one isn't really true: "Our platform is unreliable". Cloud Foundry is very reliable.
In fact, I've seen Cloud Foundry foundations continue to serve traffic even as the underlying IaaS and storage melt into oblivion. We had a situation where the storage on the IaaS was being broken by another team, and Cloud Foundry continued to serve application traffic without missing a beat for about four hours.

But this is how it usually manifests, and the reason people tell me this: you see downtime during upgrades, you see downtime when you're not changing anything, the same issues keep recurring, and successful staging deploys, be it of applications or of the platform itself, lead to unsuccessful production deploys.

As mentioned before, most of the time this comes from platform upgrades being tested in production or staging. From the platform team's perspective, the pre-production environment is production, because the application developers are disrupted if it goes down. I worked with a company in Europe that was running no sandbox for financial reasons and only stood up their production foundation right before they had a contractual obligation to deploy applications to it. It didn't work, and it was pretty stressful. Firewall rules were different. Some of you maybe went to the Zero to Hero training at the start of the summit. This is a really good way to go from zero to a severity zero incident as quickly as possible.

Another cause, somewhat self-explanatory: past incidents are not documented. You have an incident, you solve it, you forget about it and pretend it didn't happen. The same thing happens the next month, and someone else is on call. Maybe they remember something happening, but no one seems to know how to fix it, so you have to do the discovery all over again.

Snowflaked environments are another big one. This usually stems from automation problems, as mentioned before. You deploy your sandbox and manually tweak some things so it works. You deploy your production and manually tweak some things so it works. You deploy your app to pre-production and everything's great. All right, cool.
We deploy it to production. Something's different, and it doesn't work. Why is Cloud Foundry so unreliable? Maybe it's not Cloud Foundry.

Another one: big batch releases. I see this a lot in big companies with big governance models, where there's a fear of failure. Rather than pushing small updates through frequently, the view is that an update will cause downtime, so we should do as few updates as possible, put a month's worth of changes together, and just deploy that. That's a really good way to guarantee there's going to be a problem with the deploy. Especially on the operations side of Cloud Foundry, rollbacks aren't really a thing. If you get partway through a platform upgrade, it ranges from challenging to almost impossible to roll back to where you were initially, so the general approach is to fix forward. If you have a month's worth of changes and thousands of lines of code in an update, it's going to be really hard to figure out what broke in order to fix forward. If you made one change and pushed it through, it's almost trivial to figure out how to fix it.

So, solutions. Have a sandbox environment. Updates in complex systems like Cloud Foundry will break on occasion. You can't avoid it. So make it so the first time you try something is in an environment where failures have no impact on users. Give the platform team somewhere to test changes before they put them somewhere where it actually matters.

With your sandbox environment you can then set up a system of environment promotion and continuous delivery. You take your automation and your sandbox environment, you deploy using a certain set of inputs, and then you test that it's good to go and fit for purpose. Then you take the same set of inputs, use the same automation, deploy to pre-production, and test that it's fit for purpose. Now you've deployed twice, it's been fit for purpose both times, and you deployed with the same mechanism, so you should be fully confident.
It's going to work in production. So you deploy it, and you test that it's fit for purpose. Ideally the arrows in this diagram are automatic and there are no manual gates, because if you get to production and it doesn't work, it means you need to fix your tests. If this isn't possible for you, at least currently, for some governance reason, or because it's too much change too quickly, see if you can automate some of the governance or the overhead that's making these changes slow.

Next up, we have: document the causes of and solutions to incidents. You have a problem in production. Fix it as quickly as possible, sure. But then sit down and look at what caused it and at the solution you implemented. When you're looking at causes, make sure you're looking at it in terms of process and not in terms of personal failure. Say there was a problem because someone accidentally deleted the production database. The problem isn't that that person deleted the production database. The problem is that it's possible for someone to accidentally delete the production database. So look at fixing that, and document it.

Once you've documented it, write some tests for it. These can be automated tests in your deployment pipeline, and also game days. One of my colleagues ran a game day for the UK government last week, where they took a foundation that no one was using, intentionally broke it in devious ways, and had the operations team try to figure out how to fix it as if it were a production incident. I also worked with a large bank where we developed a test suite that ran after every deploy with their automation. It checked every failure case they'd ever had with Cloud Foundry in the last three years. So this caught regressions in the product.
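As a rough illustration of the two ideas above (promoting the same inputs through each environment with the same automation, and re-checking every documented failure case after each deploy), here is a minimal Python sketch. It is not the tooling used at any of the customers mentioned. The `deploy` function, the `IncidentCheck` and `PromotionResult` types, the input keys, and the two example incidents are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Inputs = Dict[str, str]


@dataclass
class IncidentCheck:
    """One documented past incident, turned into a repeatable check."""
    incident: str                       # short description of what went wrong
    passes: Callable[[Inputs], bool]    # True when the failure is absent


@dataclass
class PromotionResult:
    promoted: List[str] = field(default_factory=list)
    failed_environment: Optional[str] = None
    failures: List[str] = field(default_factory=list)


def deploy(environment: str, inputs: Inputs, state: Dict[str, Inputs]) -> None:
    """Hypothetical stand-in for the real automation: records what was deployed."""
    state[environment] = dict(inputs)


def promote(environments: List[str], inputs: Inputs,
            checks: List[IncidentCheck], state: Dict[str, Inputs]) -> PromotionResult:
    """Deploy the same inputs to each environment in order.

    After every deploy, run every incident check. Stop at the first
    environment where any check fails, so later environments are untouched.
    """
    result = PromotionResult()
    for environment in environments:
        deploy(environment, inputs, state)
        failures = [c.incident for c in checks if not c.passes(state[environment])]
        if failures:
            result.failed_environment = environment
            result.failures = failures
            return result
        result.promoted.append(environment)
    return result


# Hypothetical examples of past incidents captured as checks.
CHECKS = [
    IncidentCheck("no release version was pinned",
                  lambda d: bool(d.get("release"))),
    IncidentCheck("a single router instance caused downtime during an upgrade",
                  lambda d: int(d.get("router_instances", "0")) >= 2),
]

if __name__ == "__main__":
    environments = ["sandbox", "pre-production", "production"]
    inputs = {"release": "example-1", "router_instances": "2"}
    print(promote(environments, inputs, CHECKS, {}))
```

The design choice worth noting is that `promote` has no per-environment branches: the only thing that varies is the environment name, so a failure in the sandbox stops the change before it reaches anything users depend on.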
It caught regressions in our code. It caught underlying IaaS problems. And it was really valuable. The crux of it is that failure-free operation requires experience with failure. So expose your team to failure rather than trying to shield them from it, because that will improve their ability to handle failure when it occurs, because it will occur.

This brings me to the last section, which is a little bit more nebulous. It's not a clear-cut thing. Most of the marketing material for things like Cloud Foundry and the vendor versions of it talks a lot about speeding up workflows and transforming your business. Cloud Foundry absolutely does this, and all the evidence you need is in the other talks and the keynotes. But it can't solve everything by itself. A lot of the time I hear: we've deployed Cloud Foundry, but our releases are still big, our deployment processes are still slow, feature life cycles are still measured in months, applications haven't moved over to it, and many other things like this.
The crux of it is that digital transformation requires total organizational commitment. If the mindset of the organization as a whole doesn't change, the transformation won't stick. In Abby's keynote on Tuesday, she mentioned that technology is the easy part and change is the hard part. So the real underlying problem here is changing from "this is the way we've always done it" to "this is what we need to do to succeed".

So I'll leave you with a quote from Chip Childers from a recent Computer Weekly article: ultimately, you can't buy DevOps. Adopting a DevOps approach, or an agile digital transformation approach, is all about changing culture and choosing the technology, the tools, and the products that make sense given whatever you're trying to achieve at your company. Spend all of the time on choosing the right approach, rather than latching on to whatever cool technology you last saw on Hacker News or in the news. The tools and technologies that you need to achieve transformation will become apparent once everyone's oriented towards the same goal. And with that, does anyone have questions?

Yep. Yeah. I don't know if the mic picked that up, but the point is that total organizational commitment requires more than just the IT teams. And yeah, listening to some of the keynotes today, on the parts that I caught, it's the company as a whole that needs to embrace it. I was working with a bank recently where they were trying to operate in a really agile way, but the higher powers that be still required 90-day plans and sprint planning and everything. It was really hard to meld what they were trying to do in technology with what the bank as a whole wanted to do.
So it is about trying to change perceptions from the old world, where you roll back changes and operate in a really waterfall kind of way. It's trying to show that operating in this new way delivers value, and that it solves the problems you were trying to solve with the old framework but also provides more value. So it's trying to project that upwards, which is always the challenge.

Yep. Yeah, that's true. Different parts of the company change at different speeds, and it's about trying to make the culture expand out of your team. If you have a team that's doing really well, try to share it. Something we've done before at clients: you have your platform team and you have application teams. Try to find people in the application teams who are interested in the ops side of things and rotate them into the platform team, and maybe rotate someone from the platform team into the development teams. That bridges understanding between them, and it at least helps with the application dev versus application ops and platform dev versus platform ops thing. But solving it outside of technology is even harder, unfortunately.

Product managers are fairly rare in teams, even the ones that I've seen. So this is almost a call to action. It's true even of the teams that I work with.
We have a lot of trouble convincing the company that a product manager is valuable, because they're not part of the engineering pool, but in a way they deliver more value than adding another engineer. So this is where the organizational change thing comes in. It is hard to do that. On the teams that I've seen work really well, maybe they've changed the reporting structure entirely, or they've made shadow reporting structures, where everyone on the platform team has the same role, effectively platform engineer, and they all report to a central project manager or product owner, or at least appear to report to them. That way it consolidates everything.

Those are some of my sources. The slides are uploaded, so don't worry about taking photos. I guess I could throw in that I learned all of this because I go in and pair with customers.

Try it now. There we go, now it works. Did you see a flattening of pay scales? At our company, the ops team is not paid as much as our developers, because they're just different skill sets. Did you see any sort of flattening of pay or of career tracks?

I can't comment on pay, because I go in at the engineering level, not the levels above it. We do see a lot of ops people getting more interested in the developer side and trying to learn the development tools, and I've also seen developers get more interested in the ops side. I started as a developer and moved into the ops side, but the first case is much more common. So in terms of skill melding, definitely. I have no idea about pay structure.

On sandbox as platform: one of the things that I think organizations forget about is that the platform supports the developers and has a whole bunch of customers there. You take the platform out for an upgrade, you've killed development. Yeah, so you have to do it in a non-customer-facing thing, which for everybody else is dev, but for you as platform...
Yeah, it's something else. As platform, dev and production are both production, so any change to dev is a change to production. You need somewhere that isn't either of those to test things at the platform level. Anything else?

Great, thanks, Colin.

Thank you.