Excellent, right. Thank you very much for coming back from lunch, nice and promptly. I do realise I'm tearing you away from food. My name is Daniel Jones, and I'm the CTO of Engineer Better, a UK Cloud Foundry consultancy. In the last few years, I've been working with Cloud Foundry for a variety of different organisations, and more recently, one of them was a global wealth management enterprise. Now, if you've worked with financial services, you know they can be a little bit antsy about information being leaked about them, especially if it's not from one of their own members of staff. So, this talk is grounded in the experience of operating Cloud Foundry at a global wealth management enterprise. Unfortunately, I can't give you really specific details, because I'd be breaking NDAs and all sorts. So, we'll just have to be a little bit more abstract. When I'm talking about a global wealth management enterprise, what does that look like? Billions of dollars. There are billions of dollars at stake. If you include parent companies, then we're actually talking trillions of dollars, but again, I don't want to give too much information away. Because the stakes are so high, security is of paramount concern, right? So, there are going to be lots of policies in place to make sure that people's money is safe. Because the stakes are high, there will be multiple data centres. This is something that we're seeing more and more, certainly in the UK, with financial organisations adopting Cloud Foundry: they have their own physical tin data centres, and they want to be operating Cloud Foundry in each of them. As well as having multiple data centres, they're going to have multiple Cloud Foundry instances in each data centre. So, in terms of architecture, we're talking about multiple data centres with a big enterprise load balancer at the top. I'm sure you can guess what kind of brand.
And then Cloud Foundry instances in those, with local traffic managers routing traffic to them. You've got a production Cloud Foundry in each data centre that is serving traffic to end users, so customers of the business. And then non-production Cloud Foundry instances as well, which from a PaaS operations team's point of view are still production, right? Because your app development teams are going to be pretty cross with you if your Cloud Foundry goes down whilst they're trying to develop their apps. As we go through this talk — I've got the privilege of being a Cloud Foundry ambassador at the summit — I'm going to share with you, basically, if we were having a chat out there and you're a financial organisation, the advice that I would give you, the lessons that I've learned and the things that I think would help you. Before I go any further, can you hear the squeaking through the microphone? Should I stop wandering about? Because this floor is really squeaky. No? OK, shout at me. Heck, I'll throw something if it gets too squeaky for your liking. So, we've defined, you know, the kind of organisation we're talking about. We've briefly touched on the architecture. So, what is the advice that I would give you? If you're going to integrate Cloud Foundry into an organisation like this, or any organisation, really, it's really important to define success. What is it you're trying to achieve by adopting PaaS in this organisation? Are you trying to allow people to develop the apps they've always been developing and get them into production quicker? Are you just trying to reduce time to market? Have you got a load of monoliths on tin that you're trying to migrate, turn into cloud-native microservices and deploy onto a PaaS? Or maybe you're doing something closer to innovation accounting, where you're trying to enable new types of application that didn't exist before, new types of things the company couldn't do before that are now possible.
All of those things are going to require focus in different areas. And if you're not clear about what it is you're trying to achieve, then where do you focus your operations resources? This isn't just something for CIOs and CTOs who are spending their money on Cloud Foundry. This is relevant for the operations team as well. If you're doing a migration from monoliths on tin to cloud-native microservices, friction is going to be your enemy. The app teams are not going to want to move from their comfortable world of change control requests and putting things on pet servers if there's friction in the way. However, if you're trying to enable new types of application, then features and functionality are going to be much more important. You're going to want to have new data services, for example — maybe stick a Cassandra in there that they're not used to. Be clear about what it is you're trying to achieve, measure it, continually improve against it. If you're bringing Cloud Foundry into an existing organisation with some heritage and some legacy, there's going to be an existing context, right? Before you came along with your hip new PaaS, they will have been getting along just fine with bureaucracy and rules that they've had in place for years. Those policies were created at a time when they didn't have a PaaS, right? So they're obviously not going to fit any more. However, the people that you're talking to — the app development teams, for example — will still think that they apply. An example of this was an app development team that we were working with who insisted that they could not use Cloud Foundry, they could not use the PaaS offering in this organisation, because their developers weren't allowed to push stuff into production, right? So the whole self-service thing went out the window. For DevOps people, that's crazy, right? Because we want to go from ideation to production, looking after the whole thing all in the same team.
But if you don't have all of the assumptions that go with DevOps and continuous delivery and CI and all those sorts of things, then actually you can understand why this might be a bit scary — the idea of some cowboy coder sneaking a backdoor into a financial system and sticking it straight onto a production server. So we were talking to the app development team, trying to get them to onboard onto the PaaS, and they insisted that, no, we can't do that. There's this policy over here. It's been etched into a stone tablet. We can't possibly change it. We went to the security and the compliance folks. We asked them about the regulations that informed that policy. Why do we have this policy in the first place? How can we change it? How can we work around it? How can we have a new policy? What if we insist that all of their code is under source control? Everything is delivered via CI. What if we have static analysis as part of that? What if we lock down all of their production spaces so that humans can't log into them, but only automated systems can? What happens if we use a tool like IBM's UrbanCode Deploy? I don't know if you've used UrbanCode Deploy. It's got a big, horrible, enterprise-y GUI, but it serves some really useful purposes in the context of Cloud Foundry. It's got really granular user permissions, much more so than Cloud Foundry itself. Everything's audited, including all the binaries that it pushes. So not a temporary cache like you find inside Cloud Foundry, but actually archiving every binary pushed to every environment. And more importantly than all of those, it served as one place that we could do secret injection. The PaaS operations team wrote some tooling to interpolate properties that UrbanCode Deploy knew about into manifests, which it would then push into the production spaces. We took that to the security and compliance folks. They matched it against the regulations. They were like, yeah, this makes sense. We can do that.
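To make the secret-injection idea concrete, here is a minimal sketch of interpolating deploy-tool-managed properties into a Cloud Foundry manifest before pushing. The `((placeholder))` syntax, the property names, and the manifest content are all illustrative assumptions, not the team's actual tooling.

```python
# Hypothetical sketch: fill ((placeholders)) in a CF manifest with values that
# only the deployment tool (not the developers) knows about.
import re

def interpolate(manifest_text: str, properties: dict) -> str:
    """Replace every ((name)) placeholder with the corresponding property value."""
    def lookup(match):
        key = match.group(1)
        if key not in properties:
            raise KeyError(f"no value supplied for placeholder {key!r}")
        return properties[key]
    return re.sub(r"\(\((.+?)\)\)", lookup, manifest_text)

manifest = """\
applications:
- name: payments-api
  env:
    DB_PASSWORD: ((db-password))
"""
# The secret never lives in source control; it is injected at deploy time.
print(interpolate(manifest, {"db-password": "s3cret"}))
```

Failing loudly on a missing property matters here: a manifest pushed to a production space with a literal `((db-password))` in it is worse than a failed deploy.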
We went back to the app development team and said, yes, yes, you can use Cloud Foundry. That policy now doesn't apply to you. You're going to find that these policies persist — people will insist that they still apply, even though they don't. Reinterpret the regulations; come up with new policies appropriate for the time. The kind of policies that you will encounter are likely to involve some amount of manual approval, right? Why do enterprises like manual approval? Well, for lots of reasons, one of them being that they like to have someone to blame. They like someone to be responsible for making a decision. Also, you will find that in large enterprises, the IT organisation and the app development organisation could be completely separate legal entities, right? And it's the IT organisation's responsibility to gatekeep, to make sure that things are done sensibly, that all the rules are applied and things like firewall rules are correct and match up to what they should do. Having those manual processes in the critical path of what your app teams are trying to do is going to prevent your PaaS from being self-service. If your PaaS is not self-service, then it is platform as a service, okay? There should be two Ss in PaaS. Make sure that you've got the right S and not the wrong one. An example of this was in an organisation where they had this idea that they wanted a declaration of everything that should be configured for an app team to use Cloud Foundry, and this was actually quite a good idea, right? So all of the spaces they wanted, all of the users, all of the roles for those users, user-provided services, the IP addresses and ports of things they were connecting to, all in one place. That one place did happen to be an Excel spreadsheet, which wasn't ideal, and it was raised through a change control ticket to the PaaS operations team who would, you know, tap away on the cf CLI and make it all happen. That meant that there was a stop in the process.
It couldn't be self-service. People had to raise a change control request and wait for it to be implemented. That introduced friction. So when you're trying to convince people to start using your PaaS because it's a sensible thing to do, you're taking them out of their comfort zone, but you're not really making their life that much easier. Resist any temptation to have manual processes in the critical path of getting apps onto your PaaS. As well as manual processes in the critical path of users of your PaaS — so manual processes in the workflow of your customers — it's really important not to have manual processes in the critical workflow of the PaaS operations team itself. I mean, the keynote yesterday was talking about automation taking over the world. Really important, right? If you have loads of slow manual processes, sooner or later someone is going to come along and disrupt your company and do everything much more efficiently and more productively than you. That's probably why you were using Cloud Foundry in the first place, right — to protect yourself against that. The problem with manual processes is that they take time. And by the time that you realise that they're becoming painful, you won't have the time to automate your manual processes, because you're spending all of your time on manual processes. You get stuck in a trap. You can't get back out of it again. The book in the background there, Scarcity, is about how people get stuck in traps when they don't have enough of something. There are psychological effects that have been proven and demonstrated when you don't have enough of a resource and when you're aware that you don't have enough of a resource — be it money, be it time. You can only focus on the task right in front of you. Your ability to plan into the future, your executive control, is inhibited. You are less able to do abstract problem-solving and dig your way out of that hole. Don't get yourself into this trap in the first place.
You might be tempted not to spend the extra time up front automating things. But everybody that I talked to when I was going back and creating this presentation, all the different teams that I've worked with over the years, really wanted me to make this point to you: don't get yourselves into that trap. So why isn't there more automation in enterprises? Why is it not the default state of things, everything automated already? Well, one reason is that bureaucracy is yesterday's automation. And I'm sure that in 50 years' time, people will be looking at all the automation that we've put in and complaining about it just as much as we complain about rules and policies, right? It was the better alternative to chaos and anarchy. Another reason is that when people that practise continuous delivery talk about automation, we're actually talking about autonomation. In the Toyota Production System, which inspired Lean and the Kanban development methodology, Taiichi Ohno talked about autonomation being automation with a human touch. And that human touch was common sense. It was error checking, right? It was making sure that a machine wouldn't rampantly do the wrong thing; that if pieces came out the wrong shape or size or whatever, the system would shut down. Of course, as DevOps folks, we're not going to deploy something without smoke testing it. We're not going to deploy something without blue-greening it. When you're trying to bring automation into an enterprise, sooner or later you hit upon that person who tells you the war story of: ah, well, we automated this thing a few years ago and it rampantly went out of control and it was a disaster and it spun up VMs everywhere and we're never doing that again. They don't realise that we're talking about autonomation — and we probably don't realise that either.
When you come up against hostility to automation, demonstrate your automation. Show them the smoke tests, show them the safety checks that mean it's not going to rampantly go out of control and do bad things. The other thing about manual processes is that they're soft, right? They're squishy. They're fuzzy around the edges. They don't have hard edges like an API. If I send a request to the Cloud Controller and I get it wrong, I'm going to get told very quickly that I did the wrong thing, right? I get an error message. I get some pain delivered to me. I go back to Go and I don't collect $200. That pain is information. It informs me that I need to do things differently. The information I'm receiving in this picture, by the way, was that I should do what the men on top of me were telling me to. You want to be like an API and pass on that pain to your users, right? An anecdote about this. Remember that spreadsheet I was telling you about, with all the user-provided services and the IP addresses and the ports and things like that? We had 46 user-provided services for one app — because it's a large enterprise, they've got lots of heritage off-PaaS databases. 46 user-provided services, with lots of different IPs and ports for each one of them. The day before a production release, they realise they've got connectivity problems and they want us to debug into it. We say, well, send us your spreadsheet so we can look at the canonical list of all the things you want to connect to. I'm afraid we can't do that. What had been happening? They'd been sending change control requests in without the appropriate information, and then phoning their friend in the PaaS operations team, who was trying to do the right thing. He was trying to help out fellow workers. He knew that they were under pressure. He knew they were under deadlines. So the request comes in of: I know we haven't updated the spreadsheet, but can you just add this rule?
I know we haven't updated the spreadsheet. Can you just take that rule out? Can you just change this port to that? Before long, it had diverged completely. We would have had to go through every Cloud Foundry in every data centre, dump the list of all the user-provided services, all the security groups, and all the firewall rules from the SDN, and try to collate the canonical list from them. Our answer? No. We're not going to do that. You're going to miss your production deadline, because you've not done the right thing. Now, that sounds really harsh, right? It is harsh, but it's fair. If you absorb that slack, then information gets lost. The whole machine continues to be inefficient. If the PaaS operations team was an API, they would not be able to get away with that. If you're going to move your users towards automation, you need to train them that they can't get away with those sorts of things. You need to re-educate them and their behaviour. You can't phone Amazon Web Services and say, oh, with that request I sent you, can you just — you know, that's what I meant. I know I didn't type that, but that's not what I meant. Train them to use automated systems. Pass on pain. Be like an API. We can't be all stick and no carrot, right? That would be bad. If we carried on like that with every single request, then nobody is going to use our platform. We want to help people be successful. We're passing on that pain and educating them for the sake of making them more successful. So we need to understand their context more, right? This — when I read this, I nearly fell off my chair, and then I wanted to print it out, frame it, and stick it up on the wall in my office. Unfortunately, if I did, I'd be breaking my NDA. So this came about where the PaaS operations team wrote a bit of tooling which, if you're running Cloud Foundry, and especially a commercial distribution of it, is quite a good idea.
It swept through all of the non-production spaces and turned off apps at the end of that business unit's working day. If you're an enterprise, chances are you're running Java. If you're using Java, you're not running Hello World in anything less than half a gigabyte of RAM. It's true — Spring Boot, wonderful as it is, you can't get much done without much RAM. You will also have a load of apps called manifest.yaml that you probably don't need to be running, that will be using all the RAM on your Cloud Foundry. So this swept through and turned them off. A product owner emailed his engineers, saying: look, I realise that restarting our apps is scary, but it's part of the brave new world of Cloud Native. I had never thought that restarting apps could be scary. It just totally blew my mind. I didn't even think that this would be a concern of theirs. So why was it a concern of theirs? Enterprise IT is awesome at what it does, but it has a slow rate of change. So they develop an app, at the end of a long Gantt chart, put it on a pet tin server that never changes, doesn't move, and it stays there for two years without being restarted. It just runs. Now we're telling them to restart apps all the time, so they're not used to it. That made me realise how much of a gap there can be between those two worlds. If you're involved in PaaS operations, hopefully you get things like DevOps and continuous delivery, but your customers probably don't, and we need to bridge that gulf. We need to make sure that we understand the context of where they're coming from. If Cloud Foundry is a new thing to your users, then it's going to get blamed for things, right? If there's a problem, it's going to be the new thing. It's not my code, it's that new thing you're making me use.
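The end-of-day sweep described above can be sketched in a few lines. This is a hypothetical reconstruction, not the team's actual tooling: the space names, the cut-off hour, and the idea of emitting cf CLI commands rather than calling the API directly are all assumptions for illustration.

```python
# Sketch: build the `cf stop` commands for an after-hours sweep of non-prod
# spaces, so idle Java apps stop eating half a gigabyte of RAM each overnight.

NON_PROD_SPACES = {"dev", "qa", "staging"}  # assumed naming convention

def stop_commands(apps_by_space: dict, now_hour: int, end_of_day: int = 18) -> list:
    """Return the cf CLI commands to stop every app in a non-production space,
    but only once the business unit's working day (end_of_day, 24h clock) is over."""
    if now_hour < end_of_day:
        return []
    cmds = []
    for space, apps in sorted(apps_by_space.items()):
        if space not in NON_PROD_SPACES:
            continue  # never, ever touch production
        for app in apps:
            cmds.append(f"cf stop {app}")  # run after `cf target -s {space}`
    return cmds

print(stop_commands({"dev": ["manifest.yaml", "hello-world"],
                     "prod": ["payments"]}, now_hour=19))
# → ['cf stop manifest.yaml', 'cf stop hello-world']
```

The production guard is the important part: this is "automation with a human touch" in miniature, a hard-coded refusal to do the one thing that must never happen automatically.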
I've lost count of the times that I received phone calls or emails along the lines of: well, my application works perfectly fine on my machine, but when I push it to your Cloud Foundry, it doesn't work, so I think you've got a problem with your PaaS. At which point, right, you've got two choices. You can do the enterprise thing. You can throw away hundreds of thousands of years of human evolution — where we evolved to be social beings that work well together and have emotional responses to each other's faces and things like that — and send a passive-aggressive email. Well, I've looked at the other apps and I've looked at the monitoring. It all looks fine to me. It must be your problem, mate. Go back and fix it. Or you can do something else. You can find out where they sit, and if they are in the same building as you, you can sneak up on them without them seeing you. Sit yourself down. Right, I hear you've got a problem with your app on Cloud Foundry. Let's work through it together. Plug in a keyboard and we'll work through this one together. You can turn people around from being hostile and scared of your platform into advocates, through pairing with them. You can educate them, show them all the tips and tricks of how you debug things, explain how the internals of the PaaS work. If you just send people passive-aggressive emails, they're not going to get any of those benefits. Whilst I'm talking about pairing: if in your PaaS operations team you are not pairing, you should be doing it now. Pairing is not just for programming, okay? When you are debugging live production issues, you want another set of eyes on what's going on, in case you do something really stupid — like, say, delete all the admin users. You want another brain to help contribute towards that problem. What's more important than all of that, though, is the exposure of knowledge while you're debugging.
When you send an email out that says, oh, well, there was this problem and these were the symptoms and I did this to fix it, that's declarative, semantic knowledge, right? That's like remembering facts for a multiple-choice quiz. I don't know about you, but most people don't remember that kind of information very well. When you're on a journey with someone — debugging, problem-solving constantly — then you're using your episodic memory. You remember journeys. You remember experiences much better than any point of fact. Pair with people in order to expose that learning. Make them remember things. So even if you start bridging that gulf and establishing a lot of context with people, you will get people that break the rules. When I talk about breaking rules, I really do mean rules, because there's likely a service contract between the operations team and the app development teams. In one organisation I was in, the service contract was: thou shalt write 12-factor apps. Thou shalt not put monoliths on Cloud Foundry. Unfortunately, one team insisted on writing their session state locally. So whenever users logged in, the session state got written in one Cloud Foundry, and the others didn't know about it. If we needed to upgrade that Cloud Foundry, then the users would be moved over to another one, logged out, and would lose work in progress. What could we do about that? Not a lot. We're a PaaS operations team. We don't have the levers to be able to go and chastise people. Hell, we'd even got them to sign the contract saying they wouldn't do it. So we used the Simian Army approach. We wrote some tooling, run on a CI server, that would talk to the global load balancer using its REST API and switch off traffic to production Cloud Foundries during production hours, randomly. We told them that we were going to do this. We gave them lots of notice. We were very kind to them, but they knew now that it was going to happen. It wasn't a remote possibility in their heads. It was a forcing function.
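The forcing function above can be sketched as follows. Everything here is an assumption for illustration: the foundation names, the URL scheme, and the request payload are invented, and a real enterprise load balancer's REST API (F5, NetScaler, or similar) will differ.

```python
# Sketch of the Simian Army forcing function: on a CI schedule, randomly pick
# one production Cloud Foundry and take it out of rotation at the global LB.
import random

FOUNDATIONS = ["dc1-prod-cf", "dc2-prod-cf"]  # hypothetical foundation names

def choose_victim(foundations, rng=random):
    """Pick the foundation that loses traffic this run."""
    return rng.choice(foundations)

def disable_request(foundation: str) -> dict:
    """Describe the LB API call; real tooling would send this as an
    authenticated HTTP request to the load balancer's REST endpoint."""
    return {
        "method": "PATCH",
        "path": f"/pools/{foundation}/members",  # invented URL scheme
        "body": {"state": "disabled"},
    }

victim = choose_victim(FOUNDATIONS, rng=random.Random())
print(disable_request(victim))
```

Running it from CI on a visible schedule is the point: app teams know the outage will happen, just not when, so stateless, multi-foundation resilience stops being optional.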
They knew they had better make more resilient systems. If you've got people that are new to writing cloud-native apps and are struggling with the idea of designing for failure, check out a project called Chaos Galago. It's a service broker that will restart apps at a random interval. You give it a probability and a duration, and it will randomly restart apps. That might help people on that journey. That keeps the app developers honest. What about the PaaS operations team? If you don't take anything else away from this talk, I want it to be this: continuously acceptance test your platform, every minute of every day. What do I mean by acceptance tests? In your favourite BDD framework — whether it's RSpec, Gomega, or Lambda Behave in Java — tests that exercise the functionality of your platform from the user's point of view. Low-level infrastructure monitoring is great for knowing that a server has had a problem. But if a server falls down in the middle of the woods and there's no user there to notice it, does anybody care? Well, probably, but not at 4 o'clock in the morning when they get paged out of bed. Don't get distracted by low-level monitoring and data metrics. Use acceptance tests to find out whether you've got a real problem. An example of an acceptance test that we were running: a cf push test. It would check out a fixture. It would push that app. It would hit that app. It would check the response. It would stop the app. It would remap the route for the app. It would hit it again. It would stop it again. It would delete it. We did that every two minutes on every Cloud Foundry in every data centre, which also load-tested our platform to a certain extent. But we did the same thing for our data services as well — RabbitMQ, Redis, those sorts of things. That meant we had great confidence that the system was working. It was on a CI server, so we had a great wall board with red and green boxes. Publish that to the app teams. Publish it to your stakeholders.
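The push-test cycle just described can be written down as an ordered list of steps for a BDD test to drive. This is a sketch of the shape of the test, not the team's actual suite: the app name, hostname, domain, and fixture path are placeholders.

```python
# Sketch of the cf-push acceptance test cycle: exercise the whole app lifecycle
# from a user's point of view, every couple of minutes, on every foundation.
def push_test_steps(app: str, host: str, domain: str) -> list:
    """Return the ordered shell steps the acceptance test would run."""
    return [
        "git clone fixtures/test-app",                       # check out a known-good fixture
        f"cf push {app} --hostname {host}",                  # push it
        f"curl --fail https://{host}.{domain}",              # hit it, check the response
        f"cf stop {app}",                                    # stop it
        f"cf map-route {app} {domain} --hostname {host}-2",  # remap the route
        f"cf start {app}",
        f"curl --fail https://{host}-2.{domain}",            # hit it on the new route
        f"cf delete {app} -f",                               # tear everything down
    ]

for step in push_test_steps("smoke-app", "smoke", "apps.example.com"):
    print(step)
```

Wrapping each step in a BDD assertion (Gomega's `Eventually`, RSpec expectations, or similar) is what turns this from a script into an acceptance test that can drive a red/green wall board.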
Everyone will know how well your platform is performing. That works well for the day-to-day running of your PaaS. What about when you need to do upgrades? I've talked about having multiple Cloud Foundries, which is a really great pattern for insulating yourself from failure. Despite all the cool things that Pivotal Ops Manager and BOSH can do, you can still irrecoverably bork a Cloud Foundry in the course of an upgrade. You need to be able to divert traffic to other Cloud Foundries if that happens. When you're doing upgrades, you need to check that your apps are still working. Why would I need to do that, you ask me — there's a perfect abstraction between PaaS and app? Unfortunately, it's not that simple, especially when it comes to buildpacks. Long story short: OpenSSL issues, lots of upgrades. We upgraded our Cloud Foundry, picked up a new buildpack, apps were using the new buildpack, and we asked the app development team: can you test your app still works on this non-prod Cloud Foundry? Well, we don't have any automated tests. It's an enterprise. They plan things three months in advance on a Gantt chart. The offshore testing team is busy doing something else. So they didn't really give it a thorough test. We upgrade all the production Cloud Foundries. Next thing you know, somebody is debugging byte arrays in hex for a week, finding a bug in Kerberos in the OpenJDK. Make sure that your apps have automated tests; that you've got a hook into their CI system, if they have one, so you can trigger tests against your upgraded Cloud Foundry. If not, have a contract with them that they will expose an endpoint that you can hit that does a dummy transaction — something meaningful that exercises the behaviour of their application. If you're going to have multiple Cloud Foundries, have a development Cloud Foundry. You should be writing some automation and some tooling, which is changing the state of your Cloud Foundry.
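The dummy-transaction contract mentioned above could be as simple as agreeing a response shape for a smoke endpoint. This sketch assumes a hypothetical `/smoke` endpoint returning JSON with a `status` field and a list of `dependencies`; none of that is a Cloud Foundry standard, it is just one possible contract.

```python
# Sketch: post-upgrade smoke check against an app team's contract endpoint.
# The endpoint path and JSON shape are assumptions agreed between the PaaS
# operations team and each app team, not anything built into the platform.
import json

def check_smoke_response(body: str) -> bool:
    """A dummy-transaction endpoint should report an overall 'ok' status and
    every downstream dependency (database, Kerberos, etc.) healthy."""
    result = json.loads(body)
    return result.get("status") == "ok" and all(
        dep.get("healthy") for dep in result.get("dependencies", []))

sample = json.dumps({"status": "ok",
                     "dependencies": [{"name": "kerberos", "healthy": True},
                                      {"name": "oracle-db", "healthy": True}]})
print(check_smoke_response(sample))  # → True
```

Run against every app on the upgraded non-prod foundation before touching production, this would have caught the buildpack-induced Kerberos breakage a week earlier.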
Make sure you're not using the same Cloud Foundry that your users — the app development teams — are using. Some numpty once wrote a piece of code that was creating some users. That's safe, right? We're only creating users. And of course, you should have realistic test data, so people understand the purpose of your code. The only problem was that the realistic test data was the admin user. We all recognise that one, right? That looks realistic. And we need to tear down the data we create in the tests. Ah. Yeah. Don't do that. Have a development Cloud Foundry. If you're using Windows — apologies to any Microsoft folks in the room — if you're using Windows and you're trying to develop Ruby tooling for Cloud Foundry, you're going to have a bad time. You will lose weeks to various issues with things like Nokogiri gem compilation. If you can use Linux or OS X, do: that's what the rest of the ecosystem uses. You will be more productive, because you can leverage more things that the community has made. If you are using Windows, or you find yourself in an enterprise that's using Windows, you can be pretty productive with MinGW — Minimalist GNU for Windows — which gives you bash, and ConEmu, which is a bit like iTerm but with completely different keyboard shortcuts. We used those in the past and they worked really well. I was going to tell you about a load of bash scripts that we wrote to make targeting multiple Cloud Foundries easier, and some stuff we did to print CF_HOME in our bash prompts so we always knew which one we were targeting. Instead, I'm going to tell you about cf-plex, which we've written, and which is a simple shim that will allow you to manage multiple Cloud Foundries and run one command against all of them. It still works with all your favourite plugins. It optionally fails fast, so if you're deleting something that maybe doesn't exist on some of the foundations, it will cope with that. What's more exciting than that, though, is CF Converger, which I've been hacking on while I've been over in Santa Clara.
This is a system that will take a YAML file, and it will make your Cloud Foundry look like what you described in that YAML file. So, hands up quickly if you use BOSH. Cool, enough people. With BOSH, we know that snowflakes are bad. We say, 'make the world look like this, BOSH', and it goes and does it. With a Cloud Foundry, though, we all think it's fine: you just fiddle with your state, change this and change that, and it will be fine — someone will back it up. Converger will take a YAML file and create orgs, spaces, users, and user-provided services, and it can be run against multiple Cloud Foundries. So, if you have to make a Cloud Foundry look a certain way, or set up an app across multiple Cloud Foundries, this will help you. Another tool that is handy if you're using a software-defined network was made by some friends of mine. It's called Virgil. It will sweep through BOSH, look at where your Diego cells are, and will dump out a load of JSON describing firewall rules based on your security groups. And so, hopefully I've given you some useful links to tools, and talked about some lessons that I've learned that hopefully you can apply. Let's move away from the challenges and look at some of the successes. This is something that I tweeted about last year. OpenSSL issues meant lots of upgrades to be done. This work involved a pair in the PaaS operations team occasionally leaning over, going to Ops Manager, clicking next, and then carrying on with their work. So, it took a couple of days, but minutes of people's time rather than hours. The folks that were embedded in this enterprise estimated it would otherwise have taken two years of scheduled weekend maintenance — every weekend for two years. This is several orders of magnitude better than the status quo, right? That is why you're all here. That is why we are here: because Cloud Foundry gives people that orders-of-magnitude improvement in their productivity. I really wish I could share with you the story of this.
This is from a product owner who emailed out to his engineering team, to the IT directors, and to the PaaS operations team: this product release took us weeks instead of months, because we're using PaaS instead of physical tin servers. Again, that is why we're here — not because it's a bit better, but because it's a lot better. Define success. Make sure you're clear on what your objectives are. Measure them and continuously improve against them. Reinterpret regulations rather than reapplying existing policies. If it's not self-service, then it's platform as a service. Demonstrate automation when you find hostility to automation. Be like an API. At the same time, build empathy through pairing. Understand their context. Educate them. Help them learn. Shake out rule breakers by using the chaos monkey approach. Have testing contracts, so you can make sure you can upgrade your Cloud Foundries in a timely fashion. And most importantly of all, do continuous acceptance testing against your platforms all the time, so you have great confidence that they're working. I'm Daniel Jones from Engineer Better. If you have any UK Cloud Foundry requirements, please talk to me. If you want to ask any questions about Cloud Foundry generally, I'm an ambassador and I'm more than happy to help. Do we have time for questions? Yes, yes we do. It's about 10 minutes until the next people are in. So if there are any questions, please stroll up to the mic. Tumbleweed. OK, well thank you very much then.