It's difficult to get there, but it's really worth it when you do. And if you have a model for your automation, other people can contribute more efficiently and more quickly, and it feeds on itself. Getting to that tipping point can be difficult, but once you've got the right tooling in place it can be very effective.

There are some cases where you probably don't want to automate, and you have to ask yourself: is it worth the effort? If there's a task you only do once per year on a handful of servers, it might not be worth automating, because you don't do it often enough and the effort it would take may not pay off. That's worth bearing in mind.

But the things you can automate go well beyond unit tests, particularly from a DevOps point of view. You can automate your development environment, of course, but you can also automate the integration of that software through the QE cycle and into production. You can automate the provisioning of your infrastructure using tools like Chef, Ansible, and Puppet. You can automate your change management and release management processes. There's a subtle difference between the two: change management could be changing the memory footprint or the threshold for a certain setting inside the platform, whereas release management is more about doing deploys in a continuous delivery fashion and managing the releases coming from the engineering team through into production. Automation is also very valuable in a scheduled maintenance context when you're dealing with the cloud, because you can take the tried and tested techniques and deployment processes that have already been exercised by the developers and by your QE team and, using the same tools, go into production in the same fashion, since you've already automated all of those processes. I could spend all day talking about automation, but I won't.

As I said, change management can include release management, configuration management, and data management. Release management, from our perspective in the DevOps team, is managing the various software releases and the cadence of those releases: making sure they flow smoothly from development into QE and into the SaaS platform, and that you have a repeatable process for doing that. If your code is going out into production, ask yourself: what's the impact of deploying this? Is it something we can dark deploy? Should we dark deploy it? Is it a major, big-bang release where everything is launched on the same day, or are you drip feeding some of the functionality into production? Be aware of how your code is actually going to end up being deployed into a production environment. If you're deploying some new replacement technology, do you need to do any data migration? Could that mean you need a scheduled maintenance window or a change window to deploy your software? Think about these things, talk to the DevOps team, and ask them what impact it might have. It could just mean you need to give more advance notice to customers about a certain software release, and that's fine, but know about these things before you commit to timing around the deploys.

If you're deploying into a scalable platform, onto multiple VMs running the same software, it's nice to think that you could deploy to every VM in parallel at the same time, and that it would take far less time.
But you shouldn't do that, because if you deploy to all of them at exactly the same time and you're bouncing EAP or something like that, you're going to take the system off the air. So be aware of that. Is there a way the DevOps team can devise for you to deploy to the VMs in series, taking each one out of the load balancer, deploying, and putting it back? OpenShift and lots of other tools have nice ways of doing that as well. Do you have a rollback policy? Do you have a process for doing rollbacks? You can have a process but not have a policy. A policy could be: if it's not going well in the first hour, we're rolling back, because it's clearly not working; you can set time limits on these things. Again, talk to the DevOps team and be aware of these things when you're developing your code, because the people deploying the software don't like surprises.

Configuration management, again, is another form of change management: tweaking settings inside the software. If you do have a change you need to make, is there some way you can inject that property into the platform using an API? That would be good. Or do you have to restart some components inside the platform to do it? That's okay too, as long as people know about it, you have notes in your Jira tickets, and you instruct people that it's a requirement of the release.

So you're introducing a new feature: how much data is it going to consume? Is it going to start churning out gigabytes of new data? That's okay too, as long as people know about it, so you can do capacity planning. If it is generating a lot of data, maybe some of it is temporary, throwaway data; will your software self-prune that data after 30 days? If not, it should. If it doesn't, you need to tell somebody and work with people to manage the impact of that in your production environment. All of these things can lead to surprises that the on-call team or the production team probably won't thank you for, so it's good to talk about them.

Capacity planning is another interesting area. Again, as I mentioned earlier, if you think your software is going to consume additional memory, disk space, or CPU, let's have that conversation. You could be driving up the cost of the platform with your shiny new feature because you need a terabyte of storage or gigabytes of memory on every server, or some new piece of Java that doesn't exist on that particular layer of the platform. Know your market; know the context in which your software is going to be used. Think of a mobile app: how many end users are there going to be? Is it a driving test system where there are really only 100 driving testers in a particular country, or is it a full-blown airline app with hundreds of thousands of users every day? What's the impact of that? Have you considered the resource implications and discussed them with people? There's a very good article, "capacity planning on the back of a cocktail napkin": look it up. It helps you understand that, okay, you've got 50,000 users, but at peak hours only 10,000 of those will be awake, and only 5,000 of those will be coming to work while the rest are on the tram. It helps you understand that the user pool may be huge, but in practice, how many people are actually going to be using the application at the same time? Do a little bit of research into that and come up with some reasonable assertions, as in the sketch below.
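As a minimal illustration of that back-of-the-napkin style of estimate, here is a sketch in Python. Every constant in it is an assumption made up for the example (the talk only gives the 50,000 / 10,000 / 5,000 progression), so treat the shape, not the numbers, as the point.

```python
# Back-of-the-napkin concurrency estimate: start from the total user pool and
# apply rough, clearly stated assumptions to reach a peak concurrent-user figure
# you can put in front of the DevOps team. Every constant here is an assumption.

total_users = 50_000          # size of the whole user pool
awake_fraction = 0.20         # share of users awake during your peak hour
active_fraction = 0.50        # share of awake users actually using the app
requests_per_user_min = 2     # rough requests per active user per minute (assumed)

concurrent_users = total_users * awake_fraction * active_fraction
requests_per_second = concurrent_users * requests_per_user_min / 60

print(f"Estimated peak concurrent users: {concurrent_users:,.0f}")
print(f"Estimated peak request rate:     {requests_per_second:,.1f} req/s")
```

The precision matters less than having the assumptions written down where they can be challenged.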
If you get it wrong, at least you've had a discussion about it, and if your customer gets upset you can say: look, we did capacity planning, here's our logic, this is the data we had, we're sorry; we didn't just put it into production without considering it. Similarly, what's the impact of your software going to be on the existing environment? Are you going to blow the RAM requirements in production or not? What's the impact on the system going to be when masses of people come along to use your new feature? Understand these things and have that discussion.

At the end of the day, when your software and your code is in production, somebody is going to have to support it. So what can you do to help that team without them having to bother you to debug your software? There are a number of things. Because I'm a cloud guy, monitoring is top of the list: you can never have enough monitoring, and automated monitoring is even better; I'll have a slide on that in a moment. Similarly, it goes without saying that your product documentation should be accurate, so the customer can self-manage their own issues, and that you have good KCS articles, or knowledge articles, in the support system, so that when somebody searches for an error string they find the right article and somebody has already written the answer. These are tried and tested approaches.

If there's some particular aspect of your software that needs to be administered, and you know you can do it from the command line or by tweaking the Apache configs, that's great. But it would be even better if the support engineer or the customer services team, if you will, could administer your product for you; that way you don't have to do it. You don't want to be launching software into production that needs a rockstar DevOps guy with command line tools to configure it. That's not a good idea. So the ability for other people in the support organization to administer your software is very important.

Logging, again: if you're producing logs in your software, think about who's going to be looking at them. Sometimes it's you; sometimes it's a support engineer. What's the quality of the logging like? How useful is it? Are the timestamps consistent? Is the information helpful? Is it an access log or a debug log? Who's the audience? Most of the time, when you're developing, you're probably putting in log statements that will help you debug the software during development. That's great, but other people may be trying to do something similar with your software after the fact, and it can be very helpful to think about some lighter-weight information you could put in there that's in the headspace of the person trying to support the software (there's a small sketch of what consistent log lines can look like a little further on).

Again, tooling for support: what tools can you provide to the support team so they can help support your software? Command line tools are great, but there are lots of other options as well: monitoring dashboard access, regression testing tools, and things like that. And you will need a sustaining function in your organization that can support a particular release of your software for a period of time after it's been released. Accept that now and life will be better for you, because escalations can happen, and otherwise you won't have any processes or resources in place to deal with them.
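Here is the logging sketch mentioned above: one uniform format with an unambiguous timestamp, shared by every component. It assumes Python's standard logging module, and the field layout and the "payments-api" component name are just illustrative choices, not a format from the talk.

```python
import logging

# One shared format for every component: timestamp, severity, component name,
# then the message. Consistent fields keep centralized logs easy to grep and
# easy to correlate across services.
logging.basicConfig(
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S",
    level=logging.INFO,
)

log = logging.getLogger("payments-api")  # hypothetical component name
log.info("request handled route=/charge status=200 duration_ms=42")
log.warning("upstream slow host=db-primary duration_ms=950")
```

Whatever format you pick, the value comes from every component using the same one.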
In particular, in the wake of a new release, adoption can be quite positive, but there can always be feedback coming back about how to use the thing, or feedback on the documentation after the fact.

So, monitoring: one of my favorite topics. You can never have enough, as I said. The types of monitoring we have found invaluable over the years in Red Hat Mobile start with system health: is the system working correctly right now? Are the dashboards green? How many monitoring checks do you have? The more you have, the better, but how accurate are they? Did you test your monitoring checks during pre-production, and did you force them to fail to make sure they go red? Because you can write a monitoring check with a bug in it, and it won't turn red or won't send you an email when that situation actually happens. So have you tested it? If you're provisioning new infrastructure and you've got really good monitoring dashboards with 50 different monitoring probes per server, then when your dashboards are all green, you're done. The dashboards used by the support team to analyze the health of the system are the very same ones that tell you: right, we can close the change window, we've reached our all-clear criteria for this work.

So can you contribute something in your software that exposes a little API where somebody can interrogate the health of your component? Maybe you could return a simple JSON object in a consistent format that all of your software components conform to. That can then be integrated into the monitoring systems, and they can trigger alerts when the status string is not OK, or something like that. Think about these things.

When you're doing scheduled maintenance, it's absolutely essential to look at your dashboards before you start. They may not all be green; there might be one yellow in there because a backup is happening or disk space pruning is going on. So many times we've done scheduled maintenance, looked at the dashboards when we finished, found one check that's orange, and spent about two hours trying to figure out what we broke, only to find it was orange before we started. We didn't actually break anything; we didn't regress in the window. So it's crucially important to do that.

If you're integrating with customers' back-end systems over a VPN connection, as we do in Red Hat Mobile, have you got a way to determine whether that connection is up or down? Monitoring, yes. If the customer is talking to you about a VPN connection, the chances are you're having a discussion about their various subnet ranges and which systems in their network you're going to want to talk to, on what ports, with what protocols. Instrument those into a monitoring check and expose it to your support team. Bob rings up and says, hello, my name's Bob, my app is broken. Click, click, click, view: hey Bob, you know what, one of the systems inside your own network is down according to our monitoring. Thanks for calling, Bob. That's a really helpful thing to make available to your support team.

We've also developed a number of tools to warn us about SSL certificates: when are they going to expire? Are they vulnerable to POODLE or Heartbleed? Once you write that check once, you can point it at anywhere in your cloud environment where you have security certificates installed.
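A minimal sketch of that kind of certificate check, using only Python's standard library: it measures the days until a server certificate expires. The 30-day threshold and the example host are arbitrary choices for illustration, not values from the talk.

```python
import socket
import ssl
from datetime import datetime, timezone

def days_until_expiry(host: str, port: int = 443) -> int:
    """Connect over TLS and return the number of days until the server
    certificate expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # notAfter looks like "Jun  1 12:00:00 2026 GMT"
    expires = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    expires = expires.replace(tzinfo=timezone.utc)
    return (expires - datetime.now(timezone.utc)).days

if __name__ == "__main__":
    remaining = days_until_expiry("example.com")   # hypothetical host
    # Behave like a monitoring probe: complain well before the deadline.
    print(("WARNING" if remaining < 30 else "OK") + f": {remaining} days left")
```

A protocol probe, for instance whether SSLv3 is still enabled, has the same shape: a small check you point at every endpoint and wire into the dashboards.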
When there was the big security scare with Heartbleed and POODLE and whatnot, I think it was last year or the year before, our reaction was to go to the internet, get a little Python script that helps you determine whether a given system is vulnerable to POODLE, and deploy it into our automation system. About 10 minutes later it's coming back and telling you: okay, you've got POODLE here and here in your pre-production, but you're fine in production. Issue a security advisory on our status blog: hey everybody, we're done here, nothing to see here, move on.

Log rotation: simple little things. We have monitoring checks in production that tell us if any one log file grows above a certain size threshold, usually one or two gigabytes. If that happens, chances are the log rotation mechanism somewhere in that system isn't working, because that file should be getting rotated every day and shouldn't be growing that big. Those little things are invaluable. And if you find it is happening, log a ticket and have somebody correct the automation system, so that the log rotation settings in configuration management get fixed in the next release.

Taking inspiration from test-driven development, we like to do test-driven ops. Another favorite phrase of mine is GIGA ops. Armed with an army of monitoring dashboards, and what we call a cheat sheet, a how-to-fix guide for most of the items listed on the monitoring dashboards, we've coined the phrase GIGA ops, where GIGA means get it green again. If you're on call in the middle of the night, a system is broken, and you look at the dashboards and two or three checks are red that shouldn't be, then chances are that if you follow the instructions in the cheat sheet, usually something like restarting a service, and you can get the dashboards green again, the system is back on the air. That's often enough when you're on call; "get the cow out of the ditch" is one of my favorite phrases for it.

I mentioned product administration earlier. I like to express this in the context of the three I's: any product should expose an API, have a CLI, and preferably be available in a UI. That's the ideal. You don't always have the UI, but the other two are absolutely essential.

Similarly, logging. Your logs should probably be centralized so you can look at them all from one place. But are they consolidated into one stream, where you can look at all of the logs in the context of each other, or do you have to fish through log files belonging to this system over here, then swap over to another system, comparing timestamps and trying to trace the flow of the issue you're troubleshooting? And are your logs consistent? Do they all use the same timestamp format? Do they all use the same number of fields, and so on? If they're developed by different people, they might not be, and that can be very hard to support even with consolidated, centralized logging: each log entry looks different, it's much harder to debug, and you can't use crazy bash one-liners to snarf out the interesting information as easily.

Again, tools in the context of support: if you can give your support team access to the regression test tools so they can run them, maybe they'll have a reference app in the platform, an app that is known to work all of the time. And Bob rings up and says, I'm developing an app and it's not quite working.
You can say: sure, Bob, here's one we wrote, and it works. Would you like to take a look at it and compare the differences? And there can be a selection of apps available: one relating to camera functionality, for example, or best-practice examples. If you can give your support team some sample code they can use as a reference to help customers fix their issues, and to reduce the amount of time spent repeatedly doing the same thing for those customers, that can be a big help as well.

Sustaining is difficult to quantify: what is sustaining? In our world, sustaining is not about new features; it covers bugs, enhancements, and documentation deficiencies. Your documents can be incomplete, inconsistent, or wrong, or you might not have a feature documented at all. Again, have a mechanism and a software process that lets you correct these things efficiently and you'll be fine.

So, a couple of slides on some other elements of running a SaaS platform which you might not have much visibility into from a developer's perspective. When you're running a SaaS platform with real enterprise customers paying a lot of money, business continuity, keeping the show on the road, and uptime are all very important. If your system can't tolerate faults, then you won't be able to have a highly available system, and you're going to have service availability challenges and issues which could impact your revenue. You need well-documented standard operating procedures, and these should be born out of talking to the development team: this is how it's meant to be done; if you do it like this, it should get your dashboards green and achieve the desired result. If it doesn't, let's fix it; let's update the procedure based on a deployment note or release note that's come through with a release. You also need an effective way to communicate to your customers that there's an outage happening but you're working on it; most customers will be happy to know that somebody is awake and working on their issue, and they'll leave you alone to do the job. We do this, as most companies do, in the form of a status website, where you can go and look: is something wrong here? Is Amazon having an issue? Oh, they are. Great, now we know it's not my code; I'll wait for them to fix it. Similarly, if you don't have a good infrastructure partner with good resiliency, scalability, and elasticity features, it's going to be very difficult to have any of those things.

So, four types of fault tolerance that we've come to recognize. Essential maintenance, if you will, is when Amazon rings you up and says: that VM over there, we think it's on a server that's about to die; you need to move it; we'll move it for you on this date, but if you want to move it yourself before then, that's fine. You need a mechanism to deal with that. A software patch comes out or a security defect is announced; you need a mechanism to deal with that too. It's not quite a fault, but it's an issue you may want to deal with and change. And then, of course, the unexpected: stuff happens all the time, like network peering issues. If your system isn't able to tolerate these faults, it won't be highly available. You also need to be able to deal with scheduled maintenance, where the developers say: hey, this release is going to require a one-hour data migration. How are you going to do that?
Communicate to your customers, set the expectation, arrange a change window, and get it done. Some scheduled maintenance can be service affecting and some may not be. Generally speaking, if it's not going to be service affecting, you don't need to do much about it, but sometimes it can be helpful to put up an article saying: we're going to make a change, it's not going to be service affecting, just letting you know. Then if something bad does happen, you can at least say, well, we did say we were making changes, and we're sorry.

High availability is absolutely key. I'm often asked what the difference is between resilience and redundancy. My answer is that resilience is really about keeping the application tier going when something goes wrong. Now, you can't take a piece of software, deploy it onto multiple servers in the cloud, and call it a cloud application; it needs to be architected and designed with the knowledge that it's going to be deployed across those servers, otherwise it won't be scalable. That's really about having a stateless computing model, where there's nothing shared across the application tier. Redundancy is more about preventing data loss, so it's database replication, Mongo replica sets, that sort of thing. That's my definition of the difference between the two.

Disaster recovery: old school disaster recovery is to have a second data center in some other city, which doubles your costs. The more modern model on Amazon is to use their cross-availability-zone features, where your platform is horizontally scalable but distributed across those highly connected data centers. That doesn't double your costs, and if an entire data center or Amazon availability zone goes off the air, their load balancers are capable of balancing across the remaining availability zones. You can still achieve the same levels of uptime, because your horizontal scalability, resiliency, and redundancy should keep you on the air with the availability zones that are still there.

A quick slide: note the striking difference in the amount of downtime per week you're allowed just by adding one nine. It's staggering: two nines allows you just under two hours per week, which is a healthy amount of time, and just by adding one nine that drops to about 10 minutes. Be aware of this. If customers ask whether you could do four nines, which is barely a minute of downtime per week, perhaps, but it may well increase your cost.

Communication strategy. The key elements here: if you can provide notifications in your UI, you should, and present a status dashboard where you can communicate these things, so people can come and say, oh look, there's a red, I'll come back in an hour. A status blog: we're having an issue, there's a security advisory, we're doing scheduled maintenance. That will prevent your customers calling up. Get them used to looking at your status website and your blog for interesting information and they'll probably log fewer support calls and phone up less often.

Finally, security and governance. Again, it's an area that's quite important in the enterprise world. The main questions I would focus on here: what are you securing? Are you securing your product? Yes. Are you securing the infrastructure it runs on? Yes, you're doing both. What type of encryption are you doing? In transit, that's SSL; at rest, that's encryption on disk, EBS encryption on Amazon. How do you manage your vulnerabilities? You can do vulnerability scanning or you can do penetration testing; they're different.
Customers might want to do one or both. You need a way to deal with security issues and to issue a fast response. Know how the data in your platform is going to be inserted and managed, who's controlling it, who has access, and where it will be. What happens when a customer leaves? How do you prune their data from the platform? Again, it's not quite developer stuff, but it's stuff to be aware of, in case you can make a contribution or come up with new ideas on how to help.

So, finishing up, to recap: what makes DevOps happy and what makes DevOps sad? Some of the things that make DevOps happy include fault tolerance: if you have good fault tolerance, your DevOps people are going to be happy. If you've got high availability, they'll be happy. If you've got automation, they'll be really happy, and so will I. If you've got security, your customers and everybody else in the organization are going to be happy too. Monitoring: monitoring is king, I can't talk enough about it. Logging, as I mentioned earlier, is the unseen hero, but it can be absolutely priceless to have good logging and to think about who's going to be reading your logs. And then there's the ability for your product to be supportable by those who will have to support it; anything you can do there is going to make everybody's life easier.

And the things that will make you sad, ooh, that text is a little bit smaller on the slide. The consequences of not having these things: if you don't have fault tolerance, your service levels could be impacted. If you don't have high availability, you may have to issue a service credit to your customer, so your revenue is going to be down. If you don't have automation, then you probably can't do continuous delivery in the way that you want. If you don't have security, there's a brand and trust issue as a consequence. If you don't have monitoring, you could have serious quality problems, because you're not catching faults before they go into production, or you're not enabling your quality team to detect when a certain port stops listening in pre-production because you restarted a service and it didn't come back online correctly. If you don't have good logging, your support team won't thank you for it. And if you don't have a good support organization, you're going to pummel your sustaining team with escalations from customers. So really, working together to achieve the items on the left can eliminate a lot of the risks that would make people very sad. I do thank you for your time and thank you for listening; made it just in time. Are there any questions?

I'll repeat the question: you're asking whether there are any tools I recommend for certificate management. To be honest, the OpenSSL command line tool has tons of support, so with a little bit of creative use of that tool, plus grep and awk, you can tease out the information quite easily. That's what we use: a couple of lines of bash and you're done. And there's usually not a lot of variation between detecting the certificate expiry time, or whether the certificate chain is still valid, or whether SSLv3 is still enabled, or whether a certain cipher is enabled; you can detect a lot of that using the OpenSSL command line tool (there's a rough sketch of one such check below). I'm also very excited by Amazon's new certificate management service, where the certificates are free and there are APIs to create them. That's really exciting, because one of the things that is difficult to automate today is the creation and provisioning of your SSL certificates.
But Amazon, of course, comes along with a great solution to these things most of the time. Let's Encrypt? Yes, Let's Encrypt is a new one as well; we saw that one before Amazon released theirs quite recently. Okay, no problem. Thank you.
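For reference, here is the rough sketch promised in that answer: the kind of check the "couple of lines of bash" approach describes, wrapped in Python and shelling out to the openssl CLI to ask whether a certificate chain verifies. The host name is only an example, and the exact output text of openssl s_client varies between versions, so the string match is illustrative rather than definitive.

```python
import subprocess

def chain_verifies(host: str, port: int = 443) -> bool:
    """Run `openssl s_client` against host:port and report whether it says
    the certificate chain verified, based on its "Verify return code" line."""
    result = subprocess.run(
        ["openssl", "s_client", "-connect", f"{host}:{port}", "-servername", host],
        input=b"",               # close stdin so s_client exits after the handshake
        capture_output=True,
        timeout=15,
    )
    for line in result.stdout.decode(errors="replace").splitlines():
        if line.strip().startswith("Verify return code:"):
            return "0 (ok)" in line
    return False

print("chain ok" if chain_verifies("example.com") else "chain BROKEN")  # hypothetical host
```

Variations on the same couple of lines cover expiry dates, SSLv3, and cipher checks; that's where the grep and awk creativity comes in.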