Welcome to the second talk of the third day of GPN-21: CD, "Site Reliability Engineering Explained, an exploration of DevOps, platform engineering and SRE." CD is a software engineer and technical lead from Germany specialized in resiliency engineering at Microsoft Azure, with a passion for distributed systems, site reliability engineering and resilience engineering. CD focuses on building adaptable systems that can withstand failures while monitoring performance. Thank you.

Thank you, and good morning, GPN. It's the second talk of the day, so I'm quite glad that everyone showed up and the room is not completely empty. A quick introduction about me: I'm a senior SRE, I've worked in this field for a couple of years now, and I do a lot of things outside of work too. I really enjoy analog photography and have my camera with me at this event; you can find me on my socials.

I want to talk real quick about why I'm doing this talk. I've been working in this industry for a couple of years now, and the term SRE never really caught on. Even when I talk to people here at the event, they're sometimes a little bit like: what's SRE? What's that job title about? And because hope is not a strategy, as we say all the time in SRE, I thought: let's do a talk about it.

When we talk about SRE, we always hear the terms DevOps and platform engineering, so before we get started on SRE, I want to clear something up real quick. I think everyone's seen this graphic, right? It's pretty much in the name of DevOps: the intersection of Dev and Ops. But it's completely false. DevOps is not a job title; DevOps is a methodology. It's the same as how we are not called agile engineers or waterfall engineers or even scrum engineers. We are software engineers, right? Agile, scrum and waterfall are just processes, and it's the same with DevOps: it's a methodology, not an actual job title.

The DevOps methodology focuses a lot on communication and collaboration, because part of the methodology is "we build it and we run it." We are responsible for a service end-to-end: we develop a feature, we ship the feature to production, and we also do incident response for the feature we shipped. That sometimes means we have to talk to operations teams, there may be SRE teams involved, and we have to talk between the teams and make sure everyone feels heard. So, as I said, DevOps integrates software development with classical operations, and we do this to improve collaboration. We build feature teams that have software developers and, at the same time, people experienced in operations, so we can actually run our microservice.

More often than not, if DevOps is implemented properly, it looks something like this: you have a product, the product consists of multiple microservices, and each microservice is owned by one team. Usually it's one-to-one, or one team managing multiple microservices, but never one microservice managed by multiple teams. So you have a feature engineering team that's responsible for one microservice, and they go everywhere with this microservice; they do hand-holding for everything. They develop new features, they fix bugs, they release to production, and they make sure the production infrastructure for that microservice is running.

But I hear a lot of you saying or thinking now: well, my job title actually is DevOps engineer.
So that doesn't really fit with what I'm saying, right? Your job title might be DevOps engineer, but in fact you are likely a platform engineer, because platform engineering in essence is working with automation, with infrastructure as code, with all those fancy automation frameworks we know of, like Ansible, Terraform, Chef, Puppet and SaltStack. We use those tools to automate infrastructure deployments and to build infrastructure for the DevOps feature teams to release their software on. And we help the developers do this, by running the GitLab instance for them and operating and maintaining the GitLab runners so they can run CI/CD jobs. We also usually host monitoring for them, so they can actually see what's going on in their services. So if your job title right now is DevOps engineer, you might consider looking into what platform engineering is all about, and you might realize that DevOps isn't really a job title but a methodology, and platform engineering is what you're actually doing.

Platform engineering comes with a set of principles, and they usually revolve around developer productivity, because if you're building infrastructure, you build it for someone, and that someone is usually a software engineer. You help those software engineers release their software faster and more reliably, and you do that, for example, by writing reusable components such as Terraform modules that you can use over and over again.

And last but not least, let's talk about SRE. SRE in its essence is pretty similar to platform engineering: we do much the same work around automation and monitoring, but we additionally focus a lot more on availability and monitoring, and incident response and incident response automation are a big part of SRE. SRE, if you go by the definition, is applying software engineering principles to operational tasks. If you've never heard about SRE before and this is really your first time, and you want to learn more, I highly recommend the book shown here, written by Niall Murphy and the other folks from Google who invented the SRE discipline.

As I said, SRE at its core is really a lot about automation. One of the core priorities of every SRE is eliminating toil, and toil is defined as every repeatable task: everything you do manually, multiple times, over and over again, and that could easily be automated. The benefit of automation is also that the more automation you have, the less human error you have, because if you have a piece of code that executes a task, you can check that code into git, you can have code reviews on it, you can put in error handling, and you don't have an engineer sitting in front of a screen typing the wrong command and causing an even bigger outage.
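To make the toil point a bit more concrete, here is a minimal Python sketch of what turning one such manual runbook step into reviewable, error-handled code could look like; the 90% threshold and the remediation step are invented for illustration:

```python
import logging
import shutil

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("remediation")

DISK_USAGE_THRESHOLD = 0.90  # hypothetical policy: act when a disk is 90% full


def disk_is_full(path: str = "/") -> bool:
    """Return True if the filesystem backing `path` exceeds the threshold."""
    usage = shutil.disk_usage(path)
    fraction_used = usage.used / usage.total
    log.info("disk usage for %s: %.1f%%", path, 100 * fraction_used)
    return fraction_used >= DISK_USAGE_THRESHOLD


def remediate(path: str = "/") -> None:
    """One reviewed, error-handled code path instead of an engineer typing
    commands into a production shell by hand."""
    try:
        if disk_is_full(path):
            # Placeholder for the actual remediation step, or for paging a
            # human if automation cannot safely resolve the situation.
            log.warning("threshold exceeded on %s, running remediation", path)
        else:
            log.info("nothing to do for %s", path)
    except OSError as exc:
        # Explicit error handling: fail loudly, never half-apply a fix.
        log.error("could not inspect %s: %s", path, exc)


if __name__ == "__main__":
    remediate("/")
```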
But when you're doing automation, there's the fallacy of getting caught applying band-aids all the time, and I have a small example for this. The feature team that we work with recently released a new feature, and ever since this feature has been in production, a lot of servers are down. Our job as SREs, for example, might be to look into those servers, figure out the root cause of why they are down, and make them available again. What we eventually figure out is that these servers are down because the disk is full. So we look into why the disk is full, and we figure out: oh, there is a new log entry that makes the logs way too verbose, and that fills up our disk space. Well, the fix is quite easy, right? We just truncate our logs and we're good again; it should be like 20 lines of shell script automation. But it doesn't really fix the issue. It's not fixing the root cause, it's fixing the symptoms, the symptom in this case being the logs running full. What we should do instead is go to the source code of this new feature, find the offending log line and, for example, reduce the severity of that log line, so the actual root cause of the incident is fixed and not just a band-aid applied.

When we work as SREs, another core principle and big part of our job is doing risk management and evaluating risk. When we talk about SRE, it's usually about very large systems. For example, I work with millions of servers worldwide, and if you have a couple million servers, then a couple ten thousand servers are down at any given second. It's inevitable, right? With an error rate of one percent across four million servers, at least a couple thousand are offline at any moment, for whatever reason. So we have to accept failure as something normal. It's not bad if something goes wrong; it's just important not to go down again for the same reason.

So we generally define error budgets, and we'll talk about error budgets in a second, but what I really want to show you first is this triangle of sadness down here. If you really optimize for cost, then sure, you can make your infrastructure super cheap, but it's not going to be very reliable, and you aren't innovating much, because if you don't have budget, you can't innovate. If you make it really reliable, on the other hand, it's going to be fucking expensive, and at the same time you're not going to ship new features, because every new feature, every change, could introduce risk, so you can't really release anything anymore; not optimal either. If you go in the other direction, you end up in the other extreme: always living on the cutting edge, with a system that's majorly expensive and not reliable at all. So you have to find some middle ground, and you do this using risk assessment criteria and error budget measurements.

For error budgets you usually use service level objectives. I talked about service level objectives and service level indicators in my presentation last year about incidents and alerting, and I literally pulled this slide from last year's presentation file: a service level objective is a target value, or even a range of values, for a service level that is measured by an SLI, a service level indicator. And a service level indicator is defined as a carefully defined, quantitative measure of some aspect of our service.
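As a back-of-the-envelope sketch of how an SLO turns into an error budget, assuming a 99.9% monthly availability target (an assumption that happens to yield roughly the 43 minutes that come up in a moment):

```python
# Error budget for a monthly availability SLO. The 99.9% target and the
# release numbers are assumptions matching the examples in this talk.
SLO = 0.999                       # service level objective: 99.9% availability
MONTH_MINUTES = 30 * 24 * 60      # a 30-day month

error_budget = (1 - SLO) * MONTH_MINUTES   # 43.2 minutes of allowed downtime
releases = 4
downtime_per_release = 10                  # minutes, taken from the SLI data

burned = releases * downtime_per_release   # 40 minutes already spent
remaining = error_budget - burned

print(f"budget: {error_budget:.1f} min, burned: {burned} min, "
      f"remaining: {remaining:.1f} min")

# A fifth release at ~10 minutes of downtime would overshoot the budget
# (40 + 10 > 43.2), so it gets postponed.
print("fifth release fits:", burned + downtime_per_release <= error_budget)
```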
To make it more graspable and more understandable, I prepared two small examples. Let's say we're pretty much at the end of the month, our feature team wants to do a release next week, and our job is to figure out: should we do this release? Is it safe? What we could do is look at the outages we had this month and overlay them with the releases we had this month, and we quickly figure out that we had four releases, each costing approximately 10 minutes of downtime. So we've already burned 40 minutes of our error budget, and if we did a fifth release next week, we would probably breach this month's error budget. So we're not doing that, and we have to work with the feature engineering team to make sure their releases don't break production that often, or for that long, anymore.

The other example I want to make is the opposite: we're in the middle of the month and we've only had like three minutes of downtime out of our 43-minute error budget. So what do we do? Do we call it a day and say: well, we did it? Sounds like an idea, but it's actually bad practice too, because you don't want to train your customers to expect more than you promise. If you consistently exceed your service level agreements and you're always better than what you say you are, a customer might expect a higher level of service than what they're actually paying for. So what can we do? Can we ship that super critical, nasty release that we've been pushing ahead of us for the past three months? No, we don't do that; we try to make releases safer and don't use our error budget for unsafe releases.

What we could do instead is run a chaos experiment. We can use our error budget to do some proper chaos engineering, and we do chaos engineering to build confidence in our services. This is what I work on on a daily basis: we intentionally break production, and I don't mean staging, I mean production. I often see that people are very reluctant to break production, but do it on production, so you can actually assess what the impact looks like in a real outage. We do chaos engineering by building a hypothesis, like: we have three availability zones in three different data centers; if one of those data centers ever fails, there are still two availability zones left, so our service should continue working. We evaluate this against real-world events: how could we lose a data center? We could have a power outage, or a fiber cut between the data centers, or someone accidentally unplugging the interconnect cable between two data centers. Then we evaluate how we can implement this. The power outage might be a bit outside of what we can do; we can't go to the data center operator and ask them: hey, could you shut off your data center, please? Not really practical, right? So we might go with the network outage: we simulate a fiber cut by literally unplugging the fiber from our server. We run the experiment, we break production intentionally, and I guarantee you, you learn a shit ton about your service.

Last year in my presentation I asked the question: are incidents a good thing? And we came to the conclusion that yes, indeed, they are a good thing, because incidents help us find the delta between how we think our system breaks and how it actually breaks. We can use chaos engineering to artificially create incidents, and we can even go into more detail about how we think the system might break and learn in even greater detail. And it goes without saying that we have to learn from failure, right? If we run chaos experiments, we expect something to break, and if it breaks, we have to learn from it and do better next time.
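To give the experiment loop a concrete shape, here is a minimal sketch: hypothesis check, fault injection, guaranteed rollback. `service_is_healthy` and `availability_zone_offline` are hypothetical stand-ins for real monitoring probes and fault-injection tooling:

```python
import contextlib


def service_is_healthy() -> bool:
    """Steady-state probe, e.g. the error rate staying below the SLO
    threshold; stubbed out for this sketch."""
    return True


@contextlib.contextmanager
def availability_zone_offline(zone: str):
    """Simulate a fiber cut to one availability zone. The finally block
    guarantees the fault is rolled back even if the checks blow up."""
    print(f"injecting fault: isolating {zone}")
    try:
        yield
    finally:
        print(f"restoring {zone}")


def run_experiment() -> None:
    # 1. Hypothesis: with three availability zones, losing one should
    #    not take the service down.
    assert service_is_healthy(), "steady state not met, abort experiment"
    # 2. Inject the real-world event and observe.
    with availability_zone_offline("az-1"):
        if service_is_healthy():
            print("hypothesis held: service survived the loss of az-1")
        else:
            print("hypothesis failed: post-mortem, fix, run it again")


if __name__ == "__main__":
    run_experiment()
```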
A really important thing, and something that's very dear to my heart, is blameless and sanctionless post-mortems. The sanctionless part is often overlooked. I hear a lot of people saying "we do blameless post-mortems," and that's a great start. A blameless post-mortem would be: hey, we pushed a bad code path yesterday and it caused an outage; no one blames any individual engineer, the feature team may have released it, but whatever. And then management steps in and goes: well, that's great, let's make sure that never happens again. How do we do that? Well, next time you want to release something, here's the list of things you have to check before releasing. That would be a sanction, and now no team wants to do releases anymore, because every team that wants to release something has to fill in this very extensive list of questions to verify that they're actually safe to deploy. The sanctionless post-mortem, on the other hand, would dig even further, asking: how could this bad code path be pushed to production in the first place? What led to it? And maybe they figure out there wasn't very good communication about the feature release and its timeline, and maybe there was too much pressure on the devs to ship the feature quickly, so they released something into production that wasn't properly tested, and maybe we have to improve the communication between product teams and management.

Another thing I do on an almost weekly basis is reviewing old post-mortems and finding common themes across them, because you might work an incident today, another tomorrow, one the day after, and then in two months you work an incident and you're like: man, this failure looks awfully familiar, but I can't really remember what happened. So it's good practice to review old post-mortems and learn from older failures. You can also use this to identify common themes: maybe you identify that, for whatever reason, every three months something major breaks in your caching infrastructure, and then it's time to look into your caching infrastructure and why there's a recurring outage every three months. (A small sketch of what this kind of theme mining could look like follows below.)

The last big topic I want to talk about is reducing organizational silos. This is very often overlooked and sometimes really hard, because this is your product, and your SRE engagement is to work with those two microservices. Those two microservices run on some platform, but the reality is: the two microservices are operated by two different teams, then there's your platform engineering team running the infrastructure, then there's the customer, who works with your support team, and they're all in three different business units. And now you're here and you suddenly have to talk to salespeople, and no engineer ever wants to do that, right? So it's really important to emphasize continuous learning across multiple business units. You have to feel confident talking to people in other business units and departments, and sometimes even in geographically different locations. As I said, promoting a collaborative culture is important. And something that's also often overlooked, especially by engineers: use empathy. You're working with people, not just servers; you talk to people and you work with people a lot, so be nice to each other. Be excellent to each other.

If you look at the similarities between platform engineering and SRE, it's pretty much the DevOps philosophy, so not much to talk about there. There are certain differences, but I don't want to go into the details of everything here.
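Coming back to the post-mortem reviews from a moment ago, here is the promised toy sketch of that theme mining; the records and tags are invented, and in practice they would come from an incident tracker:

```python
from collections import Counter

# Hypothetical post-mortem records; real ones would be loaded from an
# incident tracker, with tags assigned during the post-mortem review.
postmortems = [
    {"id": "PM-101", "tags": ["caching", "config-change"]},
    {"id": "PM-117", "tags": ["caching", "capacity"]},
    {"id": "PM-130", "tags": ["deploy", "caching"]},
    {"id": "PM-142", "tags": ["dns"]},
]

theme_counts = Counter(tag for pm in postmortems for tag in pm["tags"])

# Themes that recur across several incidents, like caching breaking
# every few months, point at the infrastructure itself rather than at
# any single incident.
for theme, count in theme_counts.most_common():
    if count >= 2:
        print(f"recurring theme: {theme} ({count} post-mortems)")
```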
To wrap up this presentation, I want to go through a set of misconceptions that I come across often.

First of all: "DevOps is a job title." You wouldn't believe how many recruiters reach out and ask if I want to join their fancy new startup as a DevOps engineer, and I'm just like: what do you mean, what does a DevOps engineer do? Do you mean SRE? Do you mean platform engineering? What do you want? So no, DevOps is not a job title.

Another common misconception: "SRE is just a rebranding of operations." Yes, SRE involves a lot of operational work, and you have to be comfortable doing operational work, but it's not just a plain rebranding: it's using software engineering practices to run production, using your software engineering mindset to approach your production infrastructure. "Infrastructure engineering" goes in the same direction.

"SRE only fights fires": no, that's not what I do; I'm rarely doing incident response.

Not a common one, and I don't even know where this comes from: so many people tell me "I'm working as a Kubernetes SRE," and I'm like, oh, that's great, what are you doing? "Well, I'm developing new features." Then you're not an SRE, I guess. For some reason people think Kubernetes equals SRE; I don't know why, and it's not true.

And last but not least, I think it's the last one: "platform engineering is just for cloud infrastructure and only needed in big companies." Also not true. As cloud platform engineers, yeah, we touch cloud infrastructure a lot, but we also do other things, like developer productivity. And I guess that's it; now we have time for some questions.

CD, thanks very much for the interesting talk. Who wants the microphone? I know it's early. I can't believe it, nobody? Please, guys, be confident, that would be a first. Or maybe everyone's just turned off because I told them their job title is not real. Ah, there is one.

Yes, thanks for the talk. You mentioned not to use the job title DevOps; what about the job title DevSecOps, what do you think about that? Well, DevSecOps is, if I'm not mistaken, dev, security and ops, and I think security should also be a common practice in feature engineering teams, so I would say it's also more of a methodology than an actual job title. If you're developing software and you're not aware of security, then good luck finding another job.

We have one question over here; the microphone is traveling salesman-optimized. So, some of the problems that we often faced with our production system were actually caused by the way we built our core product. In my experience, it's not always possible to build a reliable platform without actually building the product right. How much time do you spend looking at the production code and figuring out, oh, we are keeping HTTP requests open for too long and that stalls all our request handlers? I can't put a percentage on it, but I would say that's a huge part of the job. Maybe half of the job is fighting fires, fighting incidents, and the other half is actually looking at the code and trying to improve it and make incidents go away. Thanks.

If you don't feel comfortable asking questions here, I put my socials on the slide; you can always reach out and DM me. Maybe don't DM me on Twitter, though; try Mastodon. Twitter is like unusable.