Ladies and gentlemen, Richard Hartmann will now talk about the social aspects of change.

Thank you very much. You might be wondering why we have a social talk in the monitoring devroom. It will be online, so you don't have to take pictures — but you can. The main reason is that I actually learned and realized most of what I'm going to talk about during a transition phase I went through recently. I switched jobs last year — no, well, by now it's even longer, because it's 2017. And I came into a very, very old and organically grown organization, in which a lot of teams had different tools and islands of data. The monitoring tools didn't really talk to each other, and this just kept burning people's time and work. And that was not a good thing to have.

So I looked at Prometheus at the time, shortly after joining SpaceNet. Some of you have already been in talks where you saw some of the advantages of Prometheus, so just to reiterate the selling points really quickly: it is built for highly dynamic environments, but it can also do static ones quite nicely. You don't have a Graphite or Zabbix model, where you have trees of stuff which are then locked into those structures — you just have labels. So you can do a query by data center, one by production or testing, one by customer. It's all within the same data set, and you can slice and dice the data whichever way you need it; there's a small sketch of this below. If anyone knows R: PromQL is basically R on steroids. It does automatic label matching and basically does what you wanted to do, by magic. And it's quite efficient and quite simple.

So when I looked at it, basically it went this way: at DebConf, we did LibreNMS. For FOSDEM last year, I also wanted to do LibreNMS. Someone from Google told me: OK, you know what, try Prometheus. I did, and it was bliss. I tried it and I just thought: yes, this is the future. We need to use this, because this is better than anything else I've ever seen before in regards to monitoring.

(Do I have to speak up? OK, yeah, feel free to tell me. Is this better? OK, sorry. You're more than welcome to say so. I may look a little funny, but if it works for you, that's fine.)

So the biggest challenge in all of this was not the technical part. Because technically speaking, most of this is really simple: you have a few exporters, you write a few scripts, you get people to use and get used to PromQL and Grafana, and things just work. That is not the hard part. The hard part is to get people to actually accept this change, and to realize that it's in their own interest. How I did this and how this worked — that's what this talk is about.

The most hidden but also most obvious thing is: most of the time, in a classic operation, the incentives run counter to change. Because if things break, people might have to do overtime, they might get less bonus, they might get chewed out by their boss. So change really is hard, both because it is genuinely some work and because people resist it when their incentives run counter to it — unless you actually have processes which embrace change and just take it as a fact of life. Then you don't have those holy cows anymore. You just accept that all the infrastructure is in constant churn, that you have to change all the time, and you automate away all the icky bits of it.
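Here is the promised sketch of that label model — a minimal, illustrative PromQL example. `up`, `rate`, and `histogram_quantile` are standard Prometheus; the label names (`datacenter`, `env`, `customer`) and the histogram metric name are assumptions, since label schemes are site-specific:

```promql
# All scrape targets currently down in one data center, production only
up{datacenter="fra1", env="production"} == 0

# 95th-percentile request latency over 5m, sliced per customer
# instead of per host -- same data, different label to aggregate by
histogram_quantile(0.95,
  sum by (customer, le) (rate(http_request_duration_seconds_bucket[5m]))
)
```

The point being: nothing about the metric locks you into one hierarchy; you pick the dimension to slice by at query time.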
And of course, there is always the trade-off. There are a few people who say: OK, it's working now. Why do I have to do extra work to get there? It has worked up to now — why change anything? And often this isn't really on the technical level, it's on the emotional level. Those people perceive their standpoint to be right; they perceive their operations to be good. And that is valid for them — even if it's not valid objectively, subjectively it's still valid.

Also, if you change the monitoring for whatever is making your company money — be it stocks, be it customers' systems being up, be it rubber duckies coming out of production, it doesn't matter — there are a few systems which really, really impact your business, your co-workers, everything. For those systems, their monitoring, and everything that keeps them running, unless you have huge teams, you have to have a certain phase of running in parallel. Because if anything breaks, you need to be able to change back, and quickly. You can't say: OK, next week there will be a scheduled change, we'll do it then — because your company might keel over and die in the meantime. So you really need the ability to run critical systems, both old and new, in parallel, to lower the resistance to the change you're proposing.

For the operations people, the largest, best currency you can pay them in is their time and sanity. Who here knows the definition of toil? Except for you guys, OK. So, toil, as defined up there: basically, toil is any work which you have to do manually, which you have to do repeatedly, and which scales more or less linearly with the actual size of your deployment — be it rubber duckies, be it virtual machines, be it physical servers, it doesn't matter; it scales along the same lines. And this you have to automate away. Because computers, given the same input, produce exactly the same output — and humans are not good at that. And if people are always busy, if the pager goes off seven times during the night and they still have to come into work because there is no one else to do it — if they're ill, they still have to come in because there's no one else — if they're always running from A to B, they will not have time to sit down and actually engineer the systems they're supposed to be engineering.

Obviously, you have to keep the old stuff working, sure. But this phase — maintaining the old stuff while introducing the new stuff — is the critical phase where you either gain momentum for whatever you're proposing, or you do not. And if you don't gain momentum, you're pretty much starved, and people just won't accept the change. So in this phase it's really important to keep the overhead low. You can't just say: this is the new system, and now you've got double the work. That doesn't work. You may be able to get away with 10 or 20% extra work, but ideally people should have less work pretty much immediately. Choose a few low-hanging fruit, attack those first, and show people that there's less work for them. And the best things to focus on, again, are all those manual tasks which are repeated every single time. If you've got Puppet or Ansible or SaltStack or Chef — it doesn't matter, you know what I mean — automate these things away, because computers are better at this; a small example follows below. Once you're able to show that you can free up time, and that you're reducing fatigue, people will start to listen.
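As one Prometheus-flavoured example of automating such a repeated manual task away: instead of hand-editing the list of scrape targets, you can let Puppet/Ansible/whatever write out target files and have Prometheus pick them up through file-based service discovery. A minimal sketch — the paths and the job name here are assumptions:

```yaml
# prometheus.yml (excerpt)
scrape_configs:
  - job_name: node
    file_sd_configs:
      - files:
          # Config management drops one JSON file per host group here;
          # Prometheus watches and reloads these without a restart.
          - /etc/prometheus/targets/node-*.json
```

Each JSON file just lists `targets` plus the labels to attach, so adding a machine in config management automatically adds it to monitoring — one manual step gone.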
Now, diving back into the monitoring side of things — this is really important. Brian talked about this a little bit, and I think Fabian as well, but I wasn't there for that talk: if it's not actionable, it's not an alert. By definition, it does not belong in your alerting system, period. So if it's just "well, this one system has high latency and there's nothing to be done about it" — who cares? Sleep. Don't care about this.

If it's not urgent or imminent, it's not an alert either. Because if you can do it next week during business hours, you'd better do it next week during business hours, and not at night or on the weekend. So you need a way to be aware of it, but not handle it now — just put it on the stack, and at some point you'll get to it. Obviously, you have to do some sort of planning: if your disk space, your server capacity, whatever, runs out, you should be aware of this well before that point, so you can order new hardware, get budget, have a new room in the data center set up for you — whatever your scale is. Do this well in advance, so you are not in a panic, forced to do it on a weekend, once you hit the limit of whatever resource it is. A small sketch of such a rule follows below.

If there's no playbook which tells you what to do at three in the morning, when you're half asleep and half drunk, it does not go into production — and get buy-in on this from your superiors.

So, who knows what an SLO is? Not you guys. Basically, the difference between SLO and SLA: service level objective versus service level agreement. The agreement is the actual contract which goes out to a customer — or maybe even internally, if your company is large enough. The SLO is just the objective you want to achieve, with no actual penalties if you don't achieve it. So if you — or whatever team is creating the service — cannot actually define what the SLO should be, what you have to alert on, what running well and running badly look like: sorry, this also does not go into production. Because this is what enables the people who actually wake up at night to be really quick about their job, fix it, write the postmortem, and get back to sleep. And everyone wants that.

Oh, and we also have a slide for this. There was an incident this guy was involved in: a coworker — not him — set a wrong flag in one of the server configs, and all of a sudden one of the servers started accepting outside mail. The spammers started a slow ramp-up, and it was done really well: they sent a few emails, then more emails, more emails, more emails. And once they were sure that the system could handle the load, that it was well connected, that it had a good mail reputation, that it could really go at volume and nobody had noticed — then they started spamming. And obviously they did it at night on a weekend, local time zone, because they really knew what they were doing. From a technical point of view, what they did was impressive. Anyway: it took him less than 30 seconds to figure out what was wrong, as opposed to ages. And this is something which spread within the company, because all of a sudden several people who were involved with this could see: OK, this is actually a benefit for me. If I'm on call next time and the same thing happens, I'll be quicker, and I'll get back to sleep — or to watching videos, or whatever — more quickly. And this plays to people's incentives.
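To make the "know well in advance" point concrete, here is a minimal sketch of a Prometheus alerting rule that extrapolates disk usage and files a ticket days before the disk is full, instead of paging anyone at night. `predict_linear` and the node exporter metric are real; the severity-label convention and the playbook URL are assumptions:

```yaml
groups:
  - name: capacity
    rules:
      - alert: DiskFullIn4Days
        # Extrapolate the last 6h of free space 4 days into the future;
        # require the trend to hold for 1h so a short burst doesn't fire it.
        expr: predict_linear(node_filesystem_avail_bytes[6h], 4 * 86400) < 0
        for: 1h
        labels:
          severity: ticket   # handled next week in business hours, not at 3 a.m.
        annotations:
          playbook: https://wiki.example.com/playbooks/disk-full
```

Note the annotation: if you can't point at a playbook there, that's exactly the "does not go into production" rule from above.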
Someone I know once said: you can talk for hours about technical stuff — be it a power supply or a function or whatever — with an engineer. But try the same with the CEO. He will literally not care — unless it's a very weird CEO; in the general case, they will not care. Try it with marketing: they don't care. Try it with accounting: they don't care. And why would they? It's outside of their world; they really don't care.

But what do they care about? Your managers care about revenue and about process execution — because everyone has to follow the process, and it's important that everyone does it in the same repeatable way, because if someone dies or quits or gets ill, you have to have good processes which are executed well. The architects care about clean design, and about the process actually being defined in a way that makes sense; the manager doesn't care if it's not that good, as long as it's executed well. Product and service owners want powerful dashboards; they want insight into what they're actually trying to achieve, how many people are using their stuff — that's what they care about. Team leads, as some of you know, care about morale and about quick execution of whatever has to be done right now. And the actual operators, who are in the trenches fighting fires or doing engineering, care about toil and about sleep — and that's valid.

Each of those concerns is valid, but these people all look at the whole thing from a different angle, and they each hold a different part of the big picture. In theory, most of them actually agree on what they would be willing, able, and needed to do — but they can't really talk to each other. So if you want to bring change into the whole system, into the whole company, on all levels, you actually have to tell people about the things they care about. Play to their intrinsic motivations; tell them what they need to hear — but never lie about this. Be open about it. I mean, I went to my company and told them, OK, this is what I'm doing, once I'd done it, and they were like: yeah, good. So be really open about how you do this, because people will start noticing that you tell everyone something different — but as long as it's all the same picture, just from different angles, that is fully OK.

So yeah, put the big picture on the proverbial wall — or take an actual wall and put the picture there — of where you want to go, where you need to go. Show people the parts they care about, so they actually see their own motivation in there. Get buy-in from your managers and whoever has the budget, the time, or the accounts under their control, because obviously you need their buy-in. And in the future, when you have to make decisions, align them with this big picture. If everyone has the same big picture, and everyone agrees on it, they will align with it without even consulting you or anyone else on your team: they see it and they think, OK, I'm here, I need to go there, how should I go? And if they can go like this or like that, and there's the big picture — they'll go like that. And this really, really works. It's amazing to see how, all of a sudden, people from completely different teams — let's say, UNIX versus Windows — agree on stuff and start working together without even realizing it, both just trying to get to the same point and doing the same things, because they agree on where they want to end up.
There's also the thing about leverage, which again is somewhat specific to the Prometheus use case — but try to find your own equivalent and build it into how you do the change, how you approach the whole thing. One of the advantages of Prometheus is that you have one system which has all the information about pretty much everything. We can literally graph or correlate power usage against load — per server, but also per service. We can do outside temperature versus whatever.

In the old company, we had a case where we had one line which went down time and time again, and we didn't know why. After months, we figured out that if the outside temperature changed very quickly and we had high load on the 10-gig links, they had packet drops — some part of the fiber came under strain while getting colder or warmer, and that was causing the outage. Try correlating that when you have ten different islands of data and twenty teams and no one talks to each other. Yeah, have fun. It was fun. Or: I release a new feature, and my whole data center draws more power. That's kind of fun to see — if you can see it. And there's a lot more.

This one system also allows us to have an oracle. You don't only get the technical overview of what is broken right now — and I don't care about what was broken five minutes ago. Maybe for context: say the power to my data center was down, the diesel didn't work, and my services are down — OK, in that case I might want to know what happened five minutes ago. But in the general case, if the line comes back, I don't really care about it anymore. I want to know what's broken now — not an hour ago — because that's what I have to fix. The same system also gives us dashboards for drilling down, for looking further into the system or back in time.

And now we come to something which was relatively unexpected at first. We are actually able to generate PDFs for customers from the system, and they see exactly the same graphs and exactly the same information as the operations people. The service managers have the same PDF, and they hand it to the customer — well, it's done automatically, but that doesn't matter. And all of a sudden the service managers see their toil reduced, because they don't have to copy and paste random crap from ten different systems, make screenshots, and send them by email. It just happens, without them having to do anything.

Sales is pretty happy too, because all of a sudden I can say: OK, for this type of product, product X, we had not a single SLO violation for any of our customers over the last year. And I can do this by customer, globally, by data center, by time — doesn't matter. So all of a sudden sales has a huge bouquet of things to choose from to tell customers or prospective customers: we are this good, and we can prove, across all our operations, that we didn't have a single SLO violation in the whole company. That is a really good argument for customers, because they see that you actually care and can prove it. You can also export usage for accounting.

And here's another example: we had a case where we needed file system usage against email accounts, for pricing one of our email offerings. Normally, the product manager got some Excel sheets from one team and some user stats from a different team, and he had to mash it all together. It took ages, and then he could calculate the price.
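A purely hypothetical sketch of what that single query could look like — both metric names and the `customer` label here are invented for illustration, since the real metrics were internal:

```promql
# Average mailbox storage per paid account, per customer:
# storage bytes joined against account counts on the customer label
sum by (customer) (mailbox_storage_bytes)
  / on (customer)
sum by (customer) (mail_accounts_total)
```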
Now it's literally one query within Prometheus, and he has all the information he needs and can base the pricing of the next product on it. So — at least in this specific case; you'll have to find the equivalents in your own environment — you get more and more cases where you can use the same source of truth for more and more people, and more and more parts of the company actually start caring about this. And once you see someone from product management or accounting get genuinely excited about monitoring, that's quite interesting. And this is something which I'm really convinced I could not have achieved if not for really trying to get everyone on board and showing everyone: this is in your personal best interest, because reasons. Reason being X — oh, that's not supposed to be in there. And thank you.

Questions? I hope so. Yes — yes, you do. Or I can repeat. [Question:] OK, it all sounds really optimistic. What if you inherited the project totally late, and you know you can improve a lot, you can automate a lot, and you just don't have the time? You have to deliver something, at some point you have to do the manual labor, and you know you could automate like 90% of it. What do you do then?

So the question was: what do you do when you get a project, you're way too late already, and you have to do manual work and just power through? If that's the case, then the answer is: you have to do the manual labor and power through. But if it's at all possible, you should at least try to talk to other people and get their input on what they think would be useful for them. You might have a project which touches some other team. Maybe if you tell them, OK, I'm trying to do it this way or that way, what do you prefer? — they tell you, I'd like this — and you tell them, OK, I'll try to do that, and I'll even give you this feature on top. Maybe some of them start to work with you and give you input or data or whatever to help you. And yes, that's optimistic. If it doesn't work, it doesn't work. But then either you stop trying, or you quit your job.

Next question? Anyone? Going once, going twice. Thank you, Richard.