Hi. My name is Daniele, and I work at Notonthehighstreet, an online marketplace based in Richmond, in London. This talk is about being on call: what to do when an alert wakes you up in the middle of the night, and what you can put in place beforehand so that it hurts less.

So: the alert goes off at 1:00 a.m. Before you get to your computer, you want to make sure you are awake, fully awake: make some tea, make some coffee, work off the sleep a bit. You want a transition from your sleeping self, or from your working self if the incident happens during office hours, to your incident self. You don't want to be the person who is developing in one window and fixing production in the other; the incident deserves your full attention.

The first thing you do is read the alert that woke you up. Really read it, at least a couple of times, possibly more. At this point you don't yet know where to look: which system broke? Where can I find the error logs? Where can I find the monitoring data? You want to gather as much information as you can, until you can basically be sure of why this alert triggered, of what woke you up.

At this point you probably have enough information to assess the impact. What is going to happen because of this problem? Who is not going to be able to do their job? Who is not going to be able to know something they want to know? And be nice: if people are impacted by the problem, inform them. Most websites have status pages; for internal systems you probably have an email tool.

There is a question you might find yourself asking a lot, which is: why? What is the real, deep cause of this problem? This is not the right time to answer it. If you dive in with your developer mind and try to find the real root cause of the problem, it will take a very long time, and you cannot quantify how long. So don't do it; it's not productive at this time.

As you find out information, and as you start acting on the system, log what you're doing. Here it's on a chat app, Slack, but you can also just open a blank email to your team and start typing out what is happening.
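(A minimal sketch of what logging your actions to a chat channel can look like in Python, assuming a Slack incoming webhook; the webhook URL and the example messages are placeholders, not the actual setup used in the talk.)

```python
import datetime
import requests

# Placeholder URL; a real one comes from Slack's "Incoming Webhooks" configuration.
INCIDENT_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"

def log_incident_update(message: str) -> None:
    """Post a timestamped note to the incident channel so there is a timeline."""
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%H:%M:%S UTC")
    requests.post(INCIDENT_WEBHOOK, json={"text": f"[{stamp}] {message}"}, timeout=5)

# Example usage during an incident (messages are illustrative):
# log_incident_update("Alert: GA import job failed. Reading the error logs now.")
# log_incident_update("Impact: tomorrow's traffic-source report will be missing.")
```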
This log is invaluable information, especially when you go back over the incident later. Once you have enough information, you can probably come up with one action, or a few actions, that will limit the impact as much as possible and are safe.

If we scroll back a bit (I'm not sure if it's readable), you will find that I did not follow my own advice: last week I ended up forking a library at 4 a.m. and trying to patch it. It did not solve absolutely anything, and that's because instead of focusing on taking the smallest possible action I started asking myself why. That was not useful at the time. The action I should have taken, which we only took in the morning, was just to skip that data integration step. True, we wouldn't have known the day after what the sources of our traffic were, but at least we would have had the rest of the data in time.

Once you have taken enough steps to limit the impact, be nice: inform the people who are impacted. Then get back to sleep, because you want to be fresh the next day.

Usually you wake up the day after, and the first thought in your mind is: this was all pretty stressful, I don't want it to happen ever again. At this point you can really ask yourself why this error occurred, and the best way to do it is an RCA, a root cause analysis. There are best practices and extensive literature on RCAs, so I'm not going to dive too deep, just one slide: put your detective hat on, gather all the information about what actually happened during and before the incident, find the root causes, and be sure to leave enough time at the end to decide on some actions that will mitigate those root causes.

It's very easy in this case to try to blame someone. Don't do it. I'm going to tell you a story about a nurse in a children's hospital. She gave the wrong drug to a little child, and the child almost died. An inquiry was opened and there was a proposal to fire the nurse, but then the commission on the inquiry dug a bit deeper, and they found that the drug she should have administered and the drug she actually administered were next to each other in the same cabinet, with similar labels. They also found out that the nurse had been working ten hours straight, and that there was nobody to double-check which drugs she was administering. So don't allow yourself to focus on the fault of one person; always look at the context.

From an RCA you usually get some useful lessons for the future. You are probably already familiar with this one: be really careful where your systems talk to a third party, because there communication is more scarce and more easily ignored; and watch out for points of friction in internal communication as well. The root cause of last week's failure was that the GA reporting API had a time-on-site field which they renamed to session duration. The old name was deprecated in 2014, but they only started enforcing the deprecation and failing API calls last week. Another insight is to really care about your error messages: keep them up to date, and make sure they include everything that can help you during an incident, such as checklists, lessons learned, encouragement.

So we took three actions: we fixed the root cause of the problem, renaming time on site to session duration; we scheduled some time to go through all the GA fields we are using and check whether any other of them is deprecated; and we also included in the alert message a specific suggestion not to do what I did, so: just skip the step, don't try to dive into the causes too much.
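(A hedged sketch of that kind of fix: the function and step names here are hypothetical stand-ins, not the actual Notonthehighstreet code. The point is that the alert itself can carry the "just skip the step" advice for whoever is on call.)

```python
import logging

logger = logging.getLogger("etl.ga_import")

def run_ga_import(load_step, skip_on_failure: bool = True):
    """Run the GA import step; on failure, alert with concrete on-call advice."""
    try:
        load_step()  # the real integration step would be passed in here
    except Exception:
        logger.exception(
            "GA import failed. On-call checklist: "
            "1) this step only affects tomorrow's traffic-source report; "
            "2) it is safe to skip it and re-run during office hours; "
            "3) do NOT try to patch the client library at 4 a.m."
        )
        if not skip_on_failure:
            raise
```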
So let's take a step back to 2015, and let's take a broader view. Downton Abbey is a TV show about a wealthy, aristocratic British family in the first half of the 20th century. It is wonderfully acted; the scenery, the costumes, the settings are amazing, and it's really on target for Notonthehighstreet: it's very British. So when we placed a TV advert during the first episode of the season, on a Sunday evening in October 2015, we had so many sales and so much additional traffic that the site went down. It recovered, and then it went down again when the same advert aired on the plus-one channel.

On one side I was sort of lucky, because I was not directly involved with the consumer website; on the other side, the situation on the data infrastructure, which I looked after, was a lot worse. We had basically no data since Saturday morning, because the replica of the production database we used to read from was offline. We were in the process of migrating between hosting providers, and there hadn't been enough communication on when the replication environment would stop. The only person who could really fix all of this mess was our databases and networking expert, our DevOps engineer, and he was on a plane back from Russia. As soon as he landed, late on Sunday evening, his phone rang so many times that by the time he got home his battery had drained.

Why did we care? Two incidents at the same time: one on the consumer website, one on the data infrastructure that we would have used to evaluate the impact of the consumer website incident. And it was a Sunday evening, so the next day was a Monday. Monday is when our data infrastructure sees the most usage, because it's both the busiest trading day and the day where we plan for the week ahead. I do not like Mondays, and this was not a normal Monday. At the time we were a lot more inexperienced than we are today, and we learned a lot from these events. We learned as an organization, we learned as a team, and we changed. So let's talk a bit about the changes we made.

Do you need on call? How do you answer this question? Well, you look at the consequences of an error: who is affected by the service or the data being unavailable or wrong? How much will it cost? How long can they wait for the information? In 2015 we were realizing that our coworkers and colleagues were increasingly dependent on our data infrastructure, especially for decision making. If you have a public-facing service or a public-facing website, you probably want to consider some kind of on-call policy, because you cannot control how much external people depend on your service; you might also have a contract in place, or your revenue might depend on the external service. In 2016 we started offering our partners, the people who sell on Notonthehighstreet, access to a rich dashboard with sales figures, and at that point we had no room left to roll back the on-call policy. But even if you just have an internal service, you should consider on call, because you want your coworkers to spend less time and worry less about checking and double-checking that the services are available; if they spend less time doing that, their daily work will benefit. If you take a step back to Downton Abbey and you sort of know the characters, it's not just about the cook: you also want the assistant cook to be as good as they can be at their job. You might pay for this with a little less control over your priorities and a little less agility, as you need to react to incidents, but in the end it
will be worth it, because you are enabling others to rely on your tools, and your stability will enable their success as they build more and more on top of your data and your tools, in the brilliant, creative ways in which they can use the service you provide. Enabling others: nothing else matters. So it's worth it; we decided it's worth it. How do we make it work? What did we do in the days, the weeks and the months after the Downton Abbey debacle?

Usually the very first, most basic thing is getting an email when a certain program fails. That's the real basic, and that's what we had at the time. Then you can build on this email: you can attach tools that will phone you and wake you up (you can even build that yourself if you want), and you can also send lower-priority alerts and messages to your chat or your internal communications, so that you have a timeline, all in one place: there was this low-priority alert, then there was this high-priority alert, and this is what happened. In this phase you also want to make sure that the person responding to the incident is actually able to do it: make sure your logs are accessible, make sure there is documentation in place, and consider training people with fake emergencies and fake incidents.

The next step up the chain is moving from gathering information only when bad things happen to gathering information all the time. You can start with very basic information: CPU usage, RAM usage, disk usage. Then you can move up and take a broader view: how many web pages are we serving, how many jobs are running, how much data are we moving? Then you can go even higher: how many customers are we serving, how many orders are being placed? At this point you can plug your alerting system on top of your monitoring system. Rather than just getting paged when something breaks, you can say: OK, all CPUs have been at 100% for 10 minutes, maybe it's time for an alert; or, I have only a little space left on my hard drive, maybe it's time for an alert.

An even higher step up the chain is looking at your business data and monitoring that data itself. You're looking at questions like: is 20,000 customers on the site normal for a Sunday evening? Did we receive the data we expected from Google Analytics? Do we have a high rate of traffic without a Google Analytics identified source? This works really well because it's basically an alerting system for both your business and your systems. There is a lot of literature on data quality, sort of tainted by association with some well-known big software vendors; discard the big software vendors, because these concepts are actually general and don't depend on specific technologies.

So now we have lots of checks and alerts: some require immediate attention, some require attention the next day, some require attention the next working day, and you may start ignoring some of them. Don't get comfortably numb. Read each alert, make sure the team reads each alert, respond to each alert, and also examine whether that alert was useful: can you improve it, should you silence it, should you measure something else? Then classify alerts: by system, by kind of problem, by business area, by priority. Ideally every new feature has monitoring and alerting attached, and over time you can use the information you gather this way to guide your decisions, both your technical decisions and your product decisions.
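(A minimal sketch of a business-level check of the "is this number normal for this weekday?" kind, assuming a hypothetical count_orders callable that queries your warehouse; the tolerance and the history window are illustrative, not the thresholds used in the talk.)

```python
import datetime
import statistics
from typing import Callable

def orders_look_normal(
    count_orders: Callable[[datetime.date], int],
    today: datetime.date,
    weeks_of_history: int = 6,
    tolerance: float = 0.4,
) -> bool:
    """Compare today's order count with the same weekday over recent weeks.

    `count_orders` stands in for a query against your warehouse; the
    +/-40% tolerance is an arbitrary illustrative threshold.
    """
    history = [
        count_orders(today - datetime.timedelta(weeks=w))
        for w in range(1, weeks_of_history + 1)
    ]
    expected = statistics.median(history)
    actual = count_orders(today)
    return expected * (1 - tolerance) <= actual <= expected * (1 + tolerance)

# Usage sketch: wire the boolean into whatever raises your alert.
# if not orders_look_normal(count_orders_from_warehouse, datetime.date.today()):
#     page_on_call("Order volume is outside the normal range for this weekday.")
```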
Finally, a very opinionated selection of resources: the blog of Julia Evans; a conversation on Twitter with Charity Majors; the nurse story, stolen from a course on business ethics (I cannot recommend this course enough; even though it hardly covers on call, it's a really good course); and, last, a 2013 book on data quality which still holds up well enough. I just want to say thank you to all the DevOps engineers and developers who have been on call for years and make the internet work.

Thank you for your presentation, and now for questions and answers. You mentioned training people with fake emergencies and things like that; how do you simulate that? Do you actually break something on a staging environment, or what do you do? I actually break production on purpose, during office hours; that's how I do it. Nice. I'm not telling you to do it, but that's actually how we do it in my team. Continuing on this: when you break production, do you also have an alternate backup system? It depends on the breakage. If we are causing a breakage on purpose, we tend not to do something that will actually reach customers: maybe we put a wrong connection string for the database, and then we make it fail before it deploys. We tend not to do anything that will actually impact people. I also look after a lot of ETLs, so you can make an ETL fail in the middle of the day and the data will still be the same data you gathered at the beginning of the day; that's another way to do it, and that's mostly how we do it, actually. Any other questions? It's kind of unrelated, but are those your cats? Yes, the grey one is Estia and the white one is Philo. Thanks, and I know how you feel when they call you. Which kind of software do you use for monitoring your systems and your data? For this layer it's Datadog, they have a booth right outside; for this layer we use a data democratization tool called Redash. It's a wonderful tool and I strongly encourage you to try Redash; it's in Python, by the way. Obviously there are alternatives, as many alternatives as you can think of. OK, do we have another question? If not, let's thank our speaker.
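(A hedged sketch of the "break an ETL on purpose" idea from the Q&A: the environment variable and step name are hypothetical, not what the speaker's team actually uses. The point is a deliberate, reversible failure that exercises the alerting and response path without touching customer-facing data.)

```python
import os

class FireDrillFailure(RuntimeError):
    """Deliberate failure injected for an on-call training exercise."""

def maybe_inject_failure(step_name: str) -> None:
    """Fail on purpose when the fire-drill switch targets this step.

    Set e.g. FIRE_DRILL_STEP=ga_import before a daytime run to rehearse
    the paging, logging and skip-the-step decision end to end.
    """
    if os.environ.get("FIRE_DRILL_STEP") == step_name:
        raise FireDrillFailure(f"Fire drill: step '{step_name}' failed on purpose.")

def ga_import() -> None:
    maybe_inject_failure("ga_import")
    # ... real import work would go here ...
```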