Alright, it's 11:45 and we're about to kick everything off. It's good attendance after what I saw yesterday and last night, so thank you for making it in the morning, and thank you for making it to the keynote. That is awesome.

Today we want to talk a little bit about log management and incident management, but I think what people want to see more of is how to start dealing with your logs in a more reasonable manner: how to distribute them and get a better visualization of what's going on. Part of what's hard about logs right now is that people just see them as a stream of data. One gentleman was telling me it's almost like a fire hose, and they can't distinguish what's an application problem, what's a server problem, and which server is having issues.

For the purpose of this talk we're going to cover quite a few things. We had the vote for something more practical, so we're going to look around some of these products. But first of all, is anyone using Loggly, PagerDuty, or New Relic? Oh wow, everyone. Wait, let's try that again. How many people are using Loggly? How many people are using New Relic? Okay, cool. How many people are using PagerDuty? Very, very good, I did not expect that. And is anyone using anything for public or internal communications, like Slack or HipChat? Everyone's hands go up. Perfect.

So what we want to talk about is what you do when things hit the fan. In other words, how can you make your logs work for you when things aren't going well? I think one thing we can all agree on is that this is stressful; it's not easy and it's hard to deal with.

When we look at this problem, I think it's important to look at some of the other people out there. Rackspace did a study in 2014 entitled "How does downtime impact your bottom line?" People were given a set of criteria and a few multiple-choice questions, and we care in particular about two of those questions.

First, they were asked: what do you think are the greatest causes of downtime? 30% said weather and environment related; I think Amazon went out in New York when they had the blizzard, so that seems fair. 33% said IT or equipment, so server failures; that seems fair too. 34%, which is an increase, said cyber attacks, and I think people need to be more conscious of this going forward. The internet is a scary place and there are scary things happening there; if everyone remembers Drupalgeddon, I think it's an ominous sign of what's out there, so we need to be prepared. However, the thing that surprised me most was that 48% said human error: mistakes that me as a developer, you as developers, engineers, project managers, we're all contributing to that 48%. It was the most selected cause of downtime, and it's human error.

So how do you account for this? The interesting part was that 52% said that most if not all of the downtime they experienced from unplanned outages could have been avoided. The question is: how can you avoid them?
One of the things we've found is that time is of the essence when you're responding: the longer you take to respond, the higher the cost you face as a business.

The other question they asked people was: what does downtime cost you? We're talking about the cost of tech support to get the system back and stable, the cost of going through risk reviews and log analysis to do root-cause work, the debugging to find out what the problem was, and how much time people lost getting pulled off a project. If you've worked in support and maintenance you understand this: while you're working on something, it's pretty easy to get pulled away by an emergency or a fire.

But out of all of these there is one thing that is extremely important to highlight, the most terrible thing that could happen to you as a result of downtime and that almost everyone agreed on. 37%, which was the top answer, said that the cost to your reputation and the damage to your brand matter more than the money you'll lose. As service professionals and service providers, it's important to understand that people have expectations of us, and when things go wrong they expect us to be there. You don't just build a site and walk away: there are cyber attacks, there are DDoS attempts, and there are injections that may occur.

Which brings us to the next point: who am I, and why am I here? My name is Tamani Tundewani, and I'm a support manager at Pantheon. I've been there for about three and a half years. When I started I was the first support engineer, and basically all we did was help people's sites get back online, constantly; it was an ongoing and never-ending process. Over the course of those three and a half years we went from about 2,000 sites to over 100,000 sites, so you can imagine the scale at which downtime escalated for us; the volume was quite interesting to see.

The reason I'm here is that we started to see that it's harder for people to fix things when they don't know what the problem is. A lot of the time people start to panic because they're not prepared for it; it's a natural fear when you don't know something. After doing this for a long time, I can safely say that I think we know what works, we know what we'd like people to start doing so that it works for them, and we'd like people to have a better experience around these unplanned outages and reduce how many of them they have to face.

So, a quick agenda. We're going to cover an overview, then split into two parts. Part one will be log management and basic structuring of your logs. We'll take a look at the systems involved, which include Drupal; we'll be using Composer, the Composer Manager module, and the Monolog module. Once we're there, we'll take a look at New Relic, see how these new modules can impact us, how they can improve our experience, and what we can do right now to improve our logging. Then we'll look at incident management: we'll take a look at PagerDuty, we'll take a look at Slack, and then we'll run a mock incident so people can actually see how you can do it.
One thing I'd like to highlight: this is not for Pantheon, this is not for Acquia, this is not for Rackspace. This is for you. If you're a product owner, a business person, a support team, or a developer, these are things you should be able to do to help your sites in times of need.

So what exactly happens when a website goes down? Here's the typical flow. People see their site down, and the first question is: how do we get it back up? The next question is: what exactly is going on? The usual answer when this happens is "I don't know." Then after some haranguing, people start to email each other, things start to spiral down, and the temperature starts to rise. It comes down to one of two things: is it an infrastructure problem, or is Drupal sad? And again, I think people need to be more realistic in their expectations of software. Software fails. We all use modules, these modules have deficiencies, and we all need to contribute to improving them. Then things get even worse: the all-caps email gets sent, people are yelling at you via technology, it's not a good time, it's pretty terrible. And then finally people usually reach out to us and say, hey, can you help us fix it?

I think we've started to understand why: in Drupal and in PHP we haven't had a history of establishing good best practices for how we log within our applications and how we maximize what we can get back from those logs. Every second that you don't know why your site is down is another second that you're losing part of your reputation, and your brand is taking a hit.

So, the classic question: now that you know your site is down, and I'm a business owner, who's the first person you're going to call? Your project was only commissioned to end at launch, and that's usually when people take the most care. You check your checklist twice, you check your DNS three times, you make sure update.php and drush are working. But after that, what happens? Who do they call? There is no real good answer, and what happens next is panic. Right, so I'm just going to send an email to everyone: the website developer, something to the support guy, who doesn't know either, and some people don't even know who to email, so now you're wasting even more time.

What you need is a plan, and we really need a plan. I don't think you should have any project that doesn't have this as part of the process. You should sit down with your customers, your product owners, and your product managers and start doing things like risk assessments. Come up with a simple matrix: what's going to happen more frequently, what's going to happen less frequently, what is the impact, and what is the impact on our business? Then you can start to prioritize; maybe we only need to care about four things, whatever those may be.

So we're going to start looking at application log management and how you can start doing this. There are really five ways you can bring this all together and use Drupal to do it. First, you need to standardize your logs: you need to get them into a format that can be consumed by other systems like Loggly, Splunk, Papertrail, and others. Second, you need to centralize them. If you have to go to five servers, logging into one server, then the MySQL server, then the PHP server, then the Redis server, it takes too long; again, it's more time you're wasting.
Third, you need to aggregate your logs. I think it's happened to all of us: something breaks in the morning, you find out about it in the afternoon, you try a drush watchdog-show, and what do you end up with? Fifteen minutes worth of logs. The evidence of the problem that existed before is no longer there. You need to aggregate your logs if you're going to do anything meaningful with them. Fourth, you need to analyze your logs: you need to sit down and dig into the data that's there. People usually see it as a stream, but you need to look at it from a different dimension, from the top down. And finally, you need to alert: you need some way of saying, hey, I've exceeded this threshold, it's a problem, and we need to do something.

There are a few problems with the watchdog. One is that it's in a semi-arbitrary format: in Drupal 4.6 there were something like three parameters, then a later release made it five, and only eventually did it settle down and become stable. Also, in case you don't know, the watchdog is no longer going to exist in Drupal 8 and it's being replaced; the question is with what, and we'll take a peek later. You can't save your watchdog searches: you can filter, but you can't keep saved searches, you can't do any reporting, and you can't do post-mortems easily, because again, you get there, cron has run, it's truncated your logs, and it's not going to help you. An important one for debugging and optimization is that you don't get any stack traces, so you don't see what caused the slowness or the problem when it occurred. And finally, the watchdog is not very portable. I'm not sure if anyone has actually dug into the watchdog table, but this is what it looks like: a hodgepodge of variables and replacement placeholders. If you saw a million of these on the screen at one time, what would you do with them? The answer: nothing.
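To make that concrete, here's a minimal sketch of the kind of call that produces those rows, with a hypothetical module name and order ID. Drupal 7's watchdog() keeps the message template and the placeholder values apart, and the values land serialized in the variables column, which is exactly why reporting across a million rows is painful:

<?php
// Drupal 7 watchdog(): the message template and its placeholders are stored
// separately; the {watchdog}.variables column holds a serialized PHP array.
watchdog(
  'mymodule',                             // type / channel (hypothetical)
  'Checkout failed for order @order_id',  // message template
  array('@order_id' => 4231),             // serialized into the variables column
  WATCHDOG_ERROR
);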
So let's look at how we standardize, point number one. In case you don't know, PHP-FIG is an organization, the PHP Framework Interop Group, started by a group of people from the PHP community at a conference (php[tek], if I remember right). What they said is that the PHP landscape in general is out of control, we need to standardize, and we need a way to communicate across the various frameworks. We have people from Drupal; Larry Garfield is our representative, and if you don't know about this you should talk to Larry Garfield. There are people from Laravel, Aura, and basically every modern PHP framework adhering to these standards. Drupal 8 is using one of these standards in the form of PSR-4 for autoloading. We're going to take a peek at Monolog, which handles the translation and standardization process, and then we'll see it in action, so don't worry, we'll cover that later.

There are several accepted standards. PSR-0 has now been replaced, as of last December, with PSR-4; if you haven't tried PSR-4, there is the Drupal Console project, which allows you to generate Drupal 8 modules and themes. Please go check it out, because it will give you the boilerplate of what your code should look like. PSR-1 is a basic coding standard: as people work between Laravel and Drupal and all these frameworks, they need a standard that doesn't jar everyone every time they look at the code. PSR-2 is the coding style guide that goes with it. Within all these different platforms and CMSes we're solving the same problems over and over, and people want a way to standardize so it's the same for everyone, and so we can start to port these logs and other interfaces across.

We're going to look specifically at PSR-3, and I've already mentioned PSR-4. PSR-3 gives us a logger interface that allows us to write logs in a way that is extensible and can be shared across multiple pieces of code and frameworks. It's a really simple interface with about eight methods, and each method maps to one of the standard error reporting levels. In Drupal 7 the levels are kind of arbitrary; with PSR-3 we follow RFC 5424, and if you don't know what these logging levels are, you should find out, because that's the reality that's coming in the near future. The reason it matters is that it lets you standardize on a common language, which is one of the most important things during an incident. You may say it's an error, I may say it's a warning, and we may split hairs until the customer fires us. That's the unfortunate reality.
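As a rough sketch of what that common language looks like in code, assuming any PSR-3 logger such as Monolog (the function name and field values here are made up for illustration):

<?php
use Psr\Log\LoggerInterface;
use Psr\Log\LogLevel;

// PSR-3 exposes one method per RFC 5424 level: emergency, alert, critical,
// error, warning, notice, info, debug.
function record_failed_login(LoggerInterface $logger, $username, $ip) {
  // The context array travels with the record, so downstream handlers
  // (Loggly, New Relic, syslog, ...) can index and filter on it later.
  $logger->warning('Failed login for {user}', array(
    'user' => $username,
    'ip'   => $ip,
  ));

  // The generic log() method takes the level as a value, which is handy
  // when the severity is only known at runtime.
  $logger->log(LogLevel::NOTICE, 'Login attempt recorded');
}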
Monolog allows us to send our logs to various places, and in our case the source is the watchdog channel. The way it processes a record is similar to an ATM transaction: you put in your card, you're asked for your PIN, and if that passes you bubble up and move to the next stage; the ATM asks whether you'd like to take out some money, you agree or not, and it moves on; you carry on until you finally get your money, or you reach a point in the transaction where you say no thanks, I won't be paying the five dollar fee for your ATM today. That's basically what it looks like: a series of operations for you to get your money, and at each point the processing can stop if it needs to, depending on the severity of the errors.

Monolog comes with a few components. A logger gives you a channel you can feed information into, and there can be multiple channels; in this case we're going to use watchdog. Then it lets you attach handlers, and if you haven't seen or tried Monolog before, the power is in the handlers: it has a handler for almost literally everything. Syslog, GELF so you can go to Graylog, a mail handler, a file handler, Slack integration, Flowdock if you need it, PagerDuty if you need it, New Relic and Loggly if you need them. The Drupal module itself didn't actually have Loggly and New Relic handlers, so we wrote them, and it's really simple and easy to do. The formatter handles formatting, so the record can go to the various services as we ship the information off, and finally the processors help enrich the record before it gets sent.

So now we're about to get our hands dirty and take a look at the log system we're going to build. First, application performance. People have already used New Relic, so I'm going to skip over the features; this is how it's going to work: Drupal, to Composer, to Monolog, to New Relic. Next we're going to take a look at Loggly. Loggly is a place where we can centralize our logs, point number two. It's also another place where we can aggregate our logs, analyze them, and start to configure alerts, and this is what the trail is going to look like from within Drupal. We'll skip the features here as well.

Then we'll talk a little bit about incident management. There are various frameworks for service management, ITIL, ISO 20000, and I think some of these are overkill for what we're doing; you just need a basic framework to get started. There are a few in particular: the Incident Command System is one that we use at Pantheon, but you as developers and organizations can use it as well. Heroku has refined it a little, and you can go to their blog where they have some documentation about it. We're going to see a bit of it in action, but what we care about more is getting our logs into a state where we can manage and stream them in a sensible way.

When an incident happens, it's important that you establish some goals. First, you need to verify that an incident happened; remember what we said about language, we need to standardize so we can actually say, hey, this broke, we need to do something. The next goal should be to restore business continuity: your brand, your image, and your reputation were the things people valued most, and it's no mistake that this is up there. We need to reduce the impact of the actual event, determine how the attack or problem happened, figure out how we prevent future incidents or attacks, and improve our security through all of this.

First you have to form a team. It doesn't matter whether you're one person or fifteen people, you need a team.
There are certain functions within the Incident Command System, namely the incident commander: the one point of control, the one central focal point where decisions are made. It's well known that once you start to distribute that, it becomes difficult to manage and hard to get people to consensus. Sometimes hard choices have to be made, and sometimes you need to be the one responsible for making them. That's something people need to become more accustomed to, and it's okay. Sometimes it's not the right choice, but a choice is better than no choice. You need to understand the situation, and you need the information so you can act on it; this is where the logs come in. You need to determine the goals; you can do this internally or with your customers. You also need to prepare, plan, and review. Each of these is something you iterate on: you need regular practice for downtime incidents, you need regular reviews of what data you're collecting, and you need to make sure it's useful.

For our incident management configuration we're going to use PagerDuty. Again, we're not going to cover the features, but one of the most important things to remember is that this is your incident command center. No emails, no texting people randomly; you talk to the people who need to be notified, immediately. It's effective, and we use it at Pantheon at scale: we have people on every continent, and this is how we keep track of websites from Shanghai to St. Louis, Missouri. We're going to use Slack for communication.

And now, time for the live demo. If anyone has a live demo deity they'd like to talk to first, please let me know. That's a no? Alright, cool.

So we're going to take a look at our friend the watchdog and understand the classic problem. Here it is: let's say I want to filter down and find all the errors I get of a particular type. Easy, right? What if I want to find the IP address that generated all of those entries? How would you do that with watchdog? It's a trick question: you can't. So that's part of the problem. Additionally, if we scroll down, we'll notice that we only really have two pages of logs, which is about a hundred entries. This site has been online for a while; it's actually the demo site we use at the booth, these are all modules generally installed from the community, and this is what the basic logs look like.

We're going to use the Composer Manager module, which will manage our dependencies. If you haven't seen it, Greg Anderson, who is a maintainer of drush, did a session on Composer and how you can leverage it within your application. We also have another dependency, psr/log. This is the common interface we talked about earlier, the PSR standard established by PHP-FIG, the same standard that's actually in Drupal 8, so it's no accident: we can get the benefits of this while we're on Drupal 7, before it has officially arrived.

Next, Monolog itself. Monolog allows you to define channels, channels you can get a stream of information from, and in this case we're using watchdog. What's important is that our development doesn't exist in a vacuum anymore; we have a pipeline we develop in, and in each environment we need to manage the sensitivity of our logs. So let's take a look at some logging profiles and see what they look like. Here we go. In our development profile we have a series of handlers, like the steps in that ATM transaction: we log to New Relic if the error meets a certain threshold; in that case, if it says bubble, it will bubble, and the record will also go to Loggly; after that we log it to a stream handler. That's an appropriate level for development, because you really want to see anything that's going to break before it gets to production. For a production system you may choose to limit this down; you have to understand there's a bit of a performance trade-off when you actually end up doing this. If you go and look at the profile itself, here is a list of handlers; remember how I was saying you could export to almost literally anything with Monolog, and here we go.
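The Drupal Monolog module wires those profiles up through its admin UI, but underneath it is just a handler stack. As a minimal sketch in plain Monolog terms, assuming the library's own NewRelicHandler, LogglyHandler, and StreamHandler classes, with a placeholder token and file path:

<?php
use Monolog\Logger;
use Monolog\Handler\NewRelicHandler;
use Monolog\Handler\LogglyHandler;
use Monolog\Handler\StreamHandler;

// One channel, several handlers. Each handler has its own severity threshold,
// and the bubble flag decides whether the record keeps moving down the stack,
// like the ATM transaction moving to its next step.
$logger = new Logger('watchdog');

// Only errors and worse are pushed to New Relic...
$logger->pushHandler(new NewRelicHandler(Logger::ERROR, TRUE));
// ...warnings and up are shipped to Loggly...
$logger->pushHandler(new LogglyHandler('YOUR-LOGGLY-TOKEN', Logger::WARNING, TRUE));
// ...and in development everything also lands in a local stream/file.
$logger->pushHandler(new StreamHandler('/tmp/drupal-dev.log', Logger::DEBUG));

$logger->error('Cron run exceeded its time limit');

In a production profile you would typically drop the stream handler and raise the thresholds, which is exactly the trade-off mentioned above.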
So now we're going to use the PHP API to push this into New Relic from Drupal, and we're going to use Loggly's REST API to send our logs to Loggly from watchdog. If we wanted to, we could add an additional handler, as you see on this page; the list here is only a subset, because not everything has been ported.

Let's go ahead and see what it looks like to actually log one of these events. We saw our watchdog log, and it didn't have much action in there; it wasn't very exciting. But one thing we can do is also store some extra meta information, context, things that actually give value to your logs. This is why the one-dimensional view of the watchdog is difficult. We're going to collect the message, the user ID of who did it, the request URI, the page that referred it, and all this additional metadata (I'll show a small sketch of how that gets attached in a moment).

Now let's head over to New Relic. This is the classic view; as I said before, this is the Mission Bicycle Company demo site we're using, so there are no secrets to hide here. Sorry, it's not a very good resolution. Anyway, New Relic gives us better visibility into our logs. It gives us x-ray vision: it lets us see, at the code level, what's going on and what the problem is. It also gives us a longer-tail view; remember how watchdog limited us, and here we can go to a 12-hour view, and bam, it's as simple as that. As you can see, we can start to track our application performance over time: how did this change?

Over on the left side you'll notice the transactions. Whenever I'm debugging downtime and New Relic is available, this is one of the first places I look. Why? If this one is grossly in excess, x times larger than that one, don't bother with what's down at the bottom; you get the lowest return on investment there because it has the lowest impact. Fix what's on top first and then work your way down. Now, you'll start to notice we have this error rate. Traditionally in New Relic you may not see this, but we start to see these spikes, and that is what's going on within watchdog itself. Again, we get time filters that don't limit us to just what's in the database. We can start filtering by URL, sorting the events by count or message, and even finding out where the event was generated: within the application itself, or within drush; is it a background process, or was it a web transaction?
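Coming back to that extra context for a second: in plain Monolog terms, a processor is the piece that attaches metadata like the user ID, request URI, and referrer to every record automatically. A minimal sketch, assuming a Drupal 7 request where these globals are available:

<?php
use Monolog\Logger;

$logger = new Logger('watchdog');

// A processor is just a callable that decorates each record before the
// handlers see it; the values end up under the record's "extra" key.
$logger->pushProcessor(function (array $record) {
  global $user;
  $record['extra']['uid']         = isset($user->uid) ? $user->uid : 0;
  $record['extra']['request_uri'] = isset($_SERVER['REQUEST_URI']) ? $_SERVER['REQUEST_URI'] : '';
  $record['extra']['referer']     = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
  return $record;
});

$logger->info('Checkout step completed');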
So let's start to look at the power here. Now that we've got these things logging into watchdog, they happen all the time for your commerce transactions, and if something is slow we get an extra layer of visibility. I can see, going from top to bottom, a stack trace, something we did not get before from watchdog. When I'm optimizing and debugging I can start working my way up: this looks like core, this looks pretty standard, this looks okay, this looks pretty normal. Let's see if we can get a history of other events as they may be going on; we've also got the URL, nice. So now we have a history and we can start to see things like, oh, that doesn't look good; no, that doesn't look good at all. So we can ask, hey, what's going on over here? As we scroll down, again we get a nice stack trace, and what we see is that someone wrote this custom module (that was me, by the way), and this is the line that is slow. So now when you have to debug things, there's no more guesswork, no more best-guess estimation, no divining rod. This is exactly what's broken, and you fix it faster, and you fix it ASAP. You can get a much better picture of what your application is doing by logging more information.

New Relic also comes with some handy features. If you wanted to do some debugging, the site is still up, so you can see our slowest average response time and start to get a breakdown. If you have a module such as Views that is killing you, this is an easy way to find out who the main culprit is and track it down; again, take the guesswork out of the equation. When people's sites are down, this is how we find out what's wrong. I think people think there's some secret to what we do, and all we're doing is looking at this information.

You can go back to your list of transactions and start to get more detailed information; if you have the Pro version of New Relic it's really useful. Over here we have our most time-consuming transactions, and if we scroll down you'll actually be able to see a historical trace. Notice how yesterday this problem didn't happen, and then it started happening. What changed? People don't ask this question enough, and I'll tell you what changed: I installed that module that broke. If I wanted to fix it, I would fix that module. And over here we get insights into some slow transactions in detail. Going back to how you know which part of the application is having issues and which part broke: this is a clear way of doing it. New Relic tells us, for each function, for each step within the stack trace, what was called, how many times it was called, and the duration, and again, always go for the highest-value return. Go for what's at the top, because that is probably what's crippling you. If you need to go deeper, New Relic gives you the ability to expand the stack traces, so you can start drilling in: okay, I see most of the time is spent here, so that's where you should start. Then there are things like SQL statements. At Pantheon this is one of the hardest things we deal with: people generate views, Views does what it wants to do with your query, and then it does what it wants to do with your website, so a lot of time is spent telling people to optimize their queries and optimize their views. There are a number of ways to do this: there are views query alter hooks, you could write your own, don't be shy; it's definitely a possibility.
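For what a views query alter can look like, here's a hypothetical Drupal 7 sketch; the view name, display, and 30-day window are invented, and the idea is simply to narrow a scan that the stack traces showed was slow:

<?php
/**
 * Implements hook_views_query_alter().
 */
function mymodule_views_query_alter(&$view, &$query) {
  // Only touch the one view/display the traces pointed at.
  if ($view->name == 'slow_product_listing' && $view->current_display == 'page') {
    // Narrow the scan to the last 30 days instead of the whole node table.
    $query->add_where(0, 'node.created', REQUEST_TIME - 30 * 24 * 60 * 60, '>=');
  }
}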
So now we have a strategy in place: we've standardized our logs and they're heading over into our centralized location, where we can aggregate them. Loggly's power is that it lets you have dashboards. For this site I can get a view of what's happened in the last hour, or I can get a larger picture and see what's happened in the last day. You can build your own custom dashboards; for example, we have some saved searches. I wanted to see when cron events are happening and when the error rate goes beyond what I find acceptable. I'm also able to get deeper insight into the watchdog data and filter down by the various context fields it was given.

Let's take a look at the actual search interface. I'm going to close this, and we'll go through it from left to right. Say you want more detailed information: what percentage of the errors you're getting are of a certain type? With Loggly, it takes our watchdog entries, nicely formatted as JSON, and each of those JSON fields becomes a filter. If you've used Solr before, this is pretty much Solr for your logs. I then want to see, for example, the breakdown of my errors as percentages in a pie chart. I start to get a visualization that I'm getting a lot of PHP errors and that cron is definitely killing me. I think this is something we should add as a widget and put on our custom dashboard, so people actually know what's going on: if this pie chart starts to look crazy, you should go and fix something. So let's call it "watchdog error distribution" and save. Now when we head back to our custom dashboard, I'll refresh so we pick up the new widget. Perfect. I can click here, click on my custom widget, I've got one for my error distribution, and I want to add it to the dash. Excellent. Actually not so excellent, because it's in the wrong position; let me move it down there. So now I can start to build a real-time analytics framework out of Drupal. No Rules, don't force Drupal to do things it shouldn't do; use some of these tools that are available.

Going back to the search interface: this is nice, but we need to filter down and find out what our problem is. Loggly lets us do that with these quick filters, and all I need to do is click. As you see, I can now dig in across multiple dimensions: what's going into watchdog? I'm down to 90 events. Next, oh, that's type, it's the same. So next I want to see, hey, are we getting this error from a particular IP address? It looks like we're getting it from a range, so it's not one user having a problem; it's a system-level issue. Let's switch from that view and go back to the collapsed events; I'm going to expand this so everyone can see what happened to the information when we sent it from Drupal. This is exactly what it looks like: formatted JSON. The formatter from Monolog has formatted it this way so we can leverage it. So I'm going to start just eyeballing it: okay, I see a SQL error, a file_put_contents warning; I really care about this SQL error. Let me find out exactly what's going on. I'll drill down again, and now, with this level of granularity, I can see when this problem began and how often it has occurred.
Snaps. Alright, cool. So now I'm really starting to get better insight into what's going on. If you ever want to drill down into a specific range, all you have to do is click and drag and it will do it for you. Say I'm starting to see a particular event spike: I can go and see the events surrounding it. Is there anything happening 500 times right before this happens? If so, I should fix that first. It's going to think about it for a second, generate a nice graph with all the data we need, and boom. Finally, I want to see whether this is related to a particular IP address, so I expand it, go here, and ah, my range has shrunk considerably. This is the beginning of the event; this is what people experienced; this is what needs to be addressed. If this were an outage while your live site had just gone down, this is how you would find out. Standardize, centralize, aggregate, analyze, and finally, most importantly, alert.

You can do even more. For example, I want to save this search. I'm going to call it "DrupalCon SQL error suckage"; it's a technical term, as someone said. And there we go: you'll notice I now have a custom dashboard tab. This is persistent, so if someone else came in to start working on this, they could see the same thing, find the problems, and fix them if they're required to.

Loggly also has a powerful alerting system. When I see this cron error, I really should do something; chances are I need to run update.php or do a data cleanse, and this is how we can figure out when it has reached the point where I need to act. If we click on the error rate SLA we created earlier, you can see we're given various options on what to alert on: here's the name, the description, and remember those saved searches we were just generating, we can use those to alert on now. You can drill that down even further: if the count of errors reaches a certain number within a certain amount of time, we should let someone know. If it's Christmas Eve and you're having a fire sale, I think you'd like to know before January 1st that something is wrong. We also have the ability to alert by sending an email, or in this case sending it to one of our PagerDuty endpoints, and I'm going to have this check every minute just so we can actually get something for the demo. It would be a bit aggressive for you to do this every minute, but it's up to you and your use cases.

So that basically covers the first part, which is log management. Now we're going to get into a bit of incident management and how you can use PagerDuty as your incident command center. Here's PagerDuty; let me take a step back. Here's your dashboard. This is where you're managing all your incidents: the incidents that are active, the incidents you've acknowledged, and the incidents that are resolved. We also get, again, another quick overview of all the incidents that have happened, so every time someone broke something on the website while they were doing a demo, it alerted here, and this is the history of it. Looking at that website, would you have guessed this was what was going on? No. It's impossible.
The only way you know is now, when you're armed with that information. PagerDuty also lets us define an escalation chain, so we can go from one level to another. You don't need to manage your on-call schedules in a Google spreadsheet; you're better off managing them in PagerDuty, because you can do things like overrides and have other people take over. Next we can look at a different screen and see who's on call. Perfect. You can manage your contact information, which I'll flash quickly so no one starts sending me text messages; hopefully that was fast enough. We also have some additional features like team management: you can group people together, and as I said, we have people in different regions across the world; we have a US team, we have an EU team, we have people in Asia as well, and then we have our friend Kit Wong, who just gets all of our emails.

And finally, one of the most powerful pieces is reporting. The Incident Command System, the project management life cycle, agile, XP, whatever you're using, defines iteration, and you should improve with each turn. In other words, you shouldn't just let things go down and turn a blind eye. You should start pulling reports: how is my team doing? What is the average time it takes for them to resolve an issue, or to address it? What is the time to first response and what is the time to resolution? We can see an alerts report: is there a component within our system that is failing with some regularity? Is it our server? Is it our custom module? Is it waking people up at four in the morning? And we can get an incidents report, an overview of everything that's going on as we see it, and we can export it and filter by day or by week. Down here we have a list of events; again, if you're going to be talking to your customers when they're experiencing issues, you don't want to be filtering through emails and having anecdotal conversations. People should know what's wrong. It will let you be a better developer and give better service. And again, things really went wrong after I added this module; actually, that's pretty terrible. It highlights that the site itself was crawling, it's in real pain, and it's having issues.

Now we're going to head over to Slack. We set up a custom organization so we could show you something similar to what we do and what you could potentially do. Here's the DrupalCon LA room where we hang out; I think there's some chatter over here, Suzanne is talking during my session, with Kit, for some reason. And most importantly, the incident response room. One of the key principles is to centralize command and communication during an incident. When something happens, instead of having to phone someone, email someone, text someone, and send them a carrier pigeon, you have one place to do it. If Suzanne were to join in today, or we had Conrad sign up, they'd be able to see the history of the event as it happened, and one of the longest and hardest pieces is keeping people up to speed with what's going on. As you can see, we already have some existing events. So what we're going to do next is trigger an event and see what the ideal process looks like.
What should ideally happen is that someone gets an email, someone's phone rings, and we get a notification here. Additionally, it should go to the right person, the person who can fix the problem fastest. So part of developing that risk matrix is figuring out who should get which notifications and when, and that way, when you build your escalation chain, you know exactly what broke, and you'll know: hey Dave, you're taking too long to answer, our customer SLA is 20 minutes; or, Dave, do you even know what's going on?

So let's fire an incident off. Dave doesn't know what's going on, and Suzanne is now talking again in general chat during my session. Live demo, everybody, live. Alright, cool, here we go. I'm going to trigger an error from my phone; I developed a nice little drush command (I'll show a rough sketch of what it might look like in a minute), so what I'm going to do now is, via drush, make things go horribly wrong. I've defined my escalation chain, and... oh, hello. Something is happening. I've been alerted that something broke. If you're a support team, everyone gets to see this, everyone knows something needs to be fixed; it's your high-value property, your high-value customer, and you should go fix it. At this point I'm going to call Suzanne.

Cool. So you've been hearing some beeping and probably wondering what that is: that's my phone going off every time there's an alert. I'm going to head over to PagerDuty; oh, there's my error and there are my alerts. I'm going to click here and reassign this, and say, hey, this should probably go to Suzanne; no, don't escalate to me, because... oh, oh man, I guess the right person is getting the phone call now, the event was escalated, so there's no need for me to take action here. Maybe I'm on the train, I'm on the subway, I'm on a boat, so I can't get to my computer, but because we're communicating and we're connected, the moment that happened Suzanne got it; she's got it covered. So now we can go ahead and resolve it. Because we know what the problem is, we can say, hey, what's going on over here, we're seeing some unusual activity and you've set off one of our SLAs. We head over to the log dashboard and see some unusual activity in how cron is running, and bam, there it is: the problem we said we should fix needs to be fixed ASAP. So now we can reduce the time it takes to resolve issues. Oh, but we made a mistake: I forgot to resolve the issue. Go ahead and resolve it, boom. Oh, someone's changed it; let's see... ah, because I resolved it on my phone, now it's resolved. Suzanne has worked her magic and the site is back online. We used the playbook we established when we decided what the risks to the business were. By doing it this way, Suzanne acted as incident commander: she's basically taking care of everything, she's updating the status page, and she's communicating. Between the two of us we've managed to control an incident, address an incident, and resolve an incident, all while we're in this demo.

So I think the takeaway here is that you can be in charge of your own destiny. You can do this now; these services are available. Stop building too much into Drupal and having it do all of this for you. Offload it, and let your application focus on what's important: if it's a commerce site, focus on commerce.
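And in case you're curious what that little "make things go horribly wrong via drush" trick might look like, here's a hypothetical Drush-style sketch; the command name, module name, and messages are invented for the fire drill:

<?php
/**
 * Implements hook_drush_command().
 */
function mymodule_drush_command() {
  return array(
    'fire-drill' => array(
      'description' => 'Log a critical error so the on-call escalation chain can be rehearsed.',
      'bootstrap' => DRUSH_BOOTSTRAP_DRUPAL_FULL,
    ),
  );
}

/**
 * Command callback for `drush fire-drill`.
 */
function drush_mymodule_fire_drill() {
  // A critical watchdog entry flows through Monolog to Loggly, which trips
  // the saved-search alert and pages the on-call person in PagerDuty.
  watchdog('fire_drill', 'Simulated outage triggered by @who', array('@who' => 'drush'), WATCHDOG_CRITICAL);
  drush_log('Critical error logged; watch PagerDuty.', 'ok');
}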
One thing I'd like to highlight is that this doesn't end at your watchdog. If you're using Drupal Commerce, I'm sure many of you have seen the Commerce reporting dashboard that brings your site to its knees every time you try to generate a report. To avoid this, you can send non-sensitive information over to the logs and then start doing things like finding out, hey, if I'm running a campaign, what is the turnaround from it? Where are the IPs, where are most of the visitors to my site coming from? Do I need to concentrate my marketing efforts on the EU? Is it better for me to run my campaign in the morning or at night? A lot of this is about how it's bound together with the information you have, and you need to use it; it's sitting right in front of you.

On that note, I think we're towards the end. I'd say this before I conclude: please, please, please, please start aggregating your logs, standardizing your logs, centralizing your logs, analyzing your logs, and alerting on them. It is extremely important for increasing the quality of our Drupal sites, the performance, the expectations, and most importantly for maintaining your brand and your reputation when things matter most. Thank you very much.

Any questions? I was asked to direct people to the mic, or you can shout out the question and I can answer.

So the question was: how do I standardize my logs? You deal with logs for a living, so you're more seasoned than most, and you actually touched on something interesting. First point: separate your application logs from your server logs. Why that's important is agentless collection; I can use the HTTP API to at least get my application logs, and then you have your server logs. There are various tools such as Logstash, which you mentioned, that allow you to do the transformations and normalize your logs into a standardized format. Maybe we'll talk a bit later about your actual log problems and how they relate, but it's definitely an issue; part of the reason you need to standardize is so that you're not doing this dance on the fly every time you encounter a new log format. That's why standardized log formats are so important.

And your second question was about real time. It really depends. I think what you'd want is definitely rsyslog doing it, or agents; we have a script you can run that will go grab the logs, via SFTP or rsyslog. I think your best bet is definitely setting up a Logstash server as an intermediary: it then acts as the broker, does your transformation of the logs, and then you can ship them off. Or you can have rsyslog on the servers themselves, but it takes some work, especially if you have a distributed system, to wrangle them and say, hey, this is the canonical place for you to drop your logs.

Then from the audience: hey, I think you mentioned earlier that you were using Loggly's REST API to push. Have you found any performance problems with that, or run into situations where that would actually exacerbate a problem?
Yes, so Eric brings up a good point. One of the things we looked at earlier was development and logging profiles, and sensitivity to performance is something you should take pretty seriously. I don't think you should be running the Loggly REST handler in production; you should probably be using the rsyslog agent if you have access to it, and avoid the REST push where possible. What we generally recommend is: if you're doing this during your development, your integration, and your testing, then you shouldn't have any problems when you get to production, because you can turn it off and just use the syslog handler. That's why I had it configured like that, but yes, you should consider it. Thanks.

I'd just like to thank the live demo gods for allowing us to get out of here alive. Alright, thank you very much.