OK. So I will be talking about monitoring applications, and about evolving from text logs to extensive automatic monitoring. My name is Sven Finke, I'm working at Shopware in Germany, and I'm one of the developers for Shopware Connect. You can follow me on Twitter, maybe. Shopware Connect is one of my main projects, and I will be using it as the example because I'm mainly responsible for DevOps in this project: taking care of the servers and keeping them running. If anything bad happens, I'm the person who will be asked, and I should at least be able to give some intel on what exactly is happening right now. Just a little bit about what Shopware Connect is: Connect is an e-commerce application where suppliers and merchants can connect to each other. Suppliers can supply products to the platform, and merchants can subscribe to those products. If somebody then orders something in a merchant's shop, the order is directly submitted to the supplier. For me, that means I have to synchronize product data, including prices, descriptions, availability, images, everything. I have to synchronize orders. I have to make sure that if an order is placed, both supplier and merchant are on equal terms about this order, so that the prices are right, the availability is right, and everything is fine. So the system really needs to run all the time, and if anything fails, that can become pretty bad. One of the things we are using (and I just noticed all the images on my slides are cropped, damn) to make sure we finish all our tasks in time is Beanstalkd. I'll be coming back to it a few times throughout the talk, because it's hard to analyze and find out what's going on if something crashes in some Beanstalkd jobs. Beanstalkd is a message queue, for everyone who does not know it. A message queue can be used to make tasks run asynchronously.
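As a small aside, Beanstalkd speaks a plain, line-based text protocol on its default port 11300, so the whole produce/consume cycle can be illustrated with nothing but netcat. This is only a hedged sketch: the job body and the job id are placeholders, and `-q1` is the GNU netcat flag for closing the connection after sending (BSD netcat uses `-w` instead).

```shell
# Producer: "put <priority> <delay> <time-to-run> <bytes>" followed by the body.
# Here: priority 1024, no delay, 60s to run, 5-byte body "hello".
printf 'put 1024 0 60 5\r\nhello\r\n' | nc -q1 localhost 11300
# The server answers with "INSERTED <job-id>".

# Worker: reserve the next ready job, handle it, then delete it.
# The "1" is a placeholder job id; a real worker uses the id it was given.
printf 'reserve\r\ndelete 1\r\n' | nc -q1 localhost 11300
```

In real application code you would of course use a client library rather than raw sockets (in PHP, for example, something like Pheanstalk), but the protocol view makes the queue mechanics easy to see.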
So you put tasks into the message queue, and task handlers pull them from the queue and handle them. That way, you don't have to wait for long-running processes inside your request or when you start a command: you just create the tasks, push them to Beanstalkd, and the responsible handlers take care of them. In Connect, we have seven individual components: an indexer, an updater, and other specialized components that take care of several things. We have four servers I have to manage. Two of them are in the live system behind the load balancer, so they should always be identical. The other two are used for staging: staging one and staging two. One is used for QA purposes, where there is always something going on and things often break due to testing; the other is the demo environment, which can be used to show the product to customers and potential customers. I have 22 cron jobs running on each of these machines that I have to keep track of, which is also sometimes not that easy, and 33 workers that are managed through Supervisor. In total, we currently have around 2,300 users and 690,000 products, just to give you a little sense of scale, and we produce around 30 gigabytes of log files each month. In case something goes wrong, I have to dig through these log files, so that might be a problem.

Background monitoring. Through all of this, we have a lot of data that I have to manage. So what is background monitoring? Monitoring is a term meaning I want to find out what's currently going on: show me the status of my current processes, whatever tool is used, and let me manage them. With background, I mean that I want to keep track of all the stuff that is going on in the background, that I don't have track of through any UI. I don't have a web interface to see what my cron jobs are doing. So: any kind of process, daemon, worker, whatever you may call it.
Anything that runs in the background and is fired automatically, I want to track. But how does it work right now, without any monitoring? You get feedback, but because you have no measurements, you can't fix something before a real crash happens. You will probably get feedback from the customer. So you get a message: the sync is not working right now. Bad. Maybe there's just some data wrong for the customer, so you can look into the database and hopefully fix it very fast. Maybe some products are not updated; also something you hopefully can fix very fast without too many issues. Or maybe the site is down. This will create chaos. This is not a good thing to happen; you don't want your site to be down. And in case something like that happens, you need to fix it, and to fix it you probably need to gather some intel. This is a little screenshot of the log files we have for Connect. This is by far not all of the log files, and you can see that some have been rotated away by logrotate: those are old files from the last days, and this is the current file. If something happens, I need to dig through these files. If it happens right now, that might be easy: just open the latest log file, tail it, try to cause the error again, and you may find the exception. If something happened on Friday afternoon and you're looking at it on Monday, you probably have to go a few files backwards. Depending on how many logs you generate, maybe you just have to look at log number two, or maybe you rotate them more often; that depends. If the exception happens in just one component, that might also be quite easy: just open that log file and everything is fine. You get a big problem if an exception does not occur in just one simple component but in several ones, so it spreads throughout your application. Then you might have to open two or three log files to really see what is happening, what caused the problems, and how you can fix them.
If that was a few days ago, you have to find that exact time in each of the log files, and that can be a pretty horrible task. But what's the real problem? The errors already exist. If you get feedback from someone that something crashed, it already crashed. So that's bad. The error already persists: if the synchronization is not working, it was not working at some point, it's not working right now, and that's not good. Customers notice errors, so they lose trust in your application. If this happens more than once, they will probably become pretty angry, and if they have to tell you that something's wrong, they can't trust that you can fix anything beforehand, that you have control over your application. Finding stuff is hard, like I just mentioned, with log files and timestamps. And like I said, the errors persist: if you have an error that is not easy to fix, it may take a whole day or longer to fix the issue, and that whole day the customer can't really work, depending on the error, or your site is down for half a day or a day. That's not good. So, monitoring to the rescue. I will talk about three different components: log aggregators, metrics aggregators, and performance analysis. These are the three components we're using in Connect, and they have really saved me a few times when something was going wrong, or made me realize that something could go wrong in the next days. So the first thing: log aggregators. What exactly is a log aggregator? Or more precisely, what does a log aggregator do? First, it gathers logs. All the log files we've seen before will be gathered by the log aggregator and centralized. It will index the log files, which gives me the ability to search them: it adds a full-text search over all the log files, and it adds a chronology for me, so I can jump to that Friday afternoon and see everything that happened there.
I can filter the log files for that exact time and see what happened and what caused my problems. I can group stuff by environment. Like I mentioned earlier, our live system is made up of two servers. That means if I have to find something, I would have to jump to each server, look at the log files, and find out whether it is happening on just one server or on both. The log aggregator groups stuff by environment, so I just look at the production environment, the live environment, and I will see: is this happening on one server or on both? What's going on? And something that's maybe more subtle: I can make logs available to more people. I don't want to give everyone access to my servers. I don't want to give the trainee SSH access to my servers to look at log files; maybe by accident he breaks something on the live system. Not a good idea. The log aggregator is basically a website I can visit, so I can give more people access to the logs without giving them access to the servers. I can also give access to the log files to people who are not familiar with the terminal, who wouldn't even know how to open the log files on the servers. Another cool thing: I can set alerts on messages. But let's look at the log aggregator. Oh, the quality is great. This is the overview of our production environment. You see a lot of log messages, and here's something I had to jump into and fix, because we have a critical error here, and I can immediately see that. We have two, okay, it's not visible, we have two different environments: some of the messages come from backend one and some from backend two, but we see them in one big overview. I can filter the messages down below, and here I could select production as the environment I want to look at all the time. By the way, the tool I'm using here is called Papertrail. There are plenty of other tools out there that can be used, also open source tools; this one is a paid service.
Which has the advantage that I don't have to set up my own log aggregation service. Just look around and you will find different solutions. Here I can see that I am displaying two different log files. In fact, there are a lot of other log files in here that are not currently visible. I can also filter them, so I see only the entries of a specific file, a specific environment, or just one specific server. So there's plenty I can do there. Alerts, I quickly mentioned them earlier: I can define alerts. We basically have two different alerts, and they are duplicated: we receive a message in HipChat if one of these alerts triggers, and I also receive a mail about the issue. So even on the weekend, if something really critical happens, I will get a mail and can decide whether to try to fix it right away or say no, it can wait till Monday. This is how an alert can be configured. You basically write a query; what you've seen down in the search field, you can just copy and paste there. You can limit it to a specific log file. You can define a frequency, so if something is going wrong, I want to be informed about this error, in this case every hour. And you can define a threshold: sometimes the filter matching, say, five messages per hour is fine, things can go wrong and that might be okay, but if you suddenly receive 60 or 70 of these messages, something is probably wrong. So you can define that. The other settings here are the definitions for our HipChat integration. That's how you can define an alert. Of course this differs a bit from tool to tool, but basically this is how things can be set up. But how do you set up such a log aggregator? It's quite simple. For Unix or Linux system logs, you just have to do three steps. First, you have to detect which syslog daemon you are running; there are different ones out there.
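Taken together, the three steps look roughly like this on a machine running rsyslog, plus the extra step of shipping a custom application log with Papertrail's remote_syslog2 daemon. This is only a hedged sketch: the hostname, port, and log paths are placeholders that your own Papertrail account (or other log service) would supply, and the paths are assumptions, not our actual setup.

```shell
# Step 2: forward everything the syslog daemon sees to the aggregator
# (UDP forwarding rule; host and port are account-specific placeholders).
echo '*.*    @logs.papertrailapp.com:12345' | sudo tee /etc/rsyslog.d/95-papertrail.conf

# Step 3: restart the syslog daemon so it picks up the change.
sudo service rsyslog restart

# For custom log files, remote_syslog2 reads a small YAML config:
sudo tee /etc/log_files.yml <<'EOF'
files:
  - /var/www/app/logs/*.log        # placeholder path to the application logs
destination:
  host: logs.papertrailapp.com     # placeholder destination
  port: 12345
  protocol: tls
EOF

# Start the remote_syslog daemon; it tails the files and ships new lines.
sudo remote_syslog
```

The same `destination` block works for other syslog-compatible services, which is why this approach isn't tied to one vendor.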
Then you have to change something in your syslog daemon's config file; I had a few lines for that here. Can I scroll there? It's not everything. Yeah, I can scroll. Ah, the third step was hidden: in the third step you restart the old syslog service so it accepts the change. That's how it's done for system logs. But in my case, I don't only want to ship system logs; I have my own custom log files I want to ship, and most of the system logs don't really interest me at all. There's also a solution for that: I can push any existing log files too. Here is remote_syslog, a tool provided by Papertrail for exactly this. I just download and install it, and they provide a configuration file; this is the standard setup for Papertrail. I can define different log files here (in fact, our definition is a bit longer), and I can set the destination. This also works for tools other than Papertrail. After I've done that, I just start the remote_syslog daemon, and the log files are pushed to Papertrail. So the setup is really quite easy. We've automated this with Ansible; it's just three commands and you're good to go. Very simple. And this is obviously not limited to PHP; you haven't seen a single line of PHP code yet. This can be done for anything running on your machines.

The second part: the metrics aggregator. What is a metrics aggregator? Guessing by the name, it's pretty similar to the log aggregator, just for metrics. But what is a metric? Metrics are basically just numbers, but in a context. You can see here a quick overview of our metrics; this tool is called Librato. Same here: there are lots of tools out there, so depending on what you currently have and what your needs are, pick the right tool for you. In this case, we have some metrics here; up at the top you can see we have 164 metrics pushed to the service.
And a metric is really just a number. You can see here we have both environments, backend one and backend two, and every peak, every point where this line changes, is a number that has been pushed to the service. We define a name for it that tells us what kind of data it is. These are really just integer or float values, depending on the context, and without knowing what they are, they wouldn't be very helpful. With this data, you can create dashboards like this one, which gives me information about the current status of the application. In fact, I have this dashboard on a TV screen on the opposite side of my desk, so I always see it, and if something goes wrong, I can clearly see that. A little example, which you unfortunately can't see well, damn. Okay, this is a view of the past four weeks. Down on the left, you have the Beanstalkd jobs that are currently in the queue, and it should look like this: you have these small peaks and everything is fine, jobs are being pushed to the queue, they are handled, and everything is good. What you can't see here: I had a very high peak to the left, so high that the normal peaks basically disappear. There I had around 40,000 jobs in the Beanstalkd queue. It was not that bad yet, because nothing was really broken: the memory was not full, no processes had been killed. Without monitoring, I wouldn't have noticed that something was going wrong, but the queue was filling up more and more. If this had gone on, after one or two days the memory would have been full, and I would have had a real problem. The cause was simply that the jobs couldn't be handled as fast as they were being pushed to the queue.
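Under the hood, each of the numbers on these charts arrives at the service as a single HTTP call. As a hedged sketch against Librato's classic metrics API: the metric name, the source value, and the credential variables below are made up for illustration, not our real setup.

```shell
# Submit one gauge reading; basic auth uses the account email and API token.
# "source" is what distinguishes the two environments (backend1 / backend2).
curl -s \
  -u "$LIBRATO_USER:$LIBRATO_TOKEN" \
  -d 'gauges[0][name]=beanstalk.jobs.ready' \
  -d 'gauges[0][source]=backend1' \
  -d 'gauges[0][value]=42' \
  https://metrics-api.librato.com/v1/metrics
```

Because it's just one request, the same call can be fired from a cron job, a worker, or application code whenever there is a number worth recording.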
If I hadn't seen that in time and reacted, processes might have been killed, or worse. But with the dashboard I saw it, and I could react and fix the issue before anyone noticed. Here we can also define alerts. This is the definition for one. As we only have numbers, you just select the metric you want, so the context of the number, and the trigger condition. In this case: if the average is above 50,000 for 30 minutes, then something is wrong and the alert fires. This one is for the indexer, the component that takes care of our products being indexed into an Elasticsearch database. And how is the pushing done? It's even simpler than the log setup; it's just one curl call. I have my user credentials up there, and I pass some data: the name of my metric, and I can define some tags. In our case, we tag which environment it is, backend one or backend two or whatever, and then I just pass the name of the metric and the value. This is just a curl we fire from the PHP code. We have written a wrapper for it, so I only have to pass the name and the value into a function, and you can call this at any time in your code to push values. In fact, we have written a Symfony command that collects lots of metrics, like the current number of users and products and so on, and pushes them to Librato. So far so good. The last component: performance analysis. Why performance analysis? This is not only useful for making your application run faster if you realize it's quite slow. You can also recognize irregular changes: you haven't done a deployment, nothing changed, and suddenly the response time of the live environment goes up. Then something is wrong; there might be a bottleneck, but you won't realize it at the beginning.
I'll come back to that later. And you're able to gather intel: if something is going wrong, if there are irregular changes, you can easily identify which exact function call causes the trouble, whether it is a slow SQL statement, or whether one response carries very much data, or anything like that. So this is an overview in Tideways; again, it's not shown completely. We have our two environments listed here. This dashboard can be used to see what changed in the application recently, so I can see, hey, the response time dropped by 33%. That is okay for me. But if it had dropped much further, into an area where it never normally fluctuates, to a level that is not normal for your application, then you should also worry. So not only an increase in response time, also a sudden decrease could be bad. This gives you a good insight into what happened, plus requests per minute and some other data. But how does this look in detail? This is the overview of our social network, the main front end the customers use. We have an average response time of 100 milliseconds here, and we can see that there is some peak in the requests for the site, and how the site reacts to it. We can also see down here how fast individual requests are. Some of them take just 80 milliseconds, and we have one here that's a bit higher, around 1.3 seconds. Maybe we should take care of that. What's currently not visible on the right side is a column which indicates the impact of each request. Because even if one action needs 1.3 seconds, if it's called just once a day, it might be okay; it really depends on the situation. In fact, the entries are ordered by impact: the top one, the 80-millisecond request, has by far the most impact of all these results, and the others are just around 5% or so.
But let's look at the start action, because that one is quite high. We can get a bigger overview of the recent calls and how things changed over time. Here we can see that some of the calls need very, very little time; okay, "very little" is a lie, it's 500 milliseconds, and sometimes the database call takes over two seconds. So you can investigate further. You can also see these labels here: they indicate slow queries. Sometimes not so slow that they would be recognized by your hardware monitoring or your hoster, but regularly slow for your application, so you can handle them. You can also get stack traces and more information if you need them. This is really a good tool for noticing when something changes, because a change from 100 milliseconds to 300 milliseconds is an increase of 200%, but nobody will notice a response time change from 100 to 300 milliseconds by hand; this tool will show it to you. And how is this set up? We're using Tideways here, and you simply install the PHP extension and the daemon on your system. In most cases, you don't have to add anything to your PHP code at all, but what you can do is tell Tideways which framework you are using in your application. In our case that's not WordPress but Symfony 2, and Tideways can make better guesses about what's happening in your application if it knows the framework. So we have these three tools; they give you good insight, and you can use them all together. Hooray, awesome. But is this really all necessary? Do you need all three tools to handle this stuff? Yes. I have collected some of the issues we had in the past, and one of them I've already talked about: the Beanstalkd queue running full. That's one of the issues I would never have recognized without background monitoring; this way I could solve it before it became very bad. Next, the response time increase.
We actually had a case where the response time increased to 300 milliseconds, and we only realized it through the performance monitor. The cause was a bottleneck. A bottleneck that was not really bad right then, but if we had suddenly gained a few more customers uploading tens of thousands of products, it would have gone bad. This way I could fix it before the response time went up to one, two, or three seconds, or any tasks ran into timeouts, and that was a good thing. Without this monitoring, I would never have realized it. Next: a supervised task being killed. One of the things on the metrics dashboard is actually how many of my Supervisor tasks are currently running. If one of them fails because it was killed for some reason, I probably won't notice, depending on what kind of service it is, and maybe the synchronization for a customer just doesn't work anymore. Because of the metrics, I will see if a Supervisor task is killed. The Supervisor tasks are where all the workers run in the background; they do things like synchronizing, fetching updates for products, and starting the indexing process. And something else: specific tasks crashing due to exceptions. The exceptions I can see pretty easily in Papertrail: if exceptions are currently being thrown, I will clearly see that, thanks to the stack trace that is visible in Papertrail, at least in our case, and they are clearly distinguishable from other log entries, so they're easy to detect. We also had this a few times. So those are the cases we had where we really needed all these tools, and every single tool helped me a lot. But now to each one of them. First, the logs. Log aggregation is very useful for complex applications with many components.
If you have a simpler application with just one component and one big log file where everything ends up, then you probably don't need a log aggregator. Also, if you don't have multiple servers in one environment, a log aggregator is probably not necessary either. Another case where a log aggregator is useful is when many people benefit from access to the logs. If you have a QA team or a support team that would benefit from being able to read the logs, when a customer comes to them and says the shop synchronization is not working, or that he can't log in anymore, then it might be useful for support to be able to open the logs and see whether something is wrong with that customer. And sometimes log files are located on remote machines you can't quickly access. This is not a problem we generally have in web development, but I have had it in the past, where I would have to call somebody to give me access to a remote machine because it was in an internal network. That can be a pain in the arse if there's a serious issue right now and you just have to wait for them to react. That could also be solved by a log aggregator pushing the messages out; of course, if it's an internal network and the machine does not have direct internet access, you have to think about how to make the connection to the log aggregator possible, but that might also be a case where a log aggregator is useful. Metrics: when are they useful? First and foremost, do you have data that can be tracked? If you have no data that could be tracked as metrics, you obviously can't use them.
If a lot of stuff happens in the background, even without such clear numbers, you might be able to create some. We do things like this for Beanstalkd: when a task is pushed to the queue, we increase a counter, and when it finishes successfully, we increase a different counter, so we can clearly see how many tasks have been pushed to Beanstalkd, how many have succeeded, and how many should still be in there or have failed. But to do that, you need clear workflows. Without clearly defined workflows, this is not possible, and you don't always have them. In most cases you will have some starting point and some endpoint and you're good to go, but you will not always have these clearly defined workflows. And one other thing about metrics: the sales department. Sometimes they just want some numbers; they like fancy dashboards. But if you have no clear workflows and they can't see any meaningful numbers there, it's pretty useless for them too. And now the last component again: performance analysis. You will probably need it, or can make use of it, if you have complex tasks that use a lot of performance, or could potentially use a lot of performance. Detecting bottlenecks during development is pretty hard, because we're working on machines that don't have as much load on them as the live servers eventually have. You could simulate that, of course, but in many cases the simulation is not close to reality. So if you have complex tasks that potentially consume a lot of performance on the live servers, performance analysis and monitoring might be a good idea. Also: is your application likely to run into bottlenecks?
This is a bit hard to determine beforehand, but if you suspect there are a few functions that might become a bottleneck in the future, say there is a sudden increase of products or other data in your application and it all needs to be processed, it might be a good idea to use this monitoring just to find the bottlenecks before they become bad. And of course: is your performance already bad? If you already have a slow application, it might be a good idea to use performance analysis to find the slow parts of your application and fix them, and then keep the monitoring afterwards to identify other areas that become slower over time. That's it from me for this talk. Any questions? Yes. "Which tool did you use to show us the dashboard?" The dashboard, that was Librato. Okay. "Thank you for the presentation, a quick question about the tooling." So the question was whether I have considered using things like Kibana or Logstash, tools that combine some of this functionality, instead of using all these different components. Yes, I've considered that. Right now I'm looking into more centralized solutions for this. But even if it's centralized, these three components remain: even if metrics and log aggregation are put together into one application, I still have log aggregation and metrics that I can use separately in some cases. We're currently not using things like Logstash because all the tools I have shown here are hosted somewhere; I don't have to take care of any servers, I don't have to keep another server running. As far as I know, Logstash is a server tool I would have to set up myself. Oh, it's also provided as a service? Ah, okay. But that's one of the reasons why we are using these particular tools right now: they are hosted solutions.
And this was set up quite a few years ago, so I'm not sure whether Logstash already offered a hosted solution back then, and I was not involved in picking these exact tools at the time. But yes, I'm currently also looking for some more combined tools. Next question: "Do you have any blocking in your application when sending data to the metrics platform? Where I work, we also send such data, but we write it to a local buffer first and ship it from there, so we don't block our process by sending it to some other server. I'm wondering if you have issues with that." No, for us it's just fire and forget. We fire the request at the service and don't really care about the response. If the service happens to be unavailable, we don't really care either, and it doesn't noticeably affect our performance. I forgot to repeat the question, I think: the question was whether the metrics aggregation, when we are collecting this data, affects the performance of the application in any way, and for us it doesn't. Any other questions? Okay. Thank you.