Hello. OK. Hey, guys. So this is room E112, and the session is DevOps. If it's the case that you are in the wrong room, this is the time to switch rooms. And we're going to start right away with Philipp Krenn and his 360-degree monitoring of your microservices. Just one more thing: there is an Android app for DevConf, and you can rate all the speakers here at DevConf. There will be some prizes at the end and so on, so please check the app. All right. Thank you.

Yeah, that's working. Good morning, everyone. This is slightly yellow, and on my screen it's more pink, but I guess you'll manage. So let's get started. Services, microservices, all the rage. I'm actually from Elastic, the company behind Elasticsearch, the Elastic Stack, all of these things. And I'm part of our infrastructure team, so we provide Docker containers, AWS services, automation tests, that direction. And that, then, is kind of a UNIX pipe: I pipe the infrastructure work into developer relations, so I'm out at lots of conferences, like today. The starting point of this talk was basically that tweet that, thanks to microservices, every outage becomes a murder mystery, because you don't really know where to start or what is going on. That was the starting point and why I wanted to do this talk and just put all the different pieces together. And this is a bit of a trial; it's the first time I'm running this talk, so let's see how it goes. The general point is just logging and monitoring all the things: everything you have in your distributed infrastructure, we want to collect and then put together. So generally, show of hands: who is using all the latest shit, like microservices, containers, Kubernetes, whatever? Yes. Not sure if it's even the latest stuff anymore, but yeah, highly distributed things, I guess. Then the question is, how do you log and monitor? Who uses SSH plus tail? You can admit it, it's fine. No? OK.
Who uses our stack, the Elastic Stack, the ELK Stack? Anybody? OK, that's a few. What are people using otherwise? Splunk. You must be rich. It's always nice; we always have a dear place in our heart for you. I know Splunk is very popular, though it is often pricey. But I'm not going to bash Splunk or go into the Splunk details right now; we can do that afterwards, or over a beer later. So let's take a look at what we'll try to build today. This will be live and will probably fail miserably, but let's see how far we get. I'm starting with probably the simplest Spring Boot app you can do. It's just something that says hello world back when you call it, and we basically want to monitor that. And if it calls different things, we want to monitor that as well, and get all the information from the entire thing, as far as we can get. So I wouldn't call it simple; I would call it stupidly simple. We are not doing any of the fancy stuff like auto discovery, load balancing, all of the things that I know Spring Boot has and many people are using. We don't want to make it too complicated; we really just want to focus on the logging and monitoring part, so we'll leave out all the bells and whistles. In return, the code will be very slim and easy to keep in your head. So we start with the Spring Initializr. By the way, who knows Spring Boot? OK. Who is developing in Java? OK. Let's see how that goes. I mean, it should be really simple, so it should be easy to keep track of what's going on. So I'm starting off with the Spring Initializr, which is kind of awesome. I remember, back 10 years ago, when I started projects it was a week-long effort just to put together all the dependencies that would work together and get all the packages you would need. The Spring Initializr just gives you a curated set of stuff that is known to work well together.
And you can just type in: I want to have these features, and it will package the right base package together for you, and you basically only need to fill in the holes with your own code. So it's very minimalistic. What I've already done is I have fetched the most basic Spring Boot application. We can actually show that, once you have downloaded it. I'm using Gradle; I don't like Maven anymore, for all the XML. So Gradle just gives you the base dependencies. I'm using the upcoming release 1.5.0, which is at release-candidate stage at the moment. Is that big enough? Probably not. Better? It's still kind of weirdly stretched, but OK, whatever. Now it's not weirdly stretched anymore. It was too much. Thanks. I'll try not to touch it anymore. So I'm using Spring Boot. We don't have any real dependencies. The only thing that we have here is one base class that doesn't do anything right now; it's just running the application. This is just what I have downloaded so far, and you could totally do the same thing: if you go here, you define a group ID and the artifact name, and you just generate the project. And that is what you get, which I have loaded into IntelliJ already. Totally simple. So that you don't have to watch me type all the time, I have put all the steps I'm showing you into branches. I will simply switch branches and show you the diff of the code, so you don't need to watch me type, mistype, and debug for 20 minutes. When you run the main class in your IDE, it will just run the program and show you what it can do. At the moment, it cannot do much. We will actually not run it in the IDE that much, though; since this is a DevOps topic, I have a virtual machine with all my dependencies and stuff, where I just deploy it and run it. Deploying in that case is very simple again.
It's just copying the JAR file over and running it. It's a plain Ubuntu 16.04, and it is running the entire Elastic Stack as well, so all my monitoring is also going into that VM. I could have done it on the internet, highly distributed, but I wasn't sure how good the internet would be, so I kept everything local. Hopefully that's working reliably. For those of you who don't know the Elastic Stack: Elasticsearch is the thing that's storing the data; that is where we'll try to throw in as much data as we can. Kibana, then, is the visualization, where you can actually explore your data and graph stuff out. Logstash is there to parse stuff; we're not using Logstash today. We're relying on Beats. These are lightweight agents written in Go, so they're kind of like shippers: they will just ship the data into Elasticsearch when needed, and we'll take it from there. So the first step is that we want to add some HTTP interface so that we can actually call this somehow, and then we can package this already. What we need to do, jumping back to the presentation mode here: we have changed it to start as a web application, so we can actually run a web application. And then, on the base URL, we simply return the hello world. This is nothing fancy yet. We can still package that together and make it run. Oh, come on. So we're just building a package, and I'll then just copy it into my virtual machine and let it run there. Yes, I'm just copying it over, and then I can simply run it. If you've never run a Spring Boot application, it is super simple: it's just java -jar and whatever we have built. We can simply run that and then curl the thing. So we're starting up the Spring Boot thingy with nice headers, and it should automatically bind to port 8080, and it has initialized.
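The whole application the talk walks through fits in one class. Here is a minimal sketch, assuming Spring Boot 1.5-era conventions; the class and method names are illustrative, not the speaker's exact code:

```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@SpringBootApplication
@RestController
public class Application {

    // GET / returns the greeting; no discovery, no load balancing, no bells and whistles.
    @RequestMapping("/")
    public String home() {
        return "Hello World";
    }

    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }
}
```

Gradle packages this into a single runnable JAR, which is what gets copied into the VM.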
And when I do a simple curl, it says hello world, and it's already running. So the first thing I want to monitor is: A, what is going on on my system? Do I see that I start my small JAR? In that example it's so small that I don't really see it on the system level, but we will see that. And secondly, what web requests or HTTP requests do I have in general? So we've built it with Gradle, we've deployed it, we have run it, and the test is, well, calling it. Then we want to dive into the system metrics: what is actually going on on the system? For that, we will use Metricbeat, the beat that monitors your system, and Packetbeat, which is like Wireshark. The only thing is, Wireshark is very cool as long as you have a single node, but if you have multiple ones, you will need to collect on multiple servers and then combine the traces, whereas Packetbeat will collect the stuff on lots of servers and combine it for you already. So, what we actually have running here: I have started the Beats in the background already. It's a simple DEB package, for example, or RPM, that you can install. For Metricbeat, I'm just showing you the configuration. What I'm telling it is: I'm interested in all the system metrics, load, cores, disk, network I/O, and I want to collect that every 10 seconds, for every process I have. I don't have containers now, but if I had cgroups, it would automatically get the cgroup information as well. And it simply stores the result into Elasticsearch. I'm doing exactly the same thing for the network traffic as well. There I'm saying: flows is any TCP connection, so I see who is actually communicating with whom. Then I'm collecting DNS information, which might not be that interesting in our example. What will be more interesting is that I have an application running on port 8080, the protocol is HTTP, and everything that goes in there I want to store in Elasticsearch as well.
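The two configurations described here look roughly like the following. These are illustrative 5.x-era fragments, not the speaker's exact files, and the host and port values are assumptions:

```yaml
# metricbeat.yml (sketch): system metrics for every process, every 10 seconds
metricbeat.modules:
  - module: system
    metricsets: ["cpu", "load", "memory", "network", "process"]
    period: 10s
    processes: ['.*']          # every process; cgroup info is picked up automatically

output.elasticsearch:
  hosts: ["localhost:9200"]
```

```yaml
# packetbeat.yml (sketch): TCP flows, DNS, and HTTP decoding on port 8080
packetbeat.flows:
  timeout: 30s
  period: 10s

packetbeat.protocols.dns:
  ports: [53]

packetbeat.protocols.http:
  ports: [8080]                # the Spring Boot application

output.elasticsearch:
  hosts: ["localhost:9200"]
```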
So let's see what that data actually looks like. I'll just quickly do two more curl calls, so we can actually find the curl calls. If you've never used Kibana: this is Kibana, which is kind of the window into Elasticsearch and all the data we have stored. And I'm using the so-called Discover tab. Discover is just: what data do I have, and can I see what is going on in there? Here I have all the indices where I have collected data already. The two we're interested in right now are the network packets and the system metrics. So let's see which network packets we have collected over the last 15 minutes. It's loading, and yeah, there is some traffic going on, but we should be able to see only the HTTP calls we've done. So here I have all the TCP connections that are going on on my system, which is pretty noisy; I don't really care for that. We have the field type at the bottom, and, you probably cannot really see it, but it says 93% of my traffic was just TCP flow information, and only 6% of all the network interactions I had were HTTP. So I can filter down on that. There's a magnifying glass with a plus, which will automatically limit all my traffic to that, and then we can actually see what is going on. So we have something with an HTTP response code 200; it called port 8080; the path was slash; and the protocol was HTTP, as we defined before; and it was called on localhost. So something is very continuously calling my interface here. And you can see that in every time slice, which is, I think, half a minute, it's calling that three times. So every 10 seconds I have something that is pinging my HTTP interface. And I did some more calls manually; those were these ones here and this one here.
If you had real traffic on your site, then without instrumenting your application or anything, just based on the information you have on the network interface, you would see how many calls or requests you actually have to your web application. And that information also includes stuff like: how long did one of these calls take? So we can then dive into how long your calls take and what your most commonly called URL was. Since we only have one base URL, it's not that interesting which URL it was, but you can easily monitor that. We'll jump into a dashboard in a second, but let's go to the metrics here right away as well. Obviously the metrics don't have a type of HTTP, so don't be confused if nothing is showing up: either you have tried to filter down your results, where instead of the star I could search for stuff, or I have filtered for specific types. In my example, I filtered for HTTP; once I remove that type, I will get quite a lot of results. I will pause the auto-refresh, since this will really reload automatically. And you can see: OK, this was enriched with the cgroup information; this is actually collecting, for this service, what is going on in my system. Or, for example here, my process named bioset was sleeping. So it's collecting, for all the processes, every 10 seconds, all the information for each one of these processes. Think of it a bit like top, except you have stored it and you can visualize it afterwards. We have predefined dashboards for that. This one is for network traffic. For example, you can see how many connections my server had over time, which is very nice to see, especially if you have highly distributed stuff and you have network issues: oh, suddenly the network connections have jumped up; I don't know, I'm killing my database because there are too many connections. That should be very easy to see.
You can see which are the top hosts, which might be interesting if you have one bad visitor, or a denial-of-service attack, who is actually causing that traffic. In my example, since everything is localhost, it's kind of easy to keep in your head. And you can see: OK, those were my top URLs. And here you could actually say, I'm only interested in the traffic that came from that one IP address, and I have now filtered again. In pretty much every view, I can just filter down to the information that I'm interested in and get just that. And for the web information, you can see: those were my HTTP calls. In total, I had nearly 90 calls, and my top URL was GET slash. Since I only have one endpoint, that's the most common one, but if you had more, it would show you a list of the endpoints you had hit. I didn't have any errors, so we can just trigger a 404, and when we let it auto-refresh, it will tell you: over the last 15 minutes, you had one 404, which I've just triggered. And since you can see slash foo, if you have some URL that very commonly returns a 404 or a 500, that is also something you might want to take care of. All of that information has just been extracted from the network headers; we're not even storing the content of the packets, just the network headers. What the network headers can give you is: this was a request, this was a response, it took 20 milliseconds, the protocol was HTTP, the return code was a 200, and the URL you hit was foo bar. All of that information we have stored, and you can visualize it and see what is going on. So the most popular URL is slash, the second one is slash foo, whatever you call here. And we have the same metrics for various things on the system-metrics side. Let's just go to processes, since that's the easiest to show. You can see: OK, that is the virtual machine I'm running; it has 87 processes running at the moment.
Yeah, this number is pretty stable, so we don't have any big jumps here. Guess which process is taking up the most memory? Unsurprisingly, even though you cannot read it, that one is Java. Actually, oh no, sorry, that's the CPU; on the right-hand side, that's memory. That bottom area here, that is Java. That is Java as well, unsurprisingly. But it's always good to get the reassurance and see what is actually taking up memory. We have a node process here, the blueish thingy over there, that's also taking up something like 5% of memory. And guess what node might be in my stack? That is Kibana, the visualization tool I'm using here; that is a Node.js app. So whatever process you have here, it will show you. Kind of the downside: why is Java taking up so much memory? That is, on the one hand, Elasticsearch itself, and on the other, my Java application that's also running as a Java JAR. Both of them get lumped together as one Java process, so you don't really see the split. If you have a highly distributed system, you will probably only have one or two services running per server, and there you will know for sure which process is doing what. We have some other stuff running on the server, but it's mostly not that interesting. OK, yes: as long as you run them as java something, yes, they will be lumped together, since that is the process name. But I guess if you just look at top, you would also need to go down to the process ID then; at first, you also only see Java. I think we have the process name actually, so you could build your own visualization on that and split it up. For example, even if you're just running Java processes, you could say: I filter down to Java, just give me all the Java processes, and then I show them by process ID and split that up as well. Yes, give me a second. So, we're interested in Java.
I've switched back to the Discover view, so here I can see which documents I actually have, and I'm pausing the auto-refresh. So, yeah, Java is taking up one gig of memory here. Do I have the process name? I have a process ID as well. OK, fair enough. So we can actually build our custom visualization for that. Let's say we have an area chart, and we want that to be on the metrics. The x-axis is a date histogram, so it's just over time. On the y-axis we want to count, no, we want to sum something, like memory, right? For example, since we have load, system CPU, memory, system memory... no, actually, we might count the process IDs. So these are all the process IDs that I have running, and then we could split that, and then I can just limit it down to Java. So we have up to seven Java processes running on our system in total, and then I want to sub-aggregate that and split it up by the process ID again. True, we can totally do that as well. So we go for a sum on system process memory, system process memory size, and we break that down by system process ID, if I can find my process ID in here. This is the process ID. OK, so that should be just Java, and we have two processes running there. And since I have, I think, globally set the -Xms and -Xmx, both of them are the same size. So one will be Elasticsearch itself, and the other one will be my microservice. And why is my microservice now so big? I think I have globally set JAVA_OPTS, with -Xms and -Xmx, for all the Java processes there. That's why all the processes are the same size, just to avoid confusion there.
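The visualization being clicked together here corresponds to a nested Elasticsearch aggregation along these lines. The field names follow the Metricbeat 5.x schema and are an assumption, not taken from the demo:

```json
{
  "query": { "term": { "system.process.name": "java" } },
  "aggs": {
    "over_time": {
      "date_histogram": { "field": "@timestamp", "interval": "30s" },
      "aggs": {
        "per_pid": {
          "terms": { "field": "system.process.pid" },
          "aggs": {
            "memory": { "sum": { "field": "system.process.memory.size" } }
          }
        }
      }
    }
  }
}
```

Kibana builds and runs exactly this kind of request for you; the chart is one memory series per process ID over time.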
Yeah, I've just allocated, I think, 800 megabytes or something per process, which makes sense if you just have Elasticsearch running, or something like that. But if you have microservices, that's probably not a good idea, and you should do it per process. But you can already tell that from this graphic: here we have just the Java processes, split up by process ID, and then you will need to look up what each process ID is actually doing. But you can totally do that, and take it as far as you want to go. OK, so let's see what else I can show you. Application logs. We could add generic logging, which is not all that useful. What we want to do instead is log to JSON directly, so that our application does not just produce a string which you then need to parse into the relevant bits and pieces again; we write in a structured format right away. So we simply instrument it via Logback, one of the common Java logging frameworks, to actually output JSON, and I'm jumping to that branch right away. Let me quickly, we're building that now. What I have done here, just to give you the 10-second overview: what I've added is lots of XML, unfortunately. The main configuration I have done here is: I want to do JSON logging, and it will automatically roll the file over for you. So we have packaged that together, and I'm throwing that over again. What I'm telling it is: in JSON, I want to log out the severity, the service name, the process ID, the thread that is doing whatever it's doing, the specific class that is logging something, and whatever message I have in there. And I want to throw all of that into JSON. Yes, and it will be auto-rotated. And there is something called Filebeat that collects the JSON file and inserts it directly into Elasticsearch as well.
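A Logback setup along these lines is commonly done with the logstash-logback-encoder, which writes each event as one JSON object including severity, thread, logger class, and message. This is a hypothetical sketch; the file names and rollover policy are assumptions, not the speaker's exact configuration:

```xml
<!-- logback-spring.xml (sketch) -->
<configuration>
  <appender name="JSON" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>logs/app.log.json</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
      <fileNamePattern>logs/app.log.json.%d{yyyy-MM-dd}</fileNamePattern>
      <maxHistory>7</maxHistory>
    </rollingPolicy>
    <!-- emits level, thread, logger, and message as JSON fields, one object per line -->
    <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
  </appender>
  <root level="INFO">
    <appender-ref ref="JSON"/>
  </root>
</configuration>
```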
So you can then easily centralize all your logs from all your services. I have just rolled that over; I will need to kill my application and start it again, and it should produce a JSON log file now. So that is what the application is logging out: it's just JSON. And I have already instrumented it so that Elasticsearch actually collects that. If we jump over to the Beats, I have something called Filebeat here. Filebeat is just there to collect log files; there is nothing Java-specific about it. I want to see everything again. And, ah, maybe I should, OK. And if I've done it correctly, it should, no. OK, let's go back. I think I've done that. Sorry? Yeah, maybe it's the wrong version. No, it's running. You called home. It's even logging. The joys of live demos. Anyway, so we have logged a message with "home". Previously I had 1,800 log messages, and here I can just search for whatever I'm interested in, so my log message should contain "home". And if you look at that, that is what is actually stored in Elasticsearch. So it knows: that was the process ID, the log level was info, and my log message was "calling home". Whatever you put in there, you can search it, and now you could also graph it out. You can split up the graphs between the log levels: do I suddenly have more errors, or more infos, or more debugs? So you can drill down into that data again and see what is actually going on. Then there's something called Heartbeat; we'll quickly jump over that. With Heartbeat, you can define an HTTP endpoint, you ping that, and the result of that ping is stored in Elasticsearch as well. It will just store up, down, and the response time. And you can say: I expect an HTTP return code, and it needs to be a 200, or it needs to contain a specific string. So you can do all the pinging of your services you want; you can do that with Heartbeat.
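The Filebeat side of this, decoding the JSON lines directly, and the Heartbeat ping just described could be sketched as follows. Paths, URLs, and the schedule are assumptions:

```yaml
# filebeat.yml (sketch, 5.x prospector syntax): ship the JSON log file
filebeat.prospectors:
  - input_type: log
    paths: ["/var/log/app/app.log.json"]
    json.keys_under_root: true     # lift the JSON fields to the top level
    json.add_error_key: true       # flag lines that fail to parse

output.elasticsearch:
  hosts: ["localhost:9200"]
```

```yaml
# heartbeat.yml (sketch): ping the service, record up/down plus response time
heartbeat.monitors:
  - type: http
    urls: ["http://localhost:8080/"]
    schedule: '@every 10s'
    check.response.status: 200     # anything else counts as down

output.elasticsearch:
  hosts: ["localhost:9200"]
```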
That will actually come out early next week. So the software I'm showing you right now is only an internal build. We wanted to release yesterday; we found some bugs; we will release next week, and you will get all of the goodness then. OK, next up: application metrics. Those of you who know Spring Boot: it exposes a few endpoints to actually show you information on what is going on in the system, for example health and metrics. Health is just up and down; metrics is: how many requests did I serve, what is my general status? There is a community beat for that. So this is not written by us, but somebody from the community thought it would be useful to collect this. The Go code is there; you need to compile it yourself, but it will just collect these REST endpoints every 10 seconds, for example. Since we're running pretty late on time, I'll simply show you what the result of that is. There is no "home" in here. So, for example, here is what I've collected on my health endpoint: my service is up; there's the disk space of my node. And I had 233, 234 requests, and that is also continuously collected from your application. So whatever REST endpoint you have, it doesn't need to be Spring Boot, of course; if you just modify that beat, you can start collecting your own endpoints and storing them over time. And then we add the Actuator. The Actuator is what is actually exposing that: you can ping those health and metrics endpoints, we've just seen what that ends up as, and you can run the thing then. Next up: request tracing. Especially if you have a highly distributed system, once a request enters your system and then propagates through it, collecting all the bits and pieces it needs to show you the final result, you want to have some ID to actually know what is going on. And in Spring Boot, the implementation, or component, is called, I have no idea how to pronounce it, Sleuth.
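Pulling Sleuth in is, again, mostly a dependency change. A sketch for the Spring Boot 1.x / Spring Cloud era; exact coordinates and version management are assumptions:

```groovy
// build.gradle (sketch): Sleuth for trace/span IDs, plus the Zipkin starter
// if the spans should also be reported to a Zipkin server later on
dependencies {
    compile 'org.springframework.cloud:spring-cloud-starter-sleuth'
    compile 'org.springframework.cloud:spring-cloud-starter-zipkin'
}
```

With that in place, each log line carries a `[appname,traceId,spanId,exportable]` prefix, and with the JSON encoder those become searchable fields in Elasticsearch.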
So what I'm doing is adding that to my application, and it will then generate one unique ID for each request in total. A request comes in at the front, propagates through your entire system, and keeps this trace ID. And then there is a second ID, which is per segment: every single component you call gets its own unique span ID. What that will look like: this is the log message you have, and again it's all output to JSON, and we will have it in Elasticsearch. So if you have the trace ID of something that went wrong, you can simply look up the trace ID, and you will see exactly all the log messages for that specific request. And what you can see here is: my application is called microservice-monitoring; you can give it that global name. Then there is the ID for the entire trace, which is the same as for the previous request; then there is the per-request part; and I'm now forwarding the result to something that actually shows me the timing of that. So that lets you know, for each request, how it's propagating through the system, and you can actually see what is going on there. The first one is the application name, then the trace ID, then the span ID, which is basically like a method call. And if you want to send it to Zipkin, Zipkin is then the final piece. You can start your own Zipkin process; you need to enable that. We're building our whole thing again. Zipkin is its own application then, which I have opened here. And in Zipkin, ah, crap, I have killed my process; that's not helping. What Zipkin can do is: since it has the ID of each request, it knows how long the entire request took throughout your system, and how long each single step took within your system. Zipkin can be run as a Spring Boot application as well, so this is just a second application that is running, and your first application just reports every request it's doing.
It sends it over to that server, which stores that information and automatically calculates how long each request took for you. And it can visualize that. By default it stores everything just in memory, but it has a backend for Elasticsearch as well, so you can store everything in Elasticsearch there. So now I have it up and running, and yeah, there it is. If I say, I don't know, between nine o'clock and 11 o'clock, let's see if we have any traces. No. Let's see, eight o'clock. Since it's only in memory, I would need to generate a new request. But what you will get is a call history, where you can actually see: this was the entire thing, and this was each component of that call, and you can see what is actually contributing to that call. Since I'm out of time, we're not doing that. The example I generally show is: I add a random delay, up to two seconds, to one call, and in Zipkin you can then actually see in which component we added that random wait, and how long it took for one specific request. That is a very easy way to pinpoint, if you are calling lots of services, what is actually slow. Only knowing that your entire request is slow doesn't really help you; this really shows you which of your specific method calls is slow. Yes, it is. So what it is doing is calculating the difference: it knows the trace ID, and then knows, OK, this took that long, this took that long, so it extracts that. You could build that yourself, but Zipkin provides that for you, and it actually visualizes it as well. No, no, Zipkin has multiple backends. The default is in memory, which is useless, but easy. It can use Cassandra and Elasticsearch as backends. Yeah, it's just a specific visualization tool, and it actually does the calculation for you as well. But it is an interesting point.
It might be possible to write a Kibana application, we do have apps there, and integrate that into Kibana, which is an interesting idea. Not sure if we would ever do it, since Zipkin is already providing that for free, but as long as we have the data, of course, we could, yes. Good, thank you. So something we don't have is annotations in Kibana, which would be super useful: if you deploy, you could mark that you deployed here, and then you see, ah, suddenly more errors, more latency, less latency, whatever. There is an open issue for that, forever; we don't have it yet. Yes, what are you missing? We can leave that for discussion. And just to conclude: we've monitored the application environment; we have logs, uptime, application metrics, and the tracing of each call; and we store all of that in the Elastic Stack, so that was kind of nice. So: stuff you're missing, general questions, anything I should demo while I have the big screen? Anyone? Ah, sorry, yes. So the question was: is there some automatic notification if a specific condition is met? Yes, that is how we make money. This is a commercial extension. All our commercial extensions are packed into something called X-Pack, and that component is called Alerting. You can just define rules, and it can do stuff like ping me on Slack, page me on PagerDuty, send me an email, call any random webhook. But that is a commercial extension, since it is a very nice feature, and somebody needs to pay my salary as well. Yes. So the question was: what is the overhead of running the agents inside containers? The point is that you should not run the agents in your containers, actually. Normally you should run them as a sidecar, so you have a dedicated container running Metricbeat and Packetbeat. There will be some overhead, so you should not just collect everything if you're not interested in it.
You can narrow down pretty fine-grained what you actually want to have. For example, for the processes, by default it's just dot star, a regular expression matching every process. But if you're just interested in Java processes, for example, just put in java and only collect for the Java processes. Or, for the network information, throw out the flows if you're not interested in them, and just do HTTP, or remove DNS. We support lots of other protocols, so Postgres, MySQL, MongoDB; we can parse the wire protocol of all of those and store the relevant information. But if you don't need it, don't do it; it is an overhead. I know that some very high-performance customers have their production set up, for the network traffic, with a wire tap to actually fork the network traffic over to another instance, and that instance is only monitoring the traffic. So only that instance is running Packetbeat, and it's not impacting the production system. Yes. So the question was, thank you, the question was: if you're not using Java, what are the options, especially for dynamic programming languages? In general, we provide drivers for all the languages; the question is how well they're integrated into something like this. For example, the logging is not really Java-specific: the logging was just, I have JSON and I'm collecting the JSON files. You can do that with anything; for any programming language that can write out logs as JSON, it's easy. If you just have regular lines and you need to parse them, it's much more cumbersome, especially if you have multi-line statements like stack traces; those are a pain in the ass to parse. So that is not programming-language-specific. You will need something to produce the trace ID, but as soon as you have the trace ID, and you have it in the log file again, that's again not really specific. I just picked Spring Boot because you only need to write something like 10 lines of code to put everything together and run it, because it's just very well integrated.
But there is nothing particular to Java or anything like that. As soon as you have the trace ID, and you can somehow log it out and Elasticsearch can collect it, you are good to go. And as soon as you have some REST endpoint providing metrics about your application, maybe there is nothing predefined to collect exactly those metrics, but it's very easy to customize: just take this as my endpoint, collect these messages, store them every 10 seconds into Elasticsearch, and then I can visualize it. So that's super easy. One final question, within Elasticsearch, OK? So the question was: how easy is it to change the log level? I don't think that is really on the Elasticsearch side; that is more on the application side. For log files, the agent is just collecting a log file; we are writing a file with JSON out. So if you have a REST call to your application for that, you can do it. I was just confused, because in Elasticsearch itself we have a REST endpoint, and you can change the log level for everything via a REST call, so you don't need to restart the nodes, which is how it should be. OK, thank you very much. I have lots of stickers here; if you want stickers and swag, see you around. Thanks.