 Thank you all for coming. This talk is "Measuring your Elixir Application". My name is Renan Ranelli, but people call me Milhouse, which is probably much easier for you to pronounce than Renan. I'm a senior software engineer at Xerpa, where I work full time with Elixir and ClojureScript, and I've been doing that for about a year now. I'm going to talk to you about measuring stuff. Before that, I worked with a whole lot of different languages; I spent a very long time in the Ruby and C# worlds, so I've surfed a lot of very different ecosystems. The company I work for, Xerpa, is a Brazilian startup. I'm from Brazil, if you haven't guessed from my accent already. We're trying to do something like what Zenefits and those other HR-platform startups are doing here in the US. Here's our home page; we have a home page now. We're very good friends with the folks at Plataformatec, and when we heard that Elixir had hit 1.0, we decided that if it comes from them, it's going to be good stuff, so we started our company on this brand new, exciting language. We've been doing that for more than a year now, and we've been very happy with the experience. I'm going to talk about the experience we had building the system, how we started measuring stuff in it, and the things we discovered along the way. So, first of all: why should you care about what Milhouse is trying to say here? Why should you care about metrics? Let me try to convince you that this is important, and give you a line of reasoning. First of all, we write code, and we do so because it generates business value. Some of us do it because it's fun; I certainly know people who do this only because it's fun. But most of us need to pay the bills, so we do this because it generates business value. And that business value is generated only when our code runs, not when it's written.
If you only have written code sitting in a corner somewhere, it's not generating business value. In order to generate that business value, we need to make good decisions, and in order to make good decisions about anything at all, we must understand it. Our code generates business value when it's running, so we need to understand our code when it is running. And we want to make good decisions because we like getting paid; you don't want to pay someone who makes bad decisions. But there's a problem with that: it's very hard to reason about code. So I want to propose a challenge and ask you which implementation of this weird dispatch function seems to be faster. The dispatch function receives a list of data. The first implementation maps over the list, spawning an async task per element to do some kind of IO, then waits for all those async tasks to finish and collects the results. Let's say this IO is disk-based IO. The second implementation does the IO synchronously, but in bulk. Anyone who has worked with databases and disk-based devices knows that doing stuff in bulk is usually better in terms of throughput, because you can take advantage of aligned writes and things like that. So I would argue that doing bulk IO should be faster. That's a model I'm constructing in my head about what seems to be faster. But the actual answer is: we don't know. We don't know which implementation is faster, because maybe there's a one-minute sleep sitting right next to that bulk IO call. Just by looking at the code, we can't actually say how it will perform. The mental model we have of our system is always flawed; we can never build a perfect model of it. But there's still hope. How can we deal with this sort of problem? Well, we measure stuff.
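The two implementations from the slide aren't reproduced in the transcript, but they look roughly like this (a sketch; `Dispatch`, `do_io/1`, and `do_bulk_io/1` are my own names standing in for whatever disk-based IO the slide used):

```elixir
defmodule Dispatch do
  # Implementation 1: spawn one async task per element, then await them all
  # and collect the results.
  def dispatch_async(list) do
    list
    |> Enum.map(fn item -> Task.async(fn -> do_io(item) end) end)
    |> Enum.map(&Task.await/1)
  end

  # Implementation 2: a single synchronous bulk operation...
  # ...with a one-minute sleep hiding right next to it.
  def dispatch_bulk(list) do
    :timer.sleep(60_000)
    do_bulk_io(list)
  end

  # Stand-ins for the actual disk-based IO.
  defp do_io(item), do: item
  defp do_bulk_io(list), do: list
end
```

Wrapping either call in `:timer.tc/1`, which returns `{microseconds, result}`, is one way to produce the wall-clock numbers shown next.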
And here is the result of timing both implementations. You can see that the first one finishes in 0.3 seconds of wall-clock time and the second one in 60 seconds. So we should probably remove that one-minute sleep; as you can see, the remainder of those 60 seconds, 0.06 seconds, is actually faster. In order to understand the business value we are generating, we need to see our code when it generates business value: we need to see it when it runs. This, my friends, is science. You build a model, you devise an experiment, you run the experiment, collect the results, and refine your model. I like to say that this is "changing stuff and seeing what happens". That's what we do all the time, and that's why I love being a software developer: we get to change stuff and see what happens. This was a very simple and contrived example of why it's important to measure stuff, but you can imagine that modern applications have lots and lots of different components, and it's getting increasingly hard to understand what our system does and how it behaves through diagrams and the like alone. We need to see our code for what it is when it's running. Now I'm going to talk about some types of metrics and how we look at them, cover some important aspects of metrics, and then show how we can use them in Elixir applications. First, the metrics we're interested in are time series. By time series I mean something that varies with time. You've probably seen graphs like this: there's some value, and it varies with time. That's why it's called a time series. And the first metric we all hear about is the average value. Everyone talks about averages. The average value is defined by this boring math I won't go through, but you can explain it with a picture like this.
There's a value whose rectangle covers the same area as the area under that curve. Those of you who took a calculus class in school or college are probably familiar with this picture. The most important thing here is that when you talk about an average value, you need to talk about the period of time the average refers to. As we can see in the picture, though it's not very easy to see, there's a beginning and an end to the period the average refers to. If we fix the beginning and let the end move forward with time, we see a picture like this: the average value tends to plateau around a specific value, and we completely lose the sense of recency and the variation in the data. How many of you measure your things in production, with something like New Relic? Quite a lot of people. If you were to do this in your platform of choice, fix the beginning of the averaging window and let time go on, you'd see something like this. We completely lose the sense of variability and recency, which matters a lot when you're collecting metrics, because we don't actually care how the response time of this action was one month ago. I'm not really interested in that; I'm interested in whether the response time of this action is 10 times higher than it used to be, which probably means there's a problem, or that we introduced some change that caused a performance regression. We need to keep a sense of recency, and to do that we can hack the average value formula with that extra term there to get a weighted average: we give more weight to the more recent values. Imagine the lighter line is the raw metric and the bright red one is the average; you can see the average tends to follow the original value, so it retains a bit of recency, and you can tune the exponential decay to control how much of it you retain.
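The weighting trick can be sketched numerically. This is a minimal exponentially weighted moving average (module and function names are my own); `alpha` is the decay knob, and the closer it is to 1, the more closely the average follows the raw metric:

```elixir
defmodule Ewma do
  # Exponentially weighted moving average: each new sample x gets weight
  # alpha, and the accumulated history decays by (1 - alpha).
  def smooth([first | rest], alpha) do
    [first * 1.0 | Enum.scan(rest, first * 1.0, fn x, acc ->
      alpha * x + (1 - alpha) * acc
    end)]
  end
end

Ewma.smooth([0, 10, 10, 10], 0.5)
# => [0.0, 5.0, 7.5, 8.75]
```

Notice how a sudden jump in the raw samples shows up immediately but is damped, which is exactly the "follows the original value while retaining recency" behavior described above.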
So it's very important to look at that. Another important aspect of metrics I like to talk about is the distribution. Who in here has never seen a picture like this? Everyone has; this is a histogram. There's also boring math behind it. I'm only showing this because I want you to know there is theory behind it, but you don't need to know the theory to get the job done. When you look at a histogram, many things become very clear from the picture. Imagine this is a measurement of the response times of some web application. You can see there are two very distinct populations of data: a lot of requests tend to respond in around 50 milliseconds, and then there's another population that tends to take more time than that. And you can see that around 250 milliseconds there's a huge spike and nothing after it, which probably means we're hitting some timeout value. Just by looking at this picture we can devise a lot of models of what might be happening in the system. We're seeing our system for what it is when it's generating business value, when it runs. When you tabulate histograms, they tend to look something like this. You don't actually report pictures to a metrics system; you extract some values from it, like the mean and the percentiles. Who in here doesn't know what a percentile is? Well, the 99th percentile is the value below which 99% of all the other values in your sample fall, with the remaining 1% above it. The median splits the values 50% to the left and 50% to the right; the percentile is a generalization of that concept. So when you collect and report what are called histograms, you're actually collecting and reporting percentiles and the like. Histograms are like averages on steroids.
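To make that definition concrete, here is a nearest-rank percentile in a few lines (this is one of several common percentile definitions; real metrics libraries often interpolate between samples instead):

```elixir
defmodule Stats do
  # Nearest-rank percentile: the smallest sample such that at least p%
  # of all samples are less than or equal to it.
  def percentile(samples, p) when p > 0 and p <= 100 do
    sorted = Enum.sort(samples)
    index = Float.ceil(p / 100 * length(sorted)) |> trunc()
    Enum.at(sorted, index - 1)
  end
end

Stats.percentile(Enum.to_list(1..100), 99)
# => 99
```

With `p = 50` this degenerates to the median, which is the "50% to the left, 50% to the right" case mentioned above.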
But just as with averages, you need to think about which period these percentiles refer to. I'll talk about how this shows up in tooling later. Another point is resolution. This is a sine wave, and it's adequately sampled: from the samples you can see the shape of the curve. But imagine I sampled it three times more slowly; I'd get a picture like this. As you can see, we're losing lots of peaks between the little dots, which are the samples, and we end up thinking that our data varies much more slowly than it actually does. So when you measure stuff, you have to take care to measure at a resolution that shows what is actually happening. You don't want to lose peaks and that kind of thing. So, I've talked about all of this; how do you get started in the real world? How does this translate to your day-to-day? If you want to use metrics in your day job, you have to solve four specific problems: how you collect metrics; how you store them, because you want to query, analyze, and transform this data; and then how you visualize it and make decisions. If you only collect data but don't store it, you lose very valuable information. Now I'll show how we do our metrics collection at Xerpa and explain why we chose this set of technologies. I'll go backwards here. We visualize our metrics with Grafana. Who in here has heard of Grafana? Oh, that's surprising. Grafana is very beautiful and very easy to use. We store our metrics in InfluxDB. Who in here has heard of InfluxDB? Fewer people. InfluxDB is a database tailored for time series data: it is made to store, query, analyze, and transform time series, and it has lots of features that let you tune the samples and so on.
And we have four major metric collection points. We use HAProxy to serve assets and the like; we collect its logs, parse them, and report them to InfluxDB. We use collectd; who in here has heard of collectd? collectd is a daemon that collects machine information like disk usage and memory, the rawer, more coarse system-level metrics. We have our front end in ClojureScript, with a very simple API the front end reports metrics to. And we have our Elixir/Phoenix back end, which is where everything in Xerpa lives. We mostly have these two components: the front end in ClojureScript and the back end, everything, in Elixir. On the Elixir side we use Exometer and Elixometer. Elixometer is actually just a front end for Exometer, a library that gives you various tools to collect and report data; I'll talk more about that. We chose Exometer and Elixometer because they were very easy to set up and very well documented: we tried them in an afternoon and everything worked fine, so we went with that. We don't have scalability issues and the like yet, so we didn't overthink it; since the solution worked, we chose it. We had used collectd extensively in the past, and it's usually best to use something you already know. InfluxDB was very, very easy to set up and configure. It is free of charge. It has built-in retention policy configuration, so you get to say how long you want to keep the data and at what resolution, and all of that is very easy to set up. It has an awesome, SQL-like query language. It has a feature called continuous queries, where you write a query that runs continuously to transform and analyze data on the go, which spreads out the load on the storage server. It's very nice, and it plugs seamlessly into Grafana.
It's very easy; we'll see just how easy in a couple of minutes. And we had used Grafana in the past. It's very beautiful, as you'll see. You might ask why we're not using a solution like Splunk, Datadog, or one of those SaaS offerings. Who in here uses New Relic or Splunk? I think that's probably the go-to solution for most people. First, we don't like our data being outside of our walls. And since we're dealing with HR stuff, there are lots of NDAs and legal obligations; we're not actually allowed to send customer data to third parties. So that's not really an option for us. That's why we built this architecture and are doing everything by hand, so to speak. Now, a quick overview of what Exometer does. How did this slide break? This design is prevalent in most metrics collection libraries. You have high-frequency metrics reports coming from the application. There's a layer that stores them in memory and updates an aggregated view of the data. For example, you want to report how long a controller action took to finish; you don't actually want to report every single data point for every single request, because that might overload the storage. So you pre-aggregate, and a tool like Exometer gives you enough room to configure and change that. And there's a reporting layer that collects this buffer, adapts it, and formats it the way the various different backends expect their data to arrive. There's an InfluxDB Exometer reporter that we use; that's one of the reasons we chose Exometer, because there was already a reporter built for us. This design is present in Java metrics libraries and in Ruby metrics libraries; it's very common. Other tools like StatsD or Telegraf could fill Exometer's role in our case: use the one you're most comfortable with. It is very important, and I want to emphasize this, to be able to control the frequency at which your storage receives requests.
You don't want to overload it, and you don't want to spend time thinking about scaling your metrics collection infrastructure. That's probably why many people go for hosted solutions: you don't need to think these problems through. Now I'll show a very simple demonstration of how we can measure all the requests going through a Phoenix app. There are step-by-step instructions in this blog post. I recorded the demo because I'm not brave enough to do it live, so bear with me. We're starting a new Phoenix project. I haven't edited the video, so it looks more real; pretend I'm typing here. We're starting a new project called "metrics", everything live. Now we need to configure Exometer and the reporter, so we go into mix.exs and add the dependencies there. There's some hackery you need to do to get the dependencies right, because some libraries have conflicting declared dependencies, but everything works; we've been using this in production for over six months, so believe me, it's safe to do. There's this amazing feature in Mix, one of the things I liked most when I saw it: the override: true option, where you can say "screw your dependency resolution algorithm, I want you to use this version". Who here hasn't spent hours debugging version issues, right? After we get the dependencies in, we need to add Elixometer to our list of applications, standard procedure. And in config.exs we add the configuration for Elixometer and the Exometer reporter; all you have to do is copy, paste, and change whatever is different in your case. We're pointing it at an InfluxDB database called "elixirconf"; we'll see more of that afterwards. Well, I tried to compile, but it won't compile, because I forgot to change my directory and forgot to fetch the dependencies. You can see it's more real when there are errors involved, right?
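From memory, the config.exs additions looked roughly like this. The exact keys belong to Elixometer and the exometer_influxdb reporter, so treat this as a sketch and check those libraries' READMEs rather than trusting it verbatim:

```elixir
# config/config.exs (sketch; key names and values are assumptions from memory)
use Mix.Config

# Tell Elixometer which Exometer reporter to feed.
config :elixometer,
  reporter: :exometer_report_influxdb,
  env: Mix.env(),
  metric_prefix: "metrics"

# Point the reporter at the local InfluxDB instance and the
# "elixirconf" database created later in the demo.
config :exometer_core,
  report: [
    reporters: [
      {:exometer_report_influxdb,
       [protocol: :http, host: "localhost", port: 8086, db: "elixirconf"]}
    ]
  ]
```

The copy-paste-and-tweak workflow described above is really just editing the `host`, `port`, and `db` values here for your own setup.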
So now it will probably compile, along with all the dependencies, and it will take a couple of seconds; bear with me. Am I speaking too fast? It's very weird, because when you're speaking in a different language you don't know whether anyone is understanding anything. Now we'll start Docker containers with Grafana and InfluxDB. I'll start a container with InfluxDB version 0.10; there's an error there. I have already downloaded the images, so you won't need to wait. And the same thing for Grafana. That's not actually what we do in production, because we don't use Docker in production, for various philosophical reasons I won't go into. So, come on. We've started the services, and here's Grafana. It is live, and it is very securely protected with admin/admin credentials. So it's working, right? To set up a data source for InfluxDB, we choose the type of the data source and give it a name, credentials, and so on. That's all it actually takes to connect Grafana to InfluxDB. And there's a mistake there that isn't shown: you do need to put in the credentials. There it goes: "elixirconf", the database we put in config.exs, and the data source is added. When you click "test connection", if it's green it means it worked; it's very nice of the configuration screen to be able to do that. Now we actually need to create the "elixirconf" database, much like we would with a relational database. InfluxDB exposes an HTTP API, so we just use curl here to send the query. All right, it's a 200, which means it's okay: the "elixirconf" database is created. And now we're going to configure our freshly created Phoenix app to measure every single request that passes through it, and make a graph of it afterwards. We're going to use Plug.
We create a metrics plug, and in web.ex we add it to the controller block. Anything you put in that block of quoted expressions under controller gets injected into every single controller in your app; that's a seam you have in Phoenix apps for customizing and metaprogramming things. Now we're going to actually create this metrics plug. Who in here is familiar with Plug in Phoenix? All right, so I won't need to explain much of it. In order to report metrics, we add that use Elixometer clause; that's all we need to do to get all of Elixometer's reporting functions for creating metrics and so on. It's very easy. Like every plug definition, we have an init; we just don't use the options. We'll annotate the start time of the request using Erlang's monotonic time, which gives us the instant, in milliseconds, at which the request started. Then, before sending the response back to the user, we'll compute the interval of time the request took. If you want a counter, all you have to do is call update_counter, giving it a name and the increment you want to apply. Using a histogram is pretty much the same: we'll define a metric name afterwards and put the request duration into that histogram. We compute the request duration with a function we haven't written yet, called elapsed_time, which gives us the elapsed time of the request in milliseconds. And as with every plug, you have to return the connection; since we're not changing anything in the connection, we just return it. So what we do is get the monotonic time at the end of the request, subtract the start time, and that's all. Now we need to create a name for this metric. We don't want all of our metrics reported under the same name, right? Every single action reporting to the same metric name? Phoenix gives us a very easy way to avoid that.
In Phoenix.Controller you can get the action name and the controller name out of the conn. It is very easy to get that information out of Phoenix; one of the things I like most about Phoenix is how explicit everything is. So the name of our metric is going to be the name of the controller plus the name of the action. But we forgot something: we actually need to compute this duration just before the request gets responded to, and we do that using the register_before_send function we have in Plug.Conn. Just before sending the response back to the user, it calls every function you pass to it. (What, sorry? Oh yeah, that's weird Emacs hackery, I hadn't even noticed, sorry. It's a ligature for fn. That happens when you use Emacs too long.) So that is all we need to do. Now we'll create a test action in the page controller just to show how this works. In the page controller we define a test action exactly the same as index, but with a :timer.sleep in there: we'll sleep for a uniformly distributed random time between zero and 500 milliseconds. Now we start our Phoenix app. If you see this "InfluxDB reporter connection success" message, it means it worked and it is now reporting Exometer metrics to InfluxDB. Now we access the freshly created route, and as you can see it worked: hello metrics, hello ElixirConf. And we can see in the logs that things are working, right? Now we'll write a very simple shell script that keeps making requests to our page, so we can graph them over time. I'll just accelerate things a bit here; this isn't ShellScriptConf. I have a pathological love for shell scripts, and I've figured out that most people don't share it. But let me just show what we're doing here. Just a second.
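Putting the pieces from the last few steps together, the plug looks roughly like this (a sketch: the module name and metric names are my own, and the Phoenix.Controller helpers assume a conn that has already gone through the router):

```elixir
defmodule Metrics.Plugs.Measure do
  import Plug.Conn
  use Elixometer

  def init(opts), do: opts

  def call(conn, _opts) do
    # Instant, in milliseconds, at which the request started.
    start = :erlang.monotonic_time(:millisecond)

    # Runs just before the response is sent back to the user.
    register_before_send(conn, fn conn ->
      name = metric_name(conn)
      update_counter("#{name}.count", 1)
      update_histogram("#{name}.duration", elapsed_time(start))
      conn
    end)
  end

  # "ControllerName.action", so every action gets its own metric name.
  defp metric_name(conn) do
    controller = conn |> Phoenix.Controller.controller_module() |> inspect()
    action = Phoenix.Controller.action_name(conn)
    "#{controller}.#{action}"
  end

  # Monotonic time at the end of the request minus the start time.
  defp elapsed_time(start) do
    :erlang.monotonic_time(:millisecond) - start
  end
end
```

It gets wired in with `plug Metrics.Plugs.Measure` inside the controller block of web.ex, and the test action is just index plus a `:timer.sleep(:rand.uniform(500))` before rendering.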
Come on. Well, yeah, that's right. What we're doing here is silently executing what we were doing in the browser, so we won't see output in the terminal, and we spawn a given number of curl processes and then wait for all of them to finish. So we run the script, and it will continuously keep making requests to our Phoenix app. We forgot to give it the number of concurrent requests we want to keep in flight. There it is. It's waiting, because we stopped our Phoenix app; we bring it back to the foreground, and now things start rolling. In the background you can see different numbers flying by for the request times, which means our sleep is working. Now we create a dashboard in Grafana to show these reported metrics. One of the things I love most about Grafana is the query editor: you don't actually need to know InfluxDB's query language, you can just point and click at stuff and it works. Reminds me of the time I was working with SQL Server. There you choose the data source, and we get the last five minutes of reported metrics, and you can select all the various parameters and everything. What we're looking at here: the green line is the mean time it takes to respond to requests, and the yellow line is the maximum... no, the 99th percentile. It's only logical that the higher the percentile, the higher the value. As you can see, the average time we wait for this request is around 250 milliseconds, which is what you'd expect from a uniform distribution between zero and 500, and the maximum value is 500 milliseconds. So it's doing okay. That's all I have for the demo. Now, other things you should be measuring in Elixir and Erlang applications. The Erlang VM is awesome; that's why we're all here.
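For a taste of what there is to measure beyond request times, a few raw VM gauges are one `:erlang` call away; these are the kinds of numbers a reporting library would push to your metrics store on a timer:

```elixir
# Raw, instantaneous BEAM gauges (no extra dependencies needed).
memory_bytes  = :erlang.memory(:total)               # bytes allocated by the VM
process_count = :erlang.system_info(:process_count)  # live Erlang processes
run_queue     = :erlang.statistics(:run_queue)       # processes waiting for a scheduler

IO.puts("memory=#{memory_bytes} processes=#{process_count} run_queue=#{run_queue}")
```

A persistently high run queue, for instance, is an early sign that the schedulers can't keep up with the load.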
There are lots of resources out there on how to measure, monitor, and operate it. Sadly, I don't have enough time to go in depth on that. I've tried to give a high-level overview of why it's important to collect metrics and how easy it is to get started with an architecture like ours; you don't need to ship your data off your premises. There's a very nice library called vmstats that helps you gather virtual machine metrics. Measuring stuff in the Erlang VM actually has lots of caveats; it's not like the other languages I've worked with. And there's a great book called "Stuff Goes Bad: Erlang in Anger" by Fred Hebert that gives invaluable information on how to profile and inspect a live production Erlang system. If you don't know Fred Hebert, he's the guy who wrote "Learn You Some Erlang for Great Good!", which is a great book, highly recommended. Some closing remarks as we finish up our time here. Find seams in your code where you can plug in measurements, like Plug itself. Try not to spread metrics reporting everywhere; it gets messy really fast. It's like logging: who in here hasn't worked on an application where there are logs everywhere and you can't see the damn business logic through all the logging? It's very easy to get to that point with metrics too, I say from experience. Try not to spread your metrics reporting everywhere; it's very hard to reason about the metrics if you do it that way. Try to separate high-level measurements from low-level ones. And this is just the beginning: once you have this infrastructure in place, you have to start doing analysis, correlation, and other boring math stuff. That's where the real gold is. Having a metric that you don't look at and that doesn't drive your decisions is not valuable at all. One of the things I like to do is use the metrics to generate alerts.
For example, if the response time of this action is taking more than 500 milliseconds, something's probably wrong, and you can plug that into PagerDuty or whatever alerting system you're using. It's very easy to do that with InfluxDB too: there's a product from Influx called Kapacitor that, I think, does exactly that. There's another, similar metrics collection architecture described in this blog post, which came out about a week after I finished making these slides. It's a very good read too, from the folks at Football Addicts. Zed Shaw has an awesome post about statistics and how we can use them; the title is "Programmers Need to Learn Statistics or I Will Kill Them All". It's a very entertaining read. And thank you. These are my contacts, and I think we have enough time for questions. Or we do not; I don't know. This was my first talk in English, and I was completely terrified; it's a very different beast, speaking in another language. Okay, I think it's working. [Audience question: How do Grafana and InfluxDB compare to Elasticsearch and Kibana?] Okay, I haven't used Elasticsearch and Kibana for this use case, but it seems likely that they fit the same role in reporting. Elastic has a suite of products, like Influx does, that gives you a batteries-included solution for alerting and the like. I think that with Logstash, Kibana, and Elasticsearch you'd solve the same problem we solved here with Influx, Grafana, and Exometer. So, use whatever you have experience with, whatever you feel is most appropriate. The main reason we chose this architecture is that when we started doing this a year ago, these libraries solved most of the problems for us. The landscape in the Elasticsearch community may have changed since then. Have I answered your question? Okay, I don't know much about Kibana, so sorry about that. Any more questions? I think that's it. Thank you. Thank you.