Yeah, hi everyone. Can you all hear me at the back? All right. So yes, thanks for having me. My talk today is about monitoring using Riemann and Clojure.

Just a bit of info about myself: my name is Eugene, and I've been working as a software developer for about eight years now. I'm really interested in all kinds of programming languages, especially functional languages. The first book that brought me into the world of Lisp was this great book by Conrad Barski called Land of Lisp. It's about learning Lisp while programming games, and there are lots of comics in it.

I work at a company called Radioactive. What we do is work with radio broadcasters to bring their FM streams onto the internet. We also offer additional services like listener analytics and in-stream advertising, and we manage the broadcasters' mobile apps and web players for them.

So what is Riemann, exactly? It's an event stream processor, written by Kyle Kingsbury. In my opinion, Riemann is a very interesting and unique piece of software, and since adopting it, it has become the key component in the monitoring stack we use at Radioactive.

Why would you want to choose Riemann? Firstly, it fully embraces Clojure. Riemann itself is written in Clojure, and to configure Riemann you write real Clojure programs. So if you're looking for a way to work Clojure into the existing stack at your workplace, this would be one way to do it. Riemann also comes with a fantastic API for making your stream processing code really concise yet readable. And it's fast: it makes full use of Clojure's concurrency primitives to process up to millions of events per second on a single hardware instance. It's also efficient: it uses Protocol Buffers, a binary protocol, and you can batch events up when sending them to Riemann. Finally, it's extremely versatile, because you have the full power of Clojure at your disposal; you can adapt Riemann to whatever the needs of your organization are. And Riemann ships with lots of outputs: it can send events to Elasticsearch, InfluxDB, Graphite, and so on (a small config sketch follows at the end of this passage).

Right. To give you some background on the monitoring requirements we had at Radioactive, and what led us to Riemann, I'd like to share a bit about what we run. At Radioactive we run a mix of bare-metal and cloud servers. In the past, we had single-purpose hosts running monolithic applications. Over time, we broke these down into microservices running in multiple containers on the same host, and eventually we adopted Kubernetes. So now we no longer know where our applications are running; it could be anywhere in the cluster. Just a quick show of hands: how many of you know what Docker is? All right. And Kubernetes? That's pretty much the same people. Okay.

In the past, our monitoring setup mainly relied on two tools, namely Nagios and Ganglia. For those of you who have never heard of Nagios, it's pretty much the great-grandfather of monitoring tools. It's primarily a centralized polling system: it executes service checks either on the Nagios server itself or on your monitored hosts by way of an NRPE agent, and depending on the results of those service checks, it may send an alert to your notification system.
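To make those outputs concrete, here is a minimal sketch of what wiring two of them up looks like in a riemann.config file. This is not the speaker's production setup; the host names and the InfluxDB database name are placeholder assumptions.

```clojure
;; a minimal sketch of wiring up two of Riemann's outputs; the host names
;; and database name are placeholder assumptions, not a production setup
(let [graph  (graphite {:host "graphite.example.com"})
      influx (influxdb {:host "influxdb.example.com" :db "riemann"})]
  (streams
    ;; every incoming event is forwarded to both time-series stores
    graph
    influx))
```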
So Ganglia, unlike Nagios, is less concerned with running service checks and more concerned with measuring the performance of your systems over time. Ganglia agents on each host continuously gather performance data and report to the designated leader within that cluster. Then you have a central server that continuously polls for the state of the whole cluster. Finally, there's a web app that renders graphs from all your time series, and you use these graphs to spot problem trends, do capacity planning, or just feel good about how many requests your system is handling.

So what kind of limitations did we find with Nagios and Ganglia? Mainly, our problems were with Nagios. Writing alerts in Nagios is pretty limited: you can only operate on a single value at a single point in time. Nagios only cares about the state of your system at a specific moment, and if you want your alert to take previous historical performance data into account, that's not possible. Ganglia, on the other hand, is able to store and process all of this data, but accessing Ganglia performance data from within Nagios is a pretty clunky process. And each service check in Nagios is pretty much independent: from one service check, you can't access the state of a different service check.

This need for more flexibility in our monitoring led us to our newer model, which we call push-based monitoring. In this model, we have collection agents running on each host, gathering metrics and pushing them as events to Riemann. Riemann processes these events and decides where they should go. In our case, we have set it up so that it sends exceptional events to our alerting system, and performance events to Graphite for long-term storage. One nice benefit of this push-based style is that there's no need to open a network port on your hosts; in the case of Nagios, you need to open a port so that it can poll your hosts. There's also less configuration: you just create new hosts or remove hosts, and there's no need to update Riemann on the state of your cluster.

I'd like to talk a bit about Riemann's data model. Riemann events are as simple as it gets: they are simply Clojure maps, so you can use all your regular Clojure functions to slice and dice events in Riemann. There is a standard set of fields, such as the host, the service, and the state, but you're not limited to them; you can add as many arbitrary fields as you want and make your events as rich as possible.

Here's what writing stream processing code in Riemann looks like (a sketch follows at the end of this passage). I have a visual representation of this code. Basically, you have a predicate function that filters for events tagged with "production", and then you split them into two child streams. The first stream sends all events to Graphite for long-term storage. In the second stream, if an event's metric is greater than a certain threshold, we send an email to the contact list. And notice that both child streams see the very same event: with immutable data, you don't have to worry about the state of an event changing under your feet.

Before getting to even more interesting use cases of Riemann, I wanted to explain that Riemann streams are actually just Clojure functions that take a single parameter, the Riemann event, and you can pretty much do whatever you want within these functions.
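Here is a hedged sketch of both ideas: an event as a plain Clojure map, and the tagged/split stream just described. The service names, the threshold, the custom field, and the addresses are placeholder assumptions, not the exact code from the slide.

```clojure
;; an event is just a Clojure map: a few standard fields plus anything extra
(def example-event
  {:host    "stream-01"
   :service "listeners"
   :state   "ok"
   :metric  217
   :time    1509260400
   :ttl     60
   :tags    ["production"]
   :station "fm-883"})   ; arbitrary custom field

;; the split described above; hosts, threshold, and addresses are assumptions
(let [graph (graphite {:host "graphite.example.com"})
      email (mailer {:from "riemann@example.com"})]
  (streams
    (tagged "production"
      ;; child stream 1: everything goes to Graphite for long-term storage
      graph
      ;; child stream 2: the same immutable event, checked against a threshold
      (where (> metric 0.9)
        (email "oncall@example.com"))
      ;; child stream 3: a stream is just a function of one event
      (fn [event]
        (println "saw" (:service event) "from" (:host event))))))
```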
Inside such a function you could, for example, store the event in a database.

So, some more interesting use cases of Riemann. You can generate new types of metrics. Rather than a simple threshold on the CPU, you can instead gather up the last ten seconds of events for a service, extract the median value, and alert based on that median. This lets you smooth over outliers: if you have a CPU graph that's really spiky, you don't necessarily want an alert every time it spikes. In addition to extracting the median, you can also extract further percentiles, in this case the median and the upper percentiles, send them to Graphite, and produce new types of graphs that were not previously possible.

One more use case. To give a motivating example of what we're trying to do: we want to detect anomalies in our system. At Radioactive we have a load balancer and various hosts, each serving a given number of listeners at a time. Say at this point in time we have four hosts with 200 or so listeners each, and then a fifth host with no listeners. That's an obvious problem, but it's difficult to write a check for it, because the number of listeners can legitimately drop to zero over the course of a day, and you wouldn't want to be alerted on that.

To deal with this in Riemann, you first have to know that Riemann also has this thing called the index, which is an in-memory store of the latest events Riemann has received. What we're doing here is a lookup on the Riemann index to pull the latest state of all the hosts that Riemann knows about, and we can then do some kind of statistical analysis on whether the value we're seeing is an anomaly or not. The code on this slide is incomplete, but what we could do is take the current value along with the values from all the other hosts, compute a standard deviation, and if this one value is more than two or three standard deviations out, then hey, you've got an outlier. (A fleshed-out sketch follows at the end of this passage.)

This is my last slide. To wrap up the talk, I just want to share some other cool things you can do with Riemann. You can track the state of your services over time. You can change your stream to only show the differences between each data point. You can forward events to other Riemann instances. Each event comes with a TTL value, and if a host stops sending events, Riemann will detect that and generate an expiry event, which you can use to alert that the host is down. You can also send events to Riemann to indicate that you're doing maintenance on a host. And you can query the index to gather additional context: for example, if you've got a CPU alert on a host, you could query Riemann's index to gather not just the CPU usage but also the memory usage, the number of connections, and the amount of disk space left, and pack all of that into your alert message. Because when you're getting woken up at 3 a.m., you just want to have as much information as possible; you don't want to go digging around for it. And yes, Riemann supports full REPL-driven development, and it has great support for writing unit tests as well.
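Filling in the incomplete outlier check described above, here is a hedged sketch. The service names, window sizes, the three-sigma cutoff, and the alert action are all illustrative assumptions, and the index lookup uses the (:index @core) pattern rather than the speaker's exact slide code.

```clojure
(require '[riemann.folds :as folds])

;; hosts and addresses below are placeholder assumptions
(let [graph (graphite {:host "graphite.example.com"})
      email (mailer {:from "riemann@example.com"})]
  (streams
    ;; alert on the median of the last 10 seconds, not a single spiky sample
    (where (service "cpu")
      (fixed-time-window 10
        (smap folds/median
          (where (> metric 0.9)
            (email "oncall@example.com")))))

    ;; emit the median and upper percentiles as new services for Graphite
    (where (service "api latency")
      (percentiles 10 [0.5 0.95 0.99]
        graph))

    ;; outlier detection: compare each host's value against all the hosts
    ;; currently in the index
    (where (service "listeners")
      (fn [event]
        (let [peers (riemann.index/search (:index @core)
                                          (riemann.query/ast
                                            "service = \"listeners\""))
              xs    (keep :metric peers)
              n     (count xs)]
          (when (> n 1)
            (let [mean (/ (reduce + xs) n)
                  sd   (Math/sqrt (/ (reduce + (map #(Math/pow (- % mean) 2)
                                                    xs))
                                     n))]
              (when (> (Math/abs (double (- (:metric event) mean)))
                       (* 3 sd))
                ;; placeholder action; in practice you'd send an alert
                (prn "possible outlier:" (:host event) (:metric event))))))))))
```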
So that's the end of my talk. If you're interested in the kind of monitoring infrastructure model that we use, and in how to use Riemann in general, I strongly recommend this book by James Turnbull called The Art of Monitoring. Thanks. Are there any questions?

Mostly you'd be monitoring infrastructure, right? It's a monitoring tool. I'm not sure whether Riemann can handle, for example, application status, like APM? We're making games, and we want to know, for each game, how much time players are spending in it. Right now we're using Prometheus, but we have to inject the logic we want ourselves.

Okay, so with Prometheus, can you write arbitrary code? What programming language does it use, or is it its own configuration format?

We actually put the logic into our code on every node, so it pushes the metrics from our side. I think it's called semantic monitoring: basically, the application has the ability to push its internal state to the external monitoring tool. For example, if you have your own application state, and you want the external system to know how many connections there are.

We do send custom application information like that, because all you need to do is pull Riemann's client library into your application and then you can send events. And the cool thing is that you can attach as many attributes as you want to the data model (there's a small sketch of this after this passage).

So it's a similar approach, then. Is it pull-based, or do you push? You push. Okay, yeah.

How does the application actually connect with Riemann? It pushes the data, right? So does it have to connect to Riemann's API somehow? Riemann just opens a TCP socket, so any component that wants to send events to Riemann connects to that socket and sends Protocol Buffers messages; that's Riemann's wire protocol. So you still need to put the logic inside your application or service code? Yeah, but there are a lot of client libraries available for Riemann, including Java, Go, and so on. So you collect the data or status and then send it to Riemann? Yeah, you can definitely do that.

Can Riemann also work with blob data? Blob? Like a screenshot? Uh, yeah, I don't see why not, but what use case would you have for blob data? If something crashes, you want to take a screenshot. Right, right, take a screenshot before it crashes. Yeah, you can definitely do that, because, as I mentioned, with Riemann events you can attach arbitrary fields, and these fields are actually encoded as strings. So you'd essentially be encoding your JPEG image into a string, and you'd probably want Riemann not to process it but just pass that string on to your alerting system or whatever. Yeah, you can do that as well.

How do you keep Riemann running 24 hours a day? For example, if I update my configuration, will it keep monitoring through the update? Okay, so are you asking how you update the Riemann configuration while it's running? As I mentioned, Riemann keeps an in-memory store of all the events, so when you restart Riemann, you essentially lose the index. The way you use Riemann should be such that you don't depend on long-lived events in the index. Riemann doesn't currently have the capability to persist the index, but I don't see anything stopping you from creating, say, a Redis backend for it. In addition, Riemann has experimental support for reloading its configuration without restarting the server, but I've found that to be a bit problematic.
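To make the client-library answer concrete, here is a minimal sketch of an application pushing a custom event with the Clojure client (riemann-clojure-client). The host, port, service name, and the custom attribute are placeholder assumptions.

```clojure
;; a minimal sketch with riemann-clojure-client; host, port, and all
;; event fields here are placeholder assumptions
(require '[riemann.client :as riemann])

(def client (riemann/tcp-client {:host "riemann.example.com" :port 5555}))

;; send-event is asynchronous; deref with a timeout to wait for the ack
(-> (riemann/send-event client
      {:service "game.session.minutes"
       :metric  42
       :state   "ok"
       :ttl     60
       :tags    ["production"]
       ;; custom attributes ride along as strings on the wire
       :game-id "asteroid-rally"})
    (deref 5000 ::timeout))
```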
Does Riemann have a network REPL? Yes, you can connect to a live, running Riemann instance. What's pretty cool is that we have this Riemann instance, and I can write arbitrary queries from my desktop, send them to Riemann, and extract information about how many hosts there are and what kinds of events are in Riemann's index, from anywhere. And you don't have to use the network REPL; there are also lots of command-line tools for making queries against your live Riemann index (a small query example appears at the end of this transcript). But yeah, you can definitely use the REPL as well.

How do you make it secure? If you open the REPL, does that give you a sandbox? To deal with security, I believe Riemann supports client and server certificates, so any communication would go through a secured connection using those certificates.

Do you think these functions overlap with other monitoring tools? For example, the alerting: you already send events to InfluxDB or some other database, and those tools are already recording the state, so even if your Riemann instance goes down, the state is still persisted on their side. Yes, you would use InfluxDB to store performance data about your cluster over time, but you can't do much arbitrary computation on those events there. That's where I see the use case for Riemann: writing really intelligent alerts, and being able to attach lots of context to the alerts you're sending out. So, thanks.
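For reference, the remote-query workflow described above looks roughly like this with the same Clojure client as in the earlier sketch; the host, port, and query string are placeholder assumptions.

```clojure
;; a sketch of querying a live Riemann index remotely with
;; riemann-clojure-client; host, port, and the query are assumptions
(require '[riemann.client :as riemann])

(def client (riemann/tcp-client {:host "riemann.example.com" :port 5555}))

;; query is asynchronous; deref the result to get the matching index events
(-> (riemann/query client "service = \"listeners\" and state = \"ok\"")
    (deref 5000 ::timeout))
```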