Can everybody hear me? Excellent. Hello, folks. I'm Siddharth, from Exotel, which is a cloud telephony platform. I'm here to talk about handling logs, events, and metrics using Heka, which is yet another framework, or tool, for the confused DevOps engineer.

The goal of this talk is to focus on the why: why we use Heka and what problems it can solve, rather than the how. I'm not going to get into much Heka configuration, or building plugins, or extending Heka, because those are all things I'm sure all of us can figure out. The focus is going to be on how we have used Heka to solve some problems that we think are important, especially in building distributed systems. Those are my contact details and my Twitter handle. Not that I tweet anything interesting, so you probably shouldn't follow me.

Before I go on with the why, the what, and the how, it's important to set some context. When you're building a distributed system, the services of that system throw off different kinds of data, over and above the standard business-specific request-response data that services exchange with each other. You have logs, you have metrics, and you have events. If you don't do these right, you also get a fourth kind of data, which is angry email messages between your devs and your DevOps.

The characteristics of these three kinds of data vary widely. Logs are your standard trace messages that you put in your applications: debug, info, warning, error, and so on. The freshness requirement between when a log is generated and when it appears on, say, your Kibana dashboard is flexible; you're okay if it shows up after five minutes. In some cases it's even okay to lose some logs, because they're typically for debugging purposes. Metrics are slightly more real-time in nature: you'd expect a metric to get updated on your Grafana dashboard within a minute of being generated. Metrics are also typically lossy, in the sense that it's okay to lose a few here and there, and they're your standard counters, gauges, timers, and so on.

And then there are events. I want to make a quick segue here and talk a little about engineering practice, specifically for distributed systems. One of the most important ways to make a distributed system highly available is loose coupling. If you make sure that one service does not depend on another service, then even if one service crashes and burns, the services around it are not affected, and you have achieved 80% of your high availability. Events are a great enabler towards that goal. When I say events, I mean business-specific happenings in the system. Say, for example, there is a web app and a new user registers. You might want to do a bunch of different things when that new user comes in, such as send them a welcome email, or provision some resources for that user. These things are typically done by different microservices: there is a service specifically for sending out welcome emails, and a service specifically for provisioning resources for the user.
So the web app just raises an event saying, hey, a new user has registered, and here are the details: name, email, and so on. The other services are listening for that particular event, and each one does its own thing. This is how you decouple services in a distributed system. The characteristic of events is that they typically have to be handled within a matter of milliseconds, because other business processes depend on them; you can't delay them too much. And they can never, ever get lost. Suppose the new-user-registered event is lost: provisioning hasn't happened, the welcome email hasn't gone out, and a bunch of other things haven't happened. So event handling needs to be much more robust.

So the characteristics of these different kinds of data are quite different, and this brings us to the why. The problem that we have solved using Heka is that collecting, aggregating, and processing these different types of data typically requires a variety of different tools. That, at a high level, is my problem. For logs, you would typically use Logstash, rsyslog, or Fluentd, and I'm sure some of you, the rich folks here, use Splunk as well. For metrics, you'd typically use StatsD with whatever backend, or you can have a collector agent sitting on each machine, gathering that machine's metrics and pushing them to a central server. Again, the rich folks can use Datadog or New Relic. And the worst case is events, because for events there is no standard. I've spoken with a whole bunch of folks, and to be frank there is no standard: different companies do it differently, and different teams within the same company do it differently. A typical approach is a brokered queue, such as SQS or Beanstalkd; some teams do it over WebSockets or XMPP; a lot of frameworks have been built for events, but there is no standard per se.

So the problem is that for an application developer, there are all these different frameworks for emitting these different kinds of data, and there is no standardization. And obviously, for the DevOps team, there are that many different technologies to support, maintain, and take care of. So: what if there were one tool that took care of most of my requirements? That's the goal, and that's why we reached out and started using Heka. It is not a panacea; it is not going to solve all your problems. But if one tool is going to accomplish 90% of my needs, then as long as the remaining 10% is not super critical, I'm okay with that, rather than having a set of different tools that solves the entire 100% but adds 50% more complexity. For me, operations is very, very critical, and to keep operations smooth, the fewer moving components you have, the better.

So what is Heka? Heka is a high-performance, extremely extensible tool for gathering, analyzing, monitoring, and reporting data. To recap: Heka is used for collecting or gathering data and then reporting it, and when data passes through it, you can also use it to do some processing, analyzing, monitoring, et cetera. It's a generic stream-processing tool. It is very high performance, and most critically, it is extremely extensible, which is one of the key things that helps us cover the 90% use case. In terms of semantics, it is extremely simple: it acquires data, transforms data, and outputs data. That, at a high level, is what it does. It's as simple as that.
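To make that concrete, here is a minimal sketch of what a hekad configuration looks like: one input acquiring data and one output reporting it, wired together in TOML. The plugin and parameter names here are from my memory of the Heka docs, so treat this as illustrative rather than copy-paste ready.

```toml
# Acquire: tail application log files from disk.
[LogstreamerInput]
log_directory = "/var/log/myapp"
file_match = 'app\.log'

# Encode internal Heka messages as Elasticsearch-friendly JSON.
[ESJsonEncoder]

# Report: ship every message to Elasticsearch.
[ElasticSearchOutput]
server = "http://localhost:9200"
message_matcher = "TRUE"   # match all messages
encoder = "ESJsonEncoder"
```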
Internally, it is actually modeled as a message router. Data comes in from different sources, all in different formats: log data, metrics, syslog data, and so on. Decoders transform all these different kinds of data into one common internal message format that Heka understands, and then all Heka does is pass the messages through a message router. Think of it like a standard network router, a Cisco core router, say. When a packet goes through, different filters match it, and based on which filters match, Heka transforms the internal message format, via an encoder, into some external message format and sends it out through an output channel. The output channel could be whatever: another Heka instance, SMTP, UDP, TCP, Elasticsearch, you name it. It's as simple as that.

If you want an anchor in your mind, something to compare it with, in the basic use case you could compare it with Logstash: it does what Logstash does. But that's just the basic use case. In general, it's a Swiss Army knife; it can do a whole bunch of things. In fact, it's more than a Swiss Army knife: it's a Swiss Army knife where you can put in your own blades to build your own weapon. How cool is that?

A little more on what Heka is. It accepts different kinds of inputs, as I mentioned a little while back. Typically that's log messages, but it can also listen on TCP and UDP sockets to receive data, it can accept HTTP requests, and it can accept StatsD data, because it has an inbuilt StatsD server. And there are various other input formats as well.

Then it can do some really cool things, especially with time-series data. Heka internally uses a circular buffer to store time-series data, and you define how big that buffer is and how many data points you want: the last five minutes, the last half an hour, or more. Because of that, it can do a lot of other things, like inbuilt real-time graphing, so technically you don't even need Grafana, with caveats of course; for the data in the circular buffer that's passing through, it has an inbuilt graphing engine. And it can do more complex processing as well. When I say complex processing, I typically mean aggregation: it can aggregate data for the last five minutes and then send the aggregate out to a central server.

And the cool thing is, it can do anomaly detection. Because you have this data in a circular buffer, there are inbuilt functions, plus whatever you write yourself via plugins and extensibility, to find anomalies in the data. Say you're collecting some statistic: you can say, if the standard deviation between the last five minutes of data and the previous five minutes is greater than two, then alert me. That's the anomaly you're trying to detect, so in a way it's now getting closer to monitoring. And it has alerting as well, precisely because it is such a generic data-stream-processing tool. When an anomaly is detected, a message goes back into the message router saying "anomaly detected," and then a filter says, okay, if an anomaly message comes through, the output should be SMTP, so it goes to you over email, or you can send out an SMS, or a call, whatever you like.
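As a sketch of how that alert path could be wired: a filter section points hekad at a Lua plugin (the filename here is hypothetical), and an SMTP output picks up the alert messages the filter injects back into the router. The parameter names are from my memory of the Heka docs, and the message_matcher expressions in particular are assumptions worth checking against them.

```toml
# Run a custom Lua sandbox filter over the metric stream; it can keep
# a circular buffer, aggregate on a timer, and inject alert messages.
[MyAnomalyFilter]
type = "SandboxFilter"
filename = "lua_filters/my_anomaly_filter.lua"  # hypothetical plugin
ticker_interval = 60      # evaluate once a minute
preserve_data = true      # keep buffer contents across restarts
message_matcher = "Type == 'heka.statmetric'"

# Mail out whatever the sandbox filter emits.
[SmtpOutput]
send_from = "heka@example.com"
send_to = ["oncall@example.com"]
host = "localhost:25"
message_matcher = "Type == 'heka.sandbox-output'"
```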
So that's what Heka is. Now, how do we at Exotel use it? Number one, we have written a library. The library provides methods for the service or application that uses it to emit logs, emit metrics, and emit events; it's one standard interface. Then a hekad instance runs on every machine, and hekad is configured in our case with three inputs. One is your standard logs, from the log files. Then there's a StatsD input: there is a StatsD server running inside hekad, as I said, and it receives StatsD data over UDP. And it also listens on a TCP socket, so events from the application go to hekad on the TCP socket directly. As hekad gets the data, it pushes it to a brokered queue; in our case, we use Kafka. At the other end of Kafka, there is another, global hekad that pulls the messages off Kafka and sends them elsewhere. The "elsewhere" for logs is Elasticsearch, for metrics it's InfluxDB, and so on. A sketch of roughly what the per-machine config looks like follows below. Soon we also plan to use Heka to push data into Nagios, because it has plugins for Nagios as well. Essentially, we want it to be the single data pipeline at Exotel.

The last few slides are about our experience, probably the more important part. It's a dev win for us, because there is a single library that devs integrate with; there are no separate libraries for logging, eventing, metrics, and so on. From a DevOps perspective, the brilliant thing is that the DevOps engineer has the ability to change the underlying pieces without any dev support. Suppose tomorrow you get funding and decide that instead of InfluxDB, you want to use Datadog for storing your metrics. All the DevOps engineer needs to do is change the Heka config to push the metric data to Datadog in addition to InfluxDB, and then at some point they can just cut off InfluxDB and push the data only to Datadog. All without a single line of code change. It is super lightweight; as I said, it's written in Go. It is super fast, and especially if you're coming from Logstash, you will find it a welcome relief; those are some stats that we have. One con is that it's slightly tough to configure, so do ping us in case you have any problems configuring it. And you can write Lua plugins, which again are very lightweight, and they can be loaded without restarting hekad.
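Here is that sketch of the per-machine hekad config under our setup: three inputs feeding one Kafka output. As before, plugin and parameter names are from my memory of the Heka docs, and the addresses and topic name are made up for illustration.

```toml
# Input 1: standard logs, tailed from log files.
[LogstreamerInput]
log_directory = "/var/log/myapp"
file_match = 'app\.log'

# Input 2: StatsD-format metrics over UDP, accumulated once a minute.
[StatsdInput]
address = ":8125"

[StatAccumInput]
ticker_interval = 60

# Input 3: events from applications, sent straight over a TCP socket.
[TcpInput]
address = ":5565"

# Everything goes onto the brokered queue. A global hekad on the other
# side pulls from Kafka and fans out to Elasticsearch, InfluxDB, etc.
[KafkaOutput]
addrs = ["kafka1:9092"]
topic = "heka-pipeline"
message_matcher = "TRUE"
```

The Datadog switch mentioned above would be the same trick on the global hekad: add a second output section with the same message_matcher pointing at Datadog (there is no stock Datadog output that I know of, so that part would be a custom or HTTP-based output plugin), run both backends in parallel, and then drop the InfluxDB output when you're ready.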
Final slide: takeaways. I think there is tremendous value in maintaining a single, very ops-light data pipeline in your company, and Heka can help you achieve that. So definitely do give it a try. That's pretty much it. Thank you. I don't know how much time I have for questions.

Hey, hi. First, a question about Heka: how reliable is it? Some of my experiences with message brokers have been mixed; I've had issues setting up redundant message brokers in the past. So what do you do for reliability here? Because quite clearly it's the center of your operations now.

So far it has been very, very reliable. In fact, it has not crashed a single time, and it's been running for months without crashing. It's lightweight as well, so it hasn't affected other processes running on the same machines.

Well, obviously the response to that is: you haven't died yet, and I'm assuming that's going to happen at some point. What I'm really asking is, how easy is it to build a redundant set of them? Is it necessary? If you assume your central hekad is vulnerable to crashing, what would the impact of that be, and how would you work around it?

Sure. I think the primary thing that would be affected in our setup is events, because logs go to log files anyway and can be pushed later when Heka comes back up, and I don't mind losing some metrics. For events, one option is probably to set up multiple hekad instances. We still have not explored that, but it's something that could potentially happen: two hekads running on the same machine, multi-master, with events pushed to both, and then deduplication somewhere along the pipeline. Again, as I said, we've not explored that so far. For logs and metrics, I don't see a problem: even if it goes down, it will come back and push whatever is left.

Hello? Yes, I have a question. Does Heka send logs to S3?

I think it does... It has a bunch of output plugins, similar to your Logstash plugins, but I'm not sure if S3 is among them. If you're purely comparing Heka's plugins with Logstash's, Logstash wins hands down: it has been around longer and is a much more evolved product, so in terms of plugin support, Heka doesn't quite measure up. There might be an S3 plugin; I can check.

No, I'm comparing Heka with Fluentd, which is what I'm currently using. What additional features does it have, exactly? I want to know what Heka does, because I'm considering implementing it in my environment.

As I said, the area where Heka really excels is the processing aspect. Fluentd and Logstash are typical data pipelines, but Heka can also process the messages, and you can control how it processes them. The example I gave was with respect to monitoring: you can put anomaly detection right at the place where you collect the metric. These are things I don't think you can do with Logstash or Fluentd.
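To give a flavor of what that looks like in practice: Heka ships with some bundled Lua filters that accept an anomaly-detection spec right in the config. This sketch is from my memory of the Heka docs, and the filter name, matcher, and anomaly expression should all be treated as assumptions to verify there.

```toml
# Watch an aggregated statistic and alert when its rate of change
# deviates by more than 1.5 standard deviations (illustrative values).
[StatGraphFilter]
type = "SandboxFilter"
filename = "lua_filters/stat_graph.lua"
ticker_interval = 60
preserve_data = true
message_matcher = "Type == 'heka.statmetric'"

  # Sandbox-specific settings live in a nested config table.
  [StatGraphFilter.config]
  anomaly_config = 'roc("stats", 1, 15, 0, 1.5, true, false)'
```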
Hi, I have a question. We have a use case similar to what you explained, but we are using Logstash instead of Heka. The problem we're facing with Logstash, sometimes, not always, is that a huge chunk of data comes in and the pipeline gets blocked. Have you seen similar things with Heka? And if that kind of situation comes up, how does Heka deal with it?

Again, as I said, so far we have not seen that. We have only a single hekad instance at the other end, pulling the logs off Kafka and pushing them to Elasticsearch, and so far we have never seen things pile up, because even on the global Heka, CPU utilization has been very low; it has always run smoothly. But Heka can also run as a cluster. Just like you have multiple Logstash servers running, you can have multiple Hekas, so even if a single machine hangs or fails, the rest of the cluster can take care of it.

One quick extension to that: you mentioned achieving high availability by adding multiple instances and so on. Does high availability come by design, or is adding more servers the only way to achieve it?

So in this case, adding more servers is not only the solution to scaling but also the solution to high availability. Absolutely, because if you have one server, that is going to be a single point of failure, and since Heka operates as a cluster, why not?

Okay, thanks so much. Thanks, Sid. Thank you.