Hi, everyone. Thanks for tuning in. I'm a DevOps engineer at Zerodha. I'm primarily interested in monitoring and observability systems, using Prometheus and Grafana. I blog at mrkaran.dev, where I share interesting things about monitoring and networks in general. I also happen to run my own self-hosted stack of applications, and this monitoring project is inspired by the same set of principles: I host a lot of applications, so I need to monitor them, and I thought, why not monitor my network as well? We've been in the lockdown for about seven or eight months now, and ISPs, of all things, have taken a huge hit because everyone's been working from home. As you know, retail network lines aren't really built for such a high load arriving as a sudden spike. Ever since the lockdown started, one ISP problem or another keeps happening, and people in various groups keep reporting them. So what I wanted to see was what kind of problems my ISP was giving me. A lot of times a friend tells you he checked whether xyz.com is up or not; your internet might be fine, but his might have issues, and that happens even with the same internet provider. I used to have a lot of these conversations in my company group, so I thought, why not get an overview of how the network is actually performing? That was the whole idea behind monitoring my ISP. Before diving into the final solution, I had a couple of alternatives which I'd like to discuss here. One of them was SmokePing. SmokePing is a very old network monitoring solution; it's a PHP application that gives you a lot of interesting information about your network using ICMP probes. But the problem is that it comes with its own graphs, its own monitoring, and its own alerting system.
And I was already running Grafana on my own DigitalOcean VPS, so I didn't want to add another monitoring service just for this one thing. That's the only reason I decided against SmokePing. The other option was a speedtest-cli exporter. I actually ran this for a week or so, but then I realized that every speed test you do costs you around 50 to 70 MB of bandwidth, and running it frequently, say every five minutes, isn't really recommended if you're on an internet plan with a FUP (fair usage policy), which is basically almost every retail plan. That was one reason I didn't go ahead with the speed test. Another reason is that a lot of ISPs peer directly with the upstream servers that speed test sites use. So you might get the flawed impression that your internet is working fine and fast, while a lot of regions are actually very slow on your connection; it doesn't simulate the real-life browsing experience. It's just like comparing synthetic benchmarks versus real-life performance. So that's why I thought speed tests were not a great idea. Then, before I interacted with Gaurav on the forum, as he said, I was almost tempted to go ahead with my own simple solution, which was just to ping a couple of sites in a shell script, write the output to a SQLite DB, and use a Go program I had written which converts the DB data to the Prometheus exposition format, the Prometheus wire format. I was almost tempted to go ahead with this, but when I interacted with Gaurav, I found his solution is much more capable and does basically the same thing. So I thought I should not reinvent the wheel and checked out Telegraf. Telegraf is also a Go program, and it's built around this concept of plugins: you configure input, output, and filter plugins to do a lot of custom monitoring for your infrastructure.
So I was particularly interested in the ping plugin, which uses a native Go ping library as a dependency. It also comes with a mode where it just runs the ping executable available on the operating system, and you can specify whichever mode you want. I chose the Go mode because it's more consistent across OSes. I was rather confused about one thing: I had a wrong assumption that Telegraf only works with InfluxDB, but that's not true. Looking around the documentation, it actually ships an output plugin for Prometheus. That was perfect for me; I really just wanted the Prometheus format for my existing monitoring setup. And it's not just the ping plugin. You can run a DNS plugin, which I'm doing because I have my own DNS server on Pi-hole, which I wanted to monitor as well. And you can even run HTTP plugins, which can monitor layer-7 upstreams. Telegraf configuration is very simple; I'll share a sample configuration after this session is over. The basic idea, like I said, is that you just have plugins. You define the input plugin in the TOML format, and honestly, it's very refreshing to see TOML over YAML in any ops tool these days. You just give a bunch of upstreams, along with the standard ping arguments, and you specify the output plugin, which, as I said, is Prometheus for me. And you just get these metrics in the Prometheus format. These are the kinds of metrics it exposes for ICMP monitoring, and you can derive a lot of information from just three or four metrics by writing queries in PromQL, the Prometheus query language, and visualizing them in Grafana. So if you see, this is the basic dashboard with the list of upstreams I've defined in my configuration, and I use Grafana to visualize all the metrics flowing in.
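To make the setup concrete, here is a minimal sketch of what such a Telegraf configuration might look like. The plugin and option names are from Telegraf's documentation; the specific upstream hosts, the Pi-hole address, the intervals, and the port are illustrative assumptions, not the speaker's actual config.

```toml
[agent]
  interval = "60s"          # how often inputs are collected

# ICMP probes against a few upstreams (hosts here are examples)
[[inputs.ping]]
  urls = ["1.1.1.1", "google.com", "github.com"]
  method = "native"         # Go ping library instead of the OS `ping` binary
  count = 5                 # packets sent per collection interval

# DNS query timing against a local resolver, e.g. a Pi-hole
[[inputs.dns_query]]
  servers = ["192.168.1.2"] # assumed Pi-hole address
  domains = ["example.com"]
  record_type = "A"

# Expose everything in the Prometheus exposition format
[[outputs.prometheus_client]]
  listen = ":9273"          # Prometheus scrapes this endpoint
```

With this in place, Prometheus just scrapes the `prometheus_client` listener like any other target.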
So I'll just go through a couple of patterns you can gauge from all the metrics exposed by Telegraf, so you can fit them to whatever use case you're looking at. One of them is packet loss. Packet loss happens fairly frequently; let's say you're in Delhi right now and an upstream in Singapore is failing due to ISP routing issues. These are very common, but packet loss beyond a certain percentage is when it starts to impact the real browsing experience. In my particular case, if I see a lot of packet loss, I tend to reboot the router, and that usually fixes the problem. If you notice, at 10:50 I just lost my internet, and before that I was getting some packet loss. So what I can do is plug these Prometheus alerts into Alertmanager, which lets me define an HTTP webhook endpoint, and there I can run all my custom scripts: if I get a packet-loss alert, I can just trigger a reboot. I'm running this on my Raspberry Pi, so all of this is localized; the shell scripts and the very basic HTTP endpoints all live there. Secondly, there's this pattern where I had a downtime during the night because some ISP maintenance work was happening or something like that, and before that you can see the ping time had spiked up. So you can gauge what percentage of downtime your ISP is giving you. Sadly, in India, or I guess anywhere, you can't really enforce these SLAs, at least on retail connections. If you're on a corporate line, it's a different thing; there this can be useful, because you have your own uptime metric and you can send them a report if your SLA is breached.
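The alert-to-reboot loop described above can be sketched roughly like this. This is a hypothetical illustration, not the speaker's actual scripts: the alert name `HighPacketLoss`, the port, and the `reboot-router.sh` path are all assumptions. It uses only the standard library and the standard Alertmanager webhook payload shape (a JSON body with an `alerts` list).

```python
# Sketch: a tiny webhook receiver for Alertmanager that reboots the
# router when a (hypothetical) packet-loss alert fires.
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer


def should_reboot(payload: dict) -> bool:
    """True if any firing alert in the Alertmanager payload is the
    assumed HighPacketLoss alert."""
    return any(
        a.get("status") == "firing"
        and a.get("labels", {}).get("alertname") == "HighPacketLoss"
        for a in payload.get("alerts", [])
    )


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        if should_reboot(payload):
            # e.g. a router-specific reboot script living on the Raspberry Pi
            subprocess.run(["/usr/local/bin/reboot-router.sh"])
        self.send_response(200)
        self.end_headers()


# To run on the Pi:
# HTTPServer(("", 9095), WebhookHandler).serve_forever()
```

Alertmanager would then point a webhook receiver at this endpoint, with a route matching the packet-loss alert.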
But interestingly, and this happened just yesterday: I'm giving my talk on my mobile internet, and it's a talk about monitoring your home internet. What happened was, I have Airtel in my hometown, and Airtel just randomly started digging right in front of my house in the name of laying fiber cables. So I've had no internet for about the last 30 hours, and this is the uptime metric I captured, a 24-hour uptime graph taken in the afternoon. I still don't have internet, but at least I have my graphs to show that I don't have internet. This one is also interesting. There's a metric called ping_result_code in Telegraf which gives you a 0 or 1 value: 0 means the ping was successful, 1 means the host was not found, and it can actually return 2 as well, which means some other error. 1, host not found, means it's a DNS problem; like everything else, it's always DNS. You can see that both time frames actually match up with some issues in my DNS server. This is the graph of the DNS query time, where the average query time spiked between 2 and 3 p.m., and that's the same window where my ping_result_code was also showing DNS-related issues. That could mean that either my Pi-hole, or the way I connect to my DNS server, which is over Tailscale, a WireGuard mesh network, was having issues; and if I want to dig in, I can get more information from this graph and work through the problem. And yeah, this is again the packet loss metric, where I can see if packet loss is more than some percentage. At 40% it's really noticeable when browsing any site, because your TCP packets have to go back and forth, and with such high packet loss that suffers. So you can see which particular upstream is giving you very high packet loss.
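The kinds of queries behind these panels can be sketched in PromQL along these lines. The metric names follow Telegraf's `prometheus_client` naming convention (measurement plus field, e.g. `ping_result_code`, `ping_percent_packet_loss`, `dns_query_query_time_ms`) and may differ depending on the Telegraf version and configuration; the thresholds and windows are illustrative.

```promql
# Upstreams currently failing DNS resolution (result code 1 = host not found)
ping_result_code == 1

# Per-upstream packet loss above 40% over the last 5 minutes
avg_over_time(ping_percent_packet_loss[5m]) > 40

# Rough uptime percentage over the last 24h: fraction of 1-minute
# samples where the ping succeeded (result code 0)
avg_over_time((ping_result_code == bool 0)[24h:1m]) * 100

# Average DNS query time from the dns_query plugin
avg_over_time(dns_query_query_time_ms[5m])
```

Each of these can be dropped into a Grafana panel, or used as the expression in a Prometheus alerting rule.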
This usually happens for sites hosted in us-east-1 or similar regions, where the ISP routing is not really efficient, or in cases where the CDNs are hosted not in your country but in places where your ISP doesn't have proper peering. So you can see that as well.