Right, I think it's time we get cracking, so thank you all for joining me today for infrastructure monitoring basics with Telegraf, Grafana and InfluxDB. I'm a developer advocate for InfluxData; we're the creators of InfluxDB. In my past life, I was involved in industrial IoT solutions. Now, you might be wondering why the guy that worked on industrial IoT solutions is talking to you about observability, network monitoring and infrastructure monitoring. But I soon learned when I moved over to InfluxDB that there are a lot of nuances and practices that apply in both fields: the way we monitor industrial machines is very similar to how we monitor and look after our IT infrastructure. So yeah, I have a massive passion for the Apache ecosystem on which InfluxDB is now built, I'm a massive tinkerer at home, and I'm driven to make observability and IoT solutions accessible to all. My belief is that the success of any industry belongs with its domain experts, so it's our job as providers of services to enable people in those industries to make truly data-driven decisions and provide real impact to their customers and their users.

So, as I said, I come from InfluxData. We've been involved with open source for quite a while now; our product was founded in 2013 from very humble beginnings, and we're now at over 750,000 instances worldwide. If you're familiar with Home Assistant and other projects like it, you're probably storing a lot of your time series data within InfluxDB. And a lot of our customers now come from our open source community: Tesla, Disney, and Google itself with their own IoT monitoring solutions, all using InfluxDB as their one-stop shop for time series data.

On the agenda today, I thought I would break things down into a few stages, as we've got quite a bit to cover. First of all, I wanted to establish the difference between monitoring and observability: we'll see where they're similar, where they differ, and why they are two separate practices. We'll then have a look at a problem. For me, problem drives learning, so let's create a scenario and see how we can solve it using observability and monitoring practices. The cool thing is I actually used ChatGPT to create my observability problem, and then we will solve that observability problem, so it'll be quite interesting. From there, we'll do next steps. My plan is you should always be able to get your hands on code and get dirty with what's discussed here today, so anything we discuss with regards to Telegraf, InfluxDB, Prometheus or OpenTelemetry, all of the source code is online and available for you to use, and you can come back and give me feedback later and so forth.

So, monitoring versus observability. The core difference between monitoring and observability is twofold. Monitoring itself is the collection and analysis of metrics, logs and events. When you think about a metric, say if you're familiar with Prometheus, what you're doing is polling on an interval, say a CPU usage stat or a storage capacity, and asking: what is that reading right now? We're going to get that every minute, depending on your poll interval. An event, on the other hand, is essentially a metric with unknown timing. You could imagine an event to be something like an error code or a user-driven action.
Say they clicked a series of buttons, and these are being logged and written to a data store with a timestamp. We don't know when they're going to occur like we can with metrics, but they are ingestible all the same, and some of our most vital data comes from event-based data. The cool thing with event-based data is we can actually derive regular metrics from it using aggregation. What I mean by aggregation here is: say we have a number of errors that come in. If we count those in minute intervals, we've derived a regular metric, because we know that every minute we're going to get a count of how many errors occurred, even if it's zero errors. (I'll sketch this in SQL at the end of this section.)

So with monitoring, if we boil it down, whether we're looking at metrics, logs or events, we are looking at the state or health of a system that is out in the wild, whether it be in production or under development. We want to gauge how that system is performing, all the way from hardware capacity to resource utilization, etc. Where this differs from observability (because you might be saying: okay, within observability I use traces, logs and metrics as well) is a key differentiator: the ability to drill down into our data and work out exactly what's happening, why an error message occurred, or why a certain process that a customer has driven is causing that error message. And this is where a lot of the buzzwords like OpenTelemetry, and applications like Jaeger, really take their place. So when we think about monitoring, we're thinking about observing metrics over the lifetime of a machine: the system health, what's going on. With observability, we're proactively drilling into our code to work out why problems are occurring, or what user-driven event happened, in order to better write our code or work out how we can optimize it.

And so if we boil monitoring and observability down into four fields, this is how I see them. We have network-based monitoring: we're monitoring routers, switches and firewalls, ensuring data transmission, detecting bottlenecks and identifying security threats. For server-based monitoring, we're tracking the performance and availability of physical or virtual servers: think CPU usage, memory consumption, disk space and response times, ensuring optimal performance and reduced downtime. Application performance monitoring, or APM, is monitoring the performance of software: again, looking at bottlenecks in the architecture we design, inefficiencies in code, how we connect with databases, and how we query through databases and other infrastructure components. The reason I highlighted this one in blue is because, for me, this is where monitoring and observability meet: we can monitor our application infrastructure, but we can also observe it by looking at the traces produced within it. And then, of course, we have cloud infrastructure monitoring. This can be a really broad term (you can encapsulate server-based monitoring or application monitoring within cloud infrastructure monitoring), but where I want to take it here is cloud-based services: if we take AWS, for instance, and look at their own Kubernetes solution or any of their database services, we're basically looking at uptime, how they're performing, and the cost analysis as well. And that's how I differentiate cloud infrastructure monitoring from the other fields.
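Before we move on to the problem: earlier I mentioned deriving a regular metric from event data by counting errors in minute intervals, and promised a quick sketch. Assuming InfluxDB 3.0's SQL engine and a hypothetical events table with a time column and a level tag, that aggregation might look something like this. Note that minutes with no events simply produce no row, so a dashboard would still need to treat gaps as zero.

```sql
-- Turn irregular error events into a regular per-minute metric by
-- counting them in fixed one-minute bins.
SELECT
  date_bin(INTERVAL '1 minute', time, TIMESTAMP '1970-01-01T00:00:00Z') AS minute,
  count(*) AS error_count
FROM events
WHERE level = 'error'
  AND time >= now() - INTERVAL '1 hour'
GROUP BY minute
ORDER BY minute;
```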
So let's look at this ChatGPT-driven problem. Funnily enough, I feel it's trying to reproduce itself and then give us a problem within that area. ChatGPT created a scenario called WhisperGPT, and essentially the idea was that it was going to provide services to the greater world: a natural language processing model, used in support processing et cetera, to provide credible responses back. The problem for them is growth. This thing is going to scale rapidly, and we're worried about things like bottlenecks, latency, hardware capacity, et cetera. Essentially, the team at WhisperGPT would like to work out how to build a scaled solution for monitoring each of the key components we've discussed, in a hybrid architecture. What I mean by a hybrid architecture here is that we have both on-premises infrastructure that we need to monitor and also cloud-based application monitoring. For whatever reason, we might just want to keep some of our processing or compute on our own servers, in our own infrastructure, in our own building. So in this case, we have a series of servers running the WhisperGPT model on its own GPUs, talking to a backbone within AWS where we run and scale out, say, our user interface and also the API that developers interface with. So we need to monitor each of these pieces within the hybrid infrastructure.

If we break it down into our four columns again: we're monitoring our network and routers for capacity, how many requests are coming in, how much data is being sent back from these models, and whether our network is suitable for this. Server-based monitoring: in this case, we're going to focus on CPU and GPU usage, because we're running our models on GPUs. Application performance monitoring: this is where we really focus on how well the Kubernetes cluster holding up our model, our application here, is able to scale, where we should be pushing traffic, and how many requests each model should take. And lastly, cloud infrastructure monitoring: we'll be looking at the cost and uptime of running our solution on services like App Runner or Amazon EKS for our API and our user interface. If I can get this to move on. There we go.

So let's solve the problem. As we've seen, there's quite a mountain we need to climb, so what I thought we would do is split it into three sections: data collection, data storage and data in action.

First, let's talk about data collection. Telegraf, if you don't know it, is quite a popular metrics collector. It's fully open source, it's been around for quite a while now, it has over 12.6K stars on GitHub, and it's a single binary written in Go. It's all TOML-based, so it requires very limited knowledge of software or coding to be able to use and run it. And when I say it's community driven, it very much is community driven: of the 300-plus plugins, most have been written by the community, for the community, and we at InfluxData are just stewards of the Telegraf project.
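To give you a feel for how little is involved, here's a minimal sketch of a Telegraf config (the URL, token, org and bucket are placeholders) that polls CPU stats every 10 seconds and writes them to InfluxDB:

```toml
# Global agent settings: how often the input plugins are polled.
[agent]
  interval = "10s"

# Collect per-CPU and total CPU usage.
[[inputs.cpu]]
  percpu = true
  totalcpu = true

# Write everything to an InfluxDB 2.x/3.0 endpoint.
[[outputs.influxdb_v2]]
  urls = ["http://localhost:8086"]
  token = "$INFLUX_TOKEN"
  organization = "my-org"
  bucket = "monitoring"
```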
So you can see in this list of input plugins that I've highlighted a few that might be useful for us in solving this problem, such as CloudWatch, CPU, disk stats, disk IO, Gemini; we have other protocol-based plugins; we have memory, Kubernetes monitoring, and NVIDIA SMI for monitoring our GPU capabilities. I always like to highlight the Minecraft one in orange: we will accept any plugin if it's useful to someone and the code is great. In this case, someone wanted to monitor their Minecraft gaming instance, the code was awesome, it solved a problem, and it was contributed as an input plugin. So it's very much an open source project. And the list goes on, as you can see here, and we'll cover a lot more of these as we go. I think I missed the page: OpenTelemetry is up there as well, and that'll be a good one we'll cover later.

So if we look at the Telegraf architecture under the binary, this is how our plugins hook up. We have a series of input plugins where we collect our data from. We then have processor and aggregator plugins; these allow us to enrich our data and also pre-aggregate some of it if we wish to. And then we have a series of output plugins. We're going to talk about the InfluxDB output plugin today, but that's not to say you can't send your data to MongoDB, or to CloudWatch, or to an OpenTelemetry collector if you have other methods you want to use to push that data elsewhere; you have that versatility with Telegraf. You're not restricted to just sending your data to InfluxDB.

So, Telegraf setup. Telegraf is meant to be highly versatile in how you set it up. We have flavors for most Linux distributions, so Ubuntu, SUSE, Red Hat; most binaries are covered for Linux. We also have Windows, macOS and Docker, and Helm charts are available. Essentially, what you need to do is create a Telegraf config, which I'll show you in a moment; that's just a TOML-based config. Then you can test your config with a series of commands. You can see here telegraf --debug: give it your Telegraf config and check how your plugins are collecting data. --test is a really cool one that I like to show people, because it allows you to collect from your input plugins but not send your data to your output plugins. That means if you have any issues with the data you're collecting, you can catch them before you actually start writing to a database or to your end source. And then telegraf --once is great for testing your output plugins, because it only sends one sample rather than, say, 5,000 or 10,000 metrics at a time before you get to that point. And once you've done that, deploy. Whether you're deploying it as a Windows service, in Kubernetes, with Docker Compose, or via systemctl, Telegraf is very versatile in how you plug and play it into your infrastructure.

I won't have enough time to cover it in the talk today, but you can also sidecar Telegraf into Kubernetes. I wanted to leave that up there just in case you wanted it; there's a simple demo in the repository there so you can see how to sidecar Telegraf into your Kubernetes infrastructure. The cool thing about this demo is we actually use the Prometheus input plugin: we monitor our application and also the Kubernetes infrastructure using Prometheus endpoints, scrape all of these, and send them into InfluxDB. We fully acknowledge Prometheus as the master of all monitoring when it comes to Kubernetes, but it just shows you can integrate other agents into this infrastructure and have more versatility over the components that you're monitoring.
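To make those test commands concrete, here's roughly how they're run (assuming your config file is named telegraf.conf in the current directory):

```sh
# Collect from input plugins and print metrics to stdout,
# without writing to any output plugins.
telegraf --config telegraf.conf --test

# Run a single collect-and-write cycle, then exit: handy for
# checking output plugins with one batch instead of thousands of metrics.
telegraf --config telegraf.conf --once

# Run normally with verbose debug logging to watch plugins collect.
telegraf --config telegraf.conf --debug
```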
So here's a Telegraf config; this is the agent config, the global config. What you can see here are some simple configuration bits just to get you started. The interval, for most plugins, is how often we poll for data (there are certain plugins which are push-based inputs, so they don't make use of the interval). You can set a global interval, or you can set an interval per plugin, so you could collect down to, say, a one-millisecond interval on one plugin, or only query certain plugins every two hours or every two days or something like that. Any data collected by an input plugin is put into a memory buffer, so if for whatever reason your network goes down, those samples are stored within that queue, and when you come back online they will be written out of it. You have the ability to supply a batch size as well as a buffer limit for that. Just so you know, the bigger the queue, the more memory you need; we've seen people configure astronomical queues and then ask why they've run out of memory, and it's because they'd lost their internet connection and had too many metrics sitting in the buffer.

So that's some of the agent-specific configuration. Let's move into the actual input plugins that we'll be using to solve this problem. To solve our network problem, we'll be using SNMP, the Simple Network Management Protocol. Through Telegraf we can monitor our routers and firewalls this way. We can poll endpoints over SNMP, so we can ask Telegraf to reach out and say: hey, tell me the current status of this router or this firewall. Or we can also monitor traps, so if an event occurs, firewalls or routers can feed data directly back into Telegraf as our collection server. In this case, we're going to keep it really basic: give me the system uptime for this router, and give me the system name. But we could also monitor temperature, usage, or the network throughput of routers, and there are great examples for Cisco-based routers and others online.

The rest is quite bog standard: we have CPU, we also have OpenTelemetry, which I'll cover in a second, and then we have CloudWatch metrics. In this case, what we said we'd do is use CloudWatch to monitor all the different AWS parts, and then collect from CloudWatch all the metrics we want from all our different services. And as you can see, we also have output plugins: we're going to be using the InfluxDB v2 plugin, and here you can see an example of the configuration for writing the data to InfluxDB. But like we said, you can also write to Prometheus, OpenTelemetry, or AWS and Azure based output plugins as well.
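Pulling the plugins above together, a hedged sketch of what the relevant input sections might look like; the agent address, community string, OIDs, region and namespaces are all assumptions for illustration, and note that older Telegraf releases used a singular namespace option for CloudWatch:

```toml
# Poll a router over SNMP for its uptime and system name.
[[inputs.snmp]]
  agents = ["udp://192.168.1.1:161"]
  version = 2
  community = "public"

  [[inputs.snmp.field]]
    name = "uptime"
    oid = "RFC1213-MIB::sysUpTime.0"

  [[inputs.snmp.field]]
    name = "source"
    oid = "RFC1213-MIB::sysName.0"
    is_tag = true

# Pull service metrics for our cloud components out of CloudWatch.
[[inputs.cloudwatch]]
  region = "us-east-1"
  namespaces = ["AWS/EKS", "AWS/AppRunner"]
  period = "5m"
  interval = "5m"
```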
So we've covered data collection; let's talk quickly about data storage. InfluxDB is a purpose-built time series database. It's now built on Arrow, Parquet and DataFusion, so it's designed for ingesting millions, if not billions, of metrics per second of high-cardinality data. InfluxDB is schema-on-write, so you don't have to define a schema beforehand, which means you can keep modifying it as you go. We can query and write data on the leading edge at millions of rows per second, since we're using a columnar store. It can be a single database for metrics, logs and traces, which I'll show in a minute within the demo. And with the new InfluxDB 3.0 we now have SQL support, so if you're a SQL user you don't have to worry about learning a new query language; you can use SQL directly within InfluxDB. And this is just a quick bird's-eye view of the flow of InfluxDB 3.0, from data collection to data storage to data visualization.

To understand a database, I think it's good to understand its core data model, so let me quickly brush through this. A bucket closely resembles a database, with one key difference: it allows you to set a retention policy. Within a time series database, it might depend, but your data, as it gets older, either becomes less useful or less interesting to you compared with the new data coming in. If you're storing lots of high-volume time series data, a retention policy also allows you to manage your disk space. So suppose we set a 30-day retention policy on our bucket: when a data point's timestamp becomes older than 30 days, that data is automatically deleted so we can bring new data in. You can set unlimited retention if you need to, but most people design their retention policies along these lines: here's all my raw data going into a seven-day bucket; I'll then move it to a longer-term storage bucket of 30 days when I downsample and aggregate that data.

Measurements you can think of as tables; that's our containerization. Tag sets are part of our primary key; that's how we differentiate our data series, or our data points from one another if they share the same timestamp. Field sets contain our actual data or readings, in numerical or string representations. Then we also have our timestamp, which can go down to nanosecond precision. And a series is a unique combination of measurement and tags.

So if we work this down into the data model, here is an example. We have our measurement, server; our tag set, which is part of our primary key, is the hostname of where the data came from and also its location; within our field set we have our memory and our CPU; and then we have our timestamp associated. This is the format InfluxDB ingests, which is called line protocol. You don't have to worry about line protocol; Telegraf does that all for you. You just collect the data, and the output plugin writes all of it as line protocol so InfluxDB can ingest it.
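Written out, the example point just described looks something like this in line protocol (hostname, location, values and the nanosecond timestamp are made up):

```
server,hostname=host001,location=eu-west memory=76.2,cpu=48.9 1687395600000000000
```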
Now, some schema recommendations; I'm just going to brush over these. Avoid wide schemas and avoid sparse schemas. The way you do this is through a homogeneous architecture, which I'll show you in a second, but it essentially means: try to keep your tables consistent. Store all your network data in one table, store all of your application monitoring data in another, and that prevents you from having, say, lots of null values, or a wide schema that has too many columns of irrelevant data. The second thing is to design for query simplicity, using SQL or InfluxQL, whichever query language you plan to use. If you have really rogue names for your columns, just remember you need to query those columns later. If you call a column server124xy-247 or something, you're going to have to write that within your SQL query and escape those special characters, so we advise keeping your column names simple. And this is what I mean by homogeneous: keeping all of your data consistent within its containers, so measurement one could be your network data, measurement two could be your server data, and then your application and cloud-based monitoring. Let me just check the time.

I wanted to really highlight a new use case for InfluxDB based on our new storage engine, our new version. We are now focused on supporting OpenTelemetry, which means we can store traces, metrics and logs all within InfluxDB in one storage engine, rather than spreading them across different dedicated storage engines for metrics, logs and traces. I just wanted to show you how the schema looks for OpenTelemetry with us. You won't have to worry about this, and I'll show you how the demo works and how we transition it, but you can now see that within each table we can store our spans, our logs and our metrics for application-based monitoring. This has been made possible by our new storage engine, which has unlimited cardinality; we've basically removed the worry that when you create a tag, which is a unique ID, you end up with runaway cardinality, runaway tags that could have an infinite number of values. And that's what's made OpenTelemetry monitoring and trace monitoring possible for us.

Last thing to mention on InfluxDB: you can also use it within a hybrid solution. We understand that a lot of people sometimes want to keep their database close to the source, or at the edge. So what you can do is install InfluxDB locally, say on your server; collect all of your raw data locally; then downsample or aggregate that data locally; and then write it to a more global source. That's exactly what Edge Data Replication allows you to do: as data is written into a bucket, we automatically put it into a durable queue, which then writes that data to your remote instance of InfluxDB. Just a cool new feature that was added to InfluxDB open source quite recently.

So, we've covered data collection, we've covered data storage; let's finally talk a little bit about data in action. We love Grafana. Grafana has been a partner of ours for a long time; yes, they have their own solutions for time series, logs and traces as well, but we're first-class citizens there and we go a long way back with each other, so it is our primary user interface and dashboarding method for InfluxDB. On Grafana flavors: there's Grafana Cloud and Grafana open source, we have plugins for each, and you can interact with InfluxDB through three methods: Flight SQL, InfluxQL and Flux. We're going to focus on Flight SQL today, which is the SQL engine, just since that's new to InfluxDB.

If you're not familiar with Grafana, the flow works like so. Essentially, you define a data source; in this case we're using the Flight SQL plugin. We specify our InfluxDB endpoint, we specify a token, which is our security interface with InfluxDB, and then we supply some metadata; the metadata here is the bucket name.
Now, the reason we supply metadata is that we actually contributed this Flight SQL plugin to the open source community, so anyone that has a Flight SQL endpoint, such as Dremio or Druid or any of the other columnar stores, can make use of the Flight SQL plugin. That's our commitment to open source as a company.

From there, you then use Explore to create your query. We're just writing a standard query here that you can see in SQL: we're collecting a few columns and specifying the time range that we want to collect the data from, and we get a table back within Grafana. From there, we can take that time series table and create a visualization, which I'll show you in the demo. In this case, we're just monitoring our CPU usage, and you can see we've differentiated our usage into different series: we have CPU 4, CPU 5 and also the total. That's where tags come in: if we didn't tag our CPU readings, we wouldn't be able to differentiate between the different metrics coming from each CPU.

Here are some useful queries for you in SQL. In this case, you have date_bin; this is how you create time-based aggregations in SQL. You can basically say within Grafana: okay, I want to bin all of my data into 5-minute intervals, and then I want to average the data within those 5-minute bins. That's what the SQL command at the top is showing you. In the next one, you can see the selector_last and selector_first functions. These are specific to InfluxDB and to Flight SQL; they're time-series-based functions, and they let you select the last row or the first row for a given time range. That's great for gauges, whereas averaging out your data is great for line graphs and charts like that. So depending on how you're querying, think about how you want to visualize your data, and this will help you in terms of how you build your queries.

There's a QR code here for a basic quick-start dashboard that we've created. It gives you the system stats for your machine: you use Telegraf to monitor things like disk usage, memory and system load, we write all of that to InfluxDB, and then we provide a dashboard that visualizes all of it using SQL, so you can see for yourself.

You can also be proactive in Grafana with alerting. Alerting in Grafana is extremely powerful, because you can define a threshold and say: if my CPU usage is above a certain limit, and it stays above that limit for, say, 2 minutes, then trigger an alert. And there's a variety of ways you can deliver an alert through Grafana, whether it be through Prometheus, Slack or PagerDuty; you can even write the data back into Telegraf to use its other output plugins. The Grafana alerting system is pretty versatile.

I think time is getting away from me a little, but as you can see here, we've built our different components: we have data collection, data storage and data in action. And that's not to say you can only use Grafana. You could build out your own solution with any of the client libraries; you could use data analytics engines like Apache Spark or RapidMiner; and there are other visualization platforms besides Grafana, such as Superset, which I advise you to check out as well.
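Before the demo, here's a hedged sketch of those two query patterns from a moment ago, written against the cpu measurement that Telegraf's CPU plugin produces (usage_user is one of that plugin's fields; the selector syntax is as documented for InfluxDB 3.0's SQL at the time of this talk):

```sql
-- Pattern 1: bin raw readings into 5-minute windows and average each
-- window. This is the shape you want for line graphs.
SELECT
  date_bin(INTERVAL '5 minutes', time, TIMESTAMP '1970-01-01T00:00:00Z') AS window,
  avg(usage_user) AS avg_usage
FROM cpu
WHERE time >= now() - INTERVAL '1 hour'
GROUP BY window
ORDER BY window;

-- Pattern 2: grab the most recent reading. This is the shape you want
-- for a gauge. selector_last returns a struct carrying both the value
-- and its timestamp.
SELECT selector_last(usage_user, time)['value'] AS last_usage
FROM cpu
WHERE time >= now() - INTERVAL '1 hour';
```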
So I'm not sure if I'll be able to cover the demo, looking at the time, but this is the demo that you can try at home. Essentially, instead of using Telegraf, it uses the OpenTelemetry Collector. What we do is collect the data from Hot R.O.D., and we write that data directly into InfluxDB, including spans, logs and metrics. We then use Jaeger Query to query that data back out, and we use that as our bridging interface with Grafana: we use a Jaeger data source, so any queries coming from Grafana go into Jaeger Query, we convert the Jaeger query into SQL, and then we ask InfluxDB for the results. That's how we can interface directly with our trace data.

I think I can quickly show you, if I do this really quickly. You can see the Hot R.O.D. demo here; if I quickly generate a trace and navigate over to Grafana, you can see our Grafana interface here for monitoring our traces. This is over the last 90 days; I can bring it down to, say, the last 5 minutes. Basically, what we can do is click on a trace here, see the relationships of all the spans within that trace, and then actually drill into the trace as well, as part of our observability stack. And all of that data is stored within InfluxDB, so we've not had to use multiple different data sources to combine it; it's all directly within InfluxDB.

I believe I just went back across, so I won't jump over this, but as you can see, we've gone full circle in monitoring our solution. We've used Telegraf as the backbone for most of our data collection here, whether it be monitoring our applications or our cloud-based infrastructure, and we're also looking at our OpenTelemetry data as well. We store all of this data, even in different tables or different buckets, within InfluxDB, and then we use Grafana as our central visualization and observability platform connected to InfluxDB.

So, next steps: how can you get cracking and try all of this yourself? The first place I recommend starting is the quick starts. We created this repository to get you started with a series of Grafana dashboards and also Telegraf configurations; you can use that QR code. I try to add to it when I can based on community feedback, so I add more as we go.

If you would like to try the OpenTelemetry demo and get started with OpenTelemetry: for me, OpenTelemetry is the next bandwagon to be part of. I think the guys from Adobe said it was on a bit of a hockey-stick trajectory; we're really only scratching the surface of OpenTelemetry's popularity, so it's definitely one to get started with. I use Killercoda; Killercoda is just an online education tool, which means you don't actually have to pull the repo and configure it yourself, you can just follow the step-by-step tutorial on Killercoda. Hopefully everyone got that one.

And last but not least, as a dev rel I would not be doing my job if I didn't say: please get involved with our community. We have a vibrant community on Slack and Discourse; we're there all the time, answering questions, looking at humble solutions for home projects, and we also have people contributing directly to Telegraf, InfluxDB and other projects, as well as our OpenTelemetry connector for that matter. So please get involved with our community; we wouldn't be here without you guys, and it's always exciting to have new members. We also have InfluxDB University, which has further courses on learning Telegraf and InfluxDB as well.
Hopefully I've left enough time for questions. Sorry, I feel like there was probably a lot of content in there, but thank you very much, guys; I hope that was insightful enough to get you started. Does anyone have any questions? ... Thank you.

So, we store everything within Parquet format now, and we've actually been experimenting with the compression of logs within Parquet. I would say, frankly, there's still more to be done on how we can compress logs in Parquet, but it's made a sizeable difference compared to our last storage engine, TSM, which was our Time-Structured Merge Tree. So we're on the road to having much better compression for logs as well.

Yeah, I do need to mention this: we're actually releasing a feature for InfluxDB that's coming out soon, which is basically retention-based downsampling, and essentially that will allow you to do exactly what you said. Sorry, I should be repeating these questions as part of this, but essentially what he asked was: when data is being deleted as part of a retention policy, can I aggregate that data and move it to a different bucket? That's the crux of the question. And yes, we do have a feature coming which basically says: as data is being deleted, that last section of data on the 30 days, you could take the average or take the last sample within that silo of data being deleted and move it to a bucket that has a longer retention policy. So yeah, we're actively looking into that, because it's a feature that's been requested for a long time. And in the open source product, just to mention as well, we have a task-based system which allows you to do some of this downsampling and aggregation between buckets.
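For reference on that last point, a task in the open source product that downsamples from a short-retention bucket into a longer-retention one looks roughly like this in Flux (the bucket names and the measurement filter are assumptions for illustration):

```flux
// Run every hour, averaging raw CPU data into 5-minute windows
// and writing the result to a longer-retention bucket.
option task = {name: "downsample-cpu", every: 1h}

from(bucket: "raw-7d")
    |> range(start: -task.every)
    |> filter(fn: (r) => r._measurement == "cpu")
    |> aggregateWindow(every: 5m, fn: mean, createEmpty: false)
    |> to(bucket: "downsampled-30d")
```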