So, I'm Paul Dix. This talk is about monitoring Mesos with InfluxData, specifically the different parts of the stack that we make. I'm going to give an introduction to the stack itself, and then Tomas is going to come up and talk about how they're using Influx to monitor Oracle Data Cloud. I'm the founder and CTO of InfluxData, and we're the makers of InfluxDB, among other things. At InfluxData we call ourselves a modern engine for metrics and events; I think of it as time series data. In my mind, time series data is not just metrics. Metrics are things sampled at regular intervals, but underneath it's also events.

So we built different components for dealing with the different challenges people have when working with time series data. Obviously they need to store it, so we built InfluxDB, a time series database. It's open source, MIT licensed, and written in Go. We started the project back in 2013, but the precursor to it was code I had started in the fall of 2012, so pretty early in Go's life cycle. It has a query language that looks kind of like SQL; we added a little bit of sugar to make time series queries easier, which I'll show in a bit. A couple of years ago we wrote our own storage engine from scratch, which some people thought was kind of an insane move, but it's actually turned out really well. We call it a time-structured merge tree; it's a storage engine that's very similar to LSM trees, if you know those. It has built-in support for retention policies, so you can keep high-precision data around for, say, seven days, and keep medium-precision data around for longer, all that kind of stuff. And it has what we call continuous queries: you can create a query that summarizes or downsamples your time series data and writes it into another retention policy, and it will just run continuously in the background on the database, so you don't have to build separate downsampling logic.

So what is time series data? When I think about time series data, I think about stock trades and quotes; each trade in a stock market is obviously a time series. Here we're looking at Apple's stock price, so what you're actually seeing is a summarization of an underlying time series: there are too many trades of Apple stock in a day to visualize on a single graph, there aren't enough pixels, so this is a summarization. We also think about metrics: server metrics, server monitoring, application performance monitoring, user analytics, events. This is a log from Apache; when I see this, I see a bunch of different time series: 200 responses over time, requests to a specific page, 404s or errors. And finally, sensor data, physical sensors out there in the world; this is the IoT use case.

So as I mentioned, there are two different kinds of time series data. There are regular time series, which are samples at fixed intervals of time, and there are irregular time series, which are event-driven; this could be trades in the stock market, or response times for individual requests to an API. And the thing about irregular time series is that you can induce a regular time series from an irregular one.
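For example, a query along these lines induces a one-minute regular series from the raw events (a minimal sketch in InfluxQL; the api_request measurement and response_time field are hypothetical names for per-request timing data):

```
SELECT MIN("response_time"), MAX("response_time"), MEAN("response_time")
FROM "api_request"
WHERE time > now() - 4h
GROUP BY time(1m)
```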
If you have a bunch of individual API requests that you're timing, and you say, I want to look at the last four hours of data as one-minute summaries of the min, the max, and the mean response times, you've basically just created a regular time series out of that underlying event stream.

So InfluxDB has a pretty simple API. It's an HTTP API with two endpoints. There's one to write data: you do a POST, you specify the database you're writing to, and you pass the username and password. There's a GET endpoint where you can send a query. Everything else is done through the query language, which I'll show off in a second.

Here's what the POST, the write, looks like. We created a line protocol to represent time series data. The structure of the data as you feed it in is: you have a measurement name, which is a string; you have tags, which are key-value pairs where the values are strings, and this is metadata that we index; and then you have fields, which are key-value pairs where the value can be one of several types: int64, float64, string, or bool. Technically you can have an unlimited number of fields, but realistically you'd probably want fewer than, say, a thousand. Our collection agent will collect, for example, CPU statistics, and every unique measurement under CPU will be a different field. In the line protocol, floats must have a decimal, and the types must remain consistent over time. Finally, the timestamp is a nanosecond epoch. Surprisingly, we do have people who use nanosecond-scale timestamps; we've largely seen this in quantum computing use cases and at some high-frequency trading firms, which track their network gear at that level of granularity because they have atomic clocks that keep everything in sync globally.

We have a command line interface where you can just spin it up and throw queries at it, so let's walk through a few of the queries. You have the concept of a database, and you can create a database with the query language. You can create a retention policy: it has a name, it applies to a database, and it has a duration. The DEFAULT part says that, by default, all writes and queries will hit this retention policy; you can change that at the time of a write or at the time of a query. The other thing to note is that, beyond creating a database and a retention policy, there's no other formal setup you have to do. It's not like a SQL database where you have to create tables with a schema and all this other stuff: once you've created a database and retention policy, you just throw data at it and it creates the schema for you on the fly.

With the time series use case, we found that in addition to querying the raw time series data and getting summaries, discovery becomes pretty important, especially in infrastructure monitoring, where you could have thousands of servers and a bunch of different services, and you may not even know what data is available for you to query. So we wanted to make sure people could discover what data exists for them to work with.
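To make the write path and setup just described concrete, a minimal session might look like this (a sketch; the host, database name, and credentials are hypothetical):

```
# Create a database and a default seven-day retention policy:
curl -XPOST 'http://localhost:8086/query' \
  --data-urlencode 'q=CREATE DATABASE mydb'
curl -XPOST 'http://localhost:8086/query' \
  --data-urlencode 'q=CREATE RETENTION POLICY "seven_days" ON "mydb" DURATION 7d REPLICATION 1 DEFAULT'

# Then just throw line-protocol points at the write endpoint
# (measurement, tags, fields, nanosecond timestamp):
curl -XPOST 'http://localhost:8086/write?db=mydb&u=admin&p=secret' \
  --data-binary 'cpu,host=serverA,region=uswest usage_idle=92.6,usage_user=4.1 1434055562000000000'
```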
For that, we basically have a separate part of the database, which is essentially an inverted index. Now, when people think about inverted indexes, they normally think about full-text search, where you're mapping terms that appear in documents to those document IDs. In our case, we're mapping measurement names and tag key-value pairs to the series they appear in. So I'll show some of these discovery queries, and then some of the actual queries where we're doing computations on the time series data.

Here we can see what measurements exist. We can see what measurements we have for one specific host, where host is a tag and serverA is a tag value. We can see what tag keys we have, and what tag keys we have on a specific measurement. We can also see tag values: SHOW TAG VALUES FROM cpu WITH KEY = region shows us which regions we're actually collecting CPU values for, and likewise SHOW TAG VALUES FROM cpu WITH KEY = host shows you which hosts you're actually collecting CPU stats for. SHOW SERIES gives you all the underlying series, and you can filter the series down by tag key-value pairs; with a WHERE clause you can add predicates, some key equals some value, all that kind of stuff.

All right, so let's jump into some queries and show some of that off. Like I said, it looks kind of like SQL, so it should feel somewhat familiar, but it's a little bit different. In the first case, we're just getting all of the fields from some series for the last hour: basically, give me the last hour of time series data from that measurement. In the next one, we're saying: give me the 90th percentile of value from the cpu measurement for the last day, in 10-minute windows of time. Another thing we could do is add a GROUP BY, say by region or by host. If we did a GROUP BY host, we'd get a separate time series for each individual host that we have; at 10-minute windows over one day, that's 144 data points you get back per host.

As I mentioned, we support several field types, which is actually pretty unique to us as a time series database. Most time series databases only support int64 or float64, but we support strings and booleans as well. That means you can do interesting things with string fields: in addition to your metrics data, you can write in log and annotation data to give you more context, and you can match against a regex, for example. So if you're writing a bunch of log lines into this, you can say, oh, look at that. And because we have a time filter in there, that's actually a pretty efficient query: normally you don't want to grep through your entire log, but if you add time filtering, and potentially also filter by a host or a specific service name, this can be quite fast. You can also match a regex against a tag: give me the time series data for any host that matches this regex for this window of time.

So you saw PERCENTILE; that's one of the functions we have. We have a bunch of different functions in the language: min, max, percentile, first, last, all these different things. And we're adding to these quite a bit; we're doing a lot of work on the query language right now to expose a lot more functionality.
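Spelled out, the discovery queries and computations described above, along with the continuous query covered next, look roughly like this (a sketch in InfluxQL; the database, measurement, and retention policy names are hypothetical):

```
-- Discovery: what data exists?
SHOW MEASUREMENTS
SHOW MEASUREMENTS WHERE "host" = 'serverA'
SHOW TAG KEYS FROM "cpu"
SHOW TAG VALUES FROM "cpu" WITH KEY = "region"
SHOW SERIES FROM "cpu" WHERE "region" = 'uswest'

-- Computations on the data itself:
SELECT * FROM "some_measurement" WHERE time > now() - 1h
SELECT PERCENTILE("value", 90) FROM "cpu"
  WHERE time > now() - 1d GROUP BY time(10m), "host"
SELECT * FROM "log_lines" WHERE "line" =~ /error/ AND time > now() - 1h

-- A continuous query that downsamples into another retention policy:
CREATE CONTINUOUS QUERY "count_events_10m" ON "mydb"
BEGIN
  SELECT COUNT("value") INTO "mydb"."longterm"."events_per_10m"
  FROM "events" GROUP BY time(10m)
END
```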
So really quickly, that's what a continuous query looks like. The new stuff is the CREATE CONTINUOUS QUERY part: we name it and apply it to a specific database. The SELECT COUNT looks familiar, and the new thing is INTO, so we're basically feeding the result into a specific database and measurement.

So that's the actual database itself. But as I mentioned, we saw that people were facing common problems with time series data: they have to collect it, they have to store it, they have to visualize it, and they have to process it so they can monitor it. So we built a component for each of these. We have a collection agent called Telegraf; for visualization, dashboarding, and drilling down into things we have Chronograf; for storage we have InfluxDB, the database; and finally, for monitoring and processing the data, we have Kapacitor. Really quickly, I'll cover each of these pieces.

Telegraf is the agent. It's also open source, MIT licensed, and written in Go, and it's an agent you deploy across your entire infrastructure: you deploy it on every single host, and it can collect system metrics, but it can also collect metrics for well-known services, all that kind of stuff. We call those input plugins, and I think we have about a hundred input plugins at this point. Most of those are actually contributed by the open source community; we wrote a small number of them, and people continually contribute new open source plugins, which is good, because as the agent matures it just means there are more pieces of your infrastructure you can get visibility into. For instance, we have a bunch of plugins for collecting stuff on Windows boxes, and I don't think a single developer in our company has a Windows box, so we didn't write any of those. And then there are output plugins. We wanted Telegraf to be useful as an agent whether or not you're running other parts of our stack, so you can output the data to other things: InfluxDB, Graphite, Kafka; I saw SignalFx here, and they also have an output plugin for Telegraf.

For visualization, we have Chronograf. It's open source, but AGPL licensed. It's written in Go, and for the JavaScript portions we use React, with Dygraphs for the visualization. We think of it as a UI for administering the TICK stack and for doing ad hoc data exploration and visualization, and you can do dashboarding as well. The other thing you can do is create monitoring and alerting rules that get injected into Kapacitor: there's a point-and-click UI where you can say, monitor this server, and if the CPU goes above this threshold for this long, trigger an alert. In addition to all of that, it has a query builder and a TICKscript editor; TICKscript is the scripting language we created for Kapacitor. So this is what Chronograf looks like; this is an example dashboard we created. There's also a screen where, if you're running Telegraf and InfluxDB and Chronograf, by default you get this view of all the hosts in your infrastructure, and here on the right we have links to the different services we're monitoring that we have built-in dashboards for. But you can also create your own dashboards.
Our goal is that over time we'll have pre-canned dashboards for basically any well-known service, basically any Telegraf input plugin. Then this is the screen where you create an alerting rule. Here we can drill down into the different measurements and filter by tags, and we see an example of the data the rule is looking at. You can create different kinds of alerting rules: absolute thresholds, relative thresholds, and what we call a deadman switch, which triggers an alert if something stops sending data. You can do a lot more than this in raw Kapacitor land with TICKscript, but this is something we created so you wouldn't have to know TICKscript to create monitoring and alerting rules yourself. If you wanted to go more DevOps-style on it, you'd probably want your TICKscripts checked into code repos and managed through your normal infrastructure-management workflow. And then this is the data exploration screen. You can see the data in a graph, which we already saw in the dashboards; the other interesting thing, I think, is that you can also show it just as a table of the data you're getting back.

All right, the last piece, the processing piece, which is Kapacitor. It's also open source and MIT licensed, it's written in Go, and it's there to process, monitor, and alert on the data: you can alert on it, and you can act and execute on something that comes in. Like I mentioned, we created a DSL for Kapacitor called TICKscript, which is a declarative language that lets you build complex rules for either transforming your data or monitoring for conditions and triggering actions. It works for both streaming and batch: it can subscribe to the entire feed of data going into InfluxDB and process it as a real live stream, or it can work in batch mode, where it periodically queries the database and pulls back, say, the last hour of data. It also has the ability to store data back into InfluxDB, so you can use it to do transforms on your data; and if you're triggering alerts and events, those can be stored as time series back into the database too, so later on, when you're investigating a problem or reviewing these things, you have the actual alerts as time series data as well. And there's something we call user-defined functions. Our goal with Kapacitor is to provide as much as we can out of the box via TICKscript, so you can do all sorts of custom stuff, but we know we're not going to hit every single use case, so we wanted people to be able to write their own code to execute. We have examples in Go and in Python; basically, as long as your code can communicate over a socket and deal with protobufs, you can write custom code to do whatever you want. We have an example in our documentation where we use TensorFlow to do anomaly detection on time series data, run as a user-defined function in Kapacitor.
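To give a flavor of the DSL, a threshold rule like the ones described above might look roughly like this in TICKscript (a sketch; the measurement, field names, and the Slack handler are assumptions):

```
// Alert when mean idle CPU for a host drops below 10% (i.e. the host is pegged).
stream
    |from()
        .measurement('cpu')
        .groupBy('host')
    |window()
        .period(5m)
        .every(1m)
    |mean('usage_idle')
        .as('mean_idle')
    |alert()
        .crit(lambda: "mean_idle" < 10.0)
        .message('CPU pegged on {{ index .Tags "host" }}')
        .slack()
// A deadman switch on the same stream would be: |deadman(0.0, 10m)
```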
And then the other thing: historically, our system has been push-based; it's basically a push-based monitoring system. But we've seen all the work that's been happening with Prometheus, and the people who are fans of Borgmon and Google, so we wanted to start adding more and more support for the pull-based mechanism as well, so you can do both. Within Kapacitor we actually pulled in the Prometheus code for doing service discovery, so it can connect with your service discovery mechanism and scrape targets that expose metrics endpoints. So it will work for both. All right, that's the end of my talk. I'll hand it over to Tomas.

Hey, everyone. I'm Tomas Chaudhury, and I work at Oracle Data Cloud, specifically the Datalogix side of Oracle Data Cloud. I wanted to talk to you about how we've been using the TICK stack, everything from InfluxData, for quite a while now to monitor our infrastructure, namely the Mesos stacks that we give to dev teams to run their workloads. I wanted to start off with what my team and I focus on. Our job is really to lessen the infrastructure burden on dev teams as much as possible, the traditional DevOps kind of focus. When I got to ODC, we definitely had a lot of work to do to make our infrastructure services as turnkey as possible. We also didn't have all the teams using containerization: either they didn't have any ability to containerize their workloads, or they had been doing it while taking on the burden of managing their own infrastructure. So what we really try to do is provide those turnkey components for the business, and with that, the best thing you can do is integrate your metrics and monitoring right into the stack. So on the Mesos side, just like all the other parts of our stack, we've had monitoring hooked in with InfluxData from the get-go.

Our containerization stack: right now we're using a templatized Mesos cluster, and we provide teams a way to run heterogeneous workloads; each team gets their own cluster per cloud account. Previously we had the issue that we didn't support heterogeneous workloads, and it took a lot of effort on our part to get to a place where teams could deploy different kinds of big data applications. So now it's nice to know we can offer this kind of workload to dev teams, and we wouldn't have been able to do that had we not had the monitoring baked in to see how teams are utilizing the infrastructure and where we can make the Mesos stack as performant as possible. I forgot to mention earlier: we focus on providing Marathon and Singularity on top of Mesos; that's basically how we've allowed dev teams to containerize their workloads. We now have an internal watchdog service that monitors each individual Mesos cluster and looks for opportunities to scale the cluster up and down, which is really important for us in controlling costs. We take all of that event data and persist it back to InfluxDB, so we can graph it for the dev teams: oh, look, at this point my cluster's utilization went up, so the watchdog service went ahead and scaled up my stack. Additionally, we capture as many metrics as possible through Telegraf and persist them back to InfluxDB; I'll talk a little more about that.
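As an illustration, a scaling event from a watchdog service like that can be persisted as an ordinary line-protocol point (a sketch; the measurement, tag, and field names here are hypothetical):

```
cluster_scaling,cluster=team-a,action=scale_up agents_added=4i,utilization=0.87 1509372045000000000
```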
This is the stack, basically, in its current incarnation. We're running Mesos 1.3, the latest Marathon and Singularity, and Marathon-LB, and we try to manage ZooKeeper as best we can with Exhibitor. A little more about the actual stack: we're running a single, very large InfluxDB instance per cloud account, currently 8 vCPUs and 32 gigs of RAM. We have a default retention policy of 90 days, but dev teams can request larger retention policies for their own individual databases. Since we make the database instance multi-tenant, dev teams can send their application metrics, and if they want to keep their metrics longer, they're afforded the ability to do so. Right now we're working on migrating to Enterprise so we can have a more spread-out workload, a more sharded setup.

On the Telegraf side of things, we have Telegraf installed by default on every node. When I first started building this stack at Datalogix, we were initially using collectd; that's what I had the most experience with, and at the time Telegraf didn't have the plugins we needed. But it was nice to see how quickly InfluxData added a lot of the plugins we needed, covering all the different infrastructure use cases at Datalogix, so it was very easy for us to remove collectd and move to Telegraf, which gave us many of the plugins, like Mesos and Docker. And, like Paul was saying earlier, we have dev teams that deploy onto Windows infrastructure, so it was nice to have that dual purpose. With that, across the whole ecosystem of plugins, we've never seen any issues; they're very performant, and everything is written in Go. For our Mesos stacks, these are the key plugins we currently use: mesos, docker, zookeeper, and http_response. It's probably worth your time to go to GitHub and see the list of all the metrics that the Mesos and Docker plugins alone pull for you, because there are a lot.

Going forward with that, for our alerting we have Kapacitor. We templatize all of our TICKscripts so dev teams can very quickly get some base alerting, and then we work with them to figure out what other alerting paradigms they need, and we add more TICKscripts into our Kapacitor ecosystem. We've also given dev teams the ability to run their own Kapacitor containers so they can manage their own TICKscripts; that's been a nice, unique use case for us, making even the Kapacitor side of things self-service. In addition, we integrate all of that Kapacitor alerting into our chat and on-call systems, to get an end-to-end workflow from metrics gathering, to inspection of the database via Kapacitor, to sending any alerts that get triggered into whatever on-call system we have set up. For me, Kapacitor has been a very good Swiss Army knife; there are so many use cases for it that creativity is probably the only limiting factor. And I've been really happy to see InfluxData staying abreast of the industry, giving us the ability, when we potentially look at deploying Kubernetes in the future, to do some of the things that Prometheus is also doing currently.
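For reference, the key Telegraf inputs just mentioned can be enabled with a config along these lines (a sketch; the hostnames and URLs are hypothetical, and exact option names vary a bit across Telegraf versions):

```
[[inputs.mesos]]
  masters = ["http://mesos-master-1:5050"]

[[inputs.docker]]
  endpoint = "unix:///var/run/docker.sock"

[[inputs.zookeeper]]
  servers = ["zk-1:2181"]

[[inputs.http_response]]
  # Health-check an HTTP endpoint, e.g. Marathon's ping route.
  address = "http://marathon.internal:8080/ping"

[[outputs.influxdb]]
  urls = ["http://influxdb.internal:8086"]
  database = "telegraf"
```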
So, Chronograf: there's been a lot of work in that area by InfluxData. It's not the visualization tool we first selected when we started this work, but we're keeping it on the back burner. We're really waiting for a way to lock down the GUI so we have better administrative control, but I'm pretty happy to see what's coming down the pike with that tool. Grafana is the tool we're using for all of our visualization. We try to pre-build all of our graphs in Grafana, so in the case of the Mesos clusters we provide to dev teams, all of the graphs are already pre-built. We bring up the systems with our configuration management tools, and based on the cluster names presented to them, dev teams are very quickly able to see: these are all the metrics for my particular Mesos stack, and here are the Docker container metrics for the specific containers running on our stack. That has definitely been very helpful in making this self-service as well.

Beyond just collecting metrics, the big journey for us, like I was saying earlier, was not just giving dev teams the ability to use containerization, but really giving them a new platform for deploying business services. So I always like to try to answer, let me just build this slide out, the larger business questions. A lot of the interesting questions we had initially: if we give dev teams a Mesos cluster, are they going to be able to get good cluster utilization for each of those stacks? As an ops team providing all these clusters to the different dev teams in the organization, how do we see the total utilization across all those Mesos clusters, since we think of ourselves as a service team? How are we capturing those cluster-scaling events, are we actually graphing them correctly, and are we bubbling them up to the dev teams? Because when you're running your services on a Mesos platform, you need to be able to see whether that cluster has too many Mesos agents running or not. And how do we reduce our cost band, because that's very important to us? Sometimes we see that, even with our watchdog service, we're scaling the cluster, and, sorry, I lost my train of thought.

So, to backtrack: the big onus on us, like I was saying, is to provide these Mesos clusters, but the nature of our workloads is so disparate that it's been quite a challenge to find the best generic configuration. As we've built out these clusters, through all the metrics we've been gathering, we've learned a lot ourselves about how to make them more performant so they can serve many dev teams; we now have two dozen dev teams that we're able to support at Datalogix. Initially it was really hard, until we added all those Telegraf plugins, to get visibility into exactly what was happening in each of those clusters. And probably the one big question we always ask is: are we being successful at moving dev teams to containerization? How many services at Oracle Datalogix have we really moved over to containers? Answering that business question, whether we're actually making good headway into containers, has been really important to us as well.
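As an illustration, a pre-built Grafana panel templated by cluster name, like the ones described above, might be backed by an InfluxQL query along these lines (a sketch; the measurement and field names are hypothetical, and $cluster, $timeFilter, and $__interval are Grafana template variables):

```
SELECT MEAN("master/cpus_percent") FROM "mesos"
WHERE "cluster" =~ /^$cluster$/ AND $timeFilter
GROUP BY time($__interval)
```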
So here are some screenshots taken from the production side of our Grafana; hopefully you can see them. You can see that Telegraf gives you quite a bit of metrics. Here we're looking at the Mesos stack, at the cluster level; a lot of what you see in the Mesos admin GUI is basically being pulled out, and in this slide you can see the number of killed and lost tasks, your error rates, and, in totality, how many tasks have been running on this cluster. In this screenshot we're looking at our Docker stats; this is a screen grab of all the different containers running on this particular cluster, which makes it very easy for any dev team to see: OK, I'm running these particular Datalogix services on my cluster. The number of metrics we get from the Telegraf plugin on the Docker side of things has been very, very useful as well. Here's a dashboard we built for bubbling up the Marathon service health checks: you can see the three green boxes showing that Marathon, our front door, is operating well, and we have CPU metrics graphed below. And again, to showcase it, if you look at the top left corner there's a cluster dropdown, so every dev team can very easily pop into their cluster and see the state of their containerization stack.

Additionally, outside of Telegraf, we have our own services running and feeding data back into Influx. In this case, this is our own internal tool that we've been using to track all the Marathon applications running at ODC. Here we're just starting to build out a cleaner business intelligence dashboard, where we can hopefully give the leadership team a very easy place to see how many business services are running in containers. And lastly, I have a couple of screenshots here; in the top three graphs you can see a view of the latest incarnation of our Mesos clusters in production, and in the totality of all the clusters we've provided out most recently, we're only utilizing them globally at about 25%, so we have more work to do to get to a higher cluster utilization across the board. In our dev workloads, since we have more ability to run more quick jobs, we do see a higher cluster utilization currently, and even then we have some dev teams that, because of the nature of their workload, are able to hit almost 60% utilization, which is very nice to see.

So, in summary: why InfluxData? Overall, it's been a great metrics tool and ecosystem for us at Oracle Datalogix, and it really helps us keep fostering and providing end-to-end infrastructure services for the organization. It's also allowed us to capture our own custom metrics and keep pushing the needle on moving our dev teams over to containers. Lastly, the team at InfluxData has been a great partner, and the community has also been very collaborative; we've been able to give feedback and help them grow the ecosystem, which in turn helps us move forward as well. So thank you for your time.

[Audience] You're doing service discovery and you're running workloads in containers; how do you tell Telegraf where to find those?

[Paul] So, Telegraf is set up through configuration; it doesn't do service discovery, so essentially you
have to make sure that your deployment scripts and all that other stuff deploy it. Maybe you can talk about what you guys do for Telegraf? The service discovery stuff is really for pull-based scrape targets, so that you don't need a Telegraf agent; Kapacitor can do that, and it uses the exact same code that Prometheus uses for service discovery, so any service discovery target they support, Kapacitor also supports. It will actually do the job of scraping the data, and then, because it's Kapacitor, you can either process it and alert on it right away, or just forward it to the database, or do both.

[Tomas] Yeah, like I was saying, we install Telegraf on every agent, but for some of those nodes, like a Mesos slave or a Mesos master, given the naming convention we have for all of our infrastructure, we can do discovery through configuration management and then inject the particular plugins we want, like docker or mesos, onto those particular nodes. So we basically additively add more plugins depending on the node type.

[Paul] So the question was: can Telegraf send data to two different locations, or more specifically, can you have one input plugin send data to one place and another input plugin send data to another place? That currently isn't possible; it's on the roadmap. Right now Telegraf can send data to multiple locations, but it sends the data from all input plugins to both locations, or however many it is.

[Tomas] But yeah, that's good to know; we can definitely prioritize input-plugin-specific outputs.

[Paul] And you can absolutely run n number of Telegraf instances, for sure; for people who've asked for that kind of thing, for now, that's what we're telling them to do.

So the question was: I said we were going to add more functions to the query language, so what are those, what does it look like? The big effort we're doing right now is that we're actually creating a new query language that is functional. The structure of it looks very much like D3 or jQuery or something like that, so you can chain functions. The idea is that you can view your time series data as a data frame, and the functions are just transformations that you apply to it: selecting a range, windowing it into five-minute intervals, and then computing summaries on it. As we move to that, we're going to be adding more things. We'll be adding histograms as a function you can use, and I'm interested in bringing in workloads that currently you can only do in pandas or R, things like doing k-nearest-neighbors on a matrix of series to find which of these series are similar to each other. So we'll be doing more and more of that. Actually, a quick plug: we're having our InfluxDays conference in San Francisco on November 14th, and I'm giving a talk there which is going to be the first unveiling, in a talk, of the new query language and the new functionality. So yeah. Cool. Thanks, everyone.
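For reference, the chained, functional style described in that last answer might look roughly like this (a sketch; this follows the syntax of what later shipped as Flux, and the bucket and measurement names are illustrative):

```
from(bucket: "telegraf/autogen")
    |> range(start: -4h)
    |> filter(fn: (r) => r._measurement == "cpu")
    |> window(every: 5m)
    |> mean()
```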