Hello, and welcome to the Cortex Maintainer session. Let's introduce ourselves first of all. You've got myself, Bryan Boreham. I work for Weaveworks. I'm a maintainer of two CNCF projects, Cortex and CNI, and work on a bunch of other ones, Weave Net, et cetera. One of my hobbies is flying, and under dislikes I put out-of-memory errors. I hate those. Jacob?

Hi, I'm Jacob Lisi. I'm a software engineer at Grafana Labs, and I'm a maintainer on the Cortex project. My main like is low latency, and I dislike cardinality.

OK, so we're going to give you a little overview of the Cortex project, some news to bring you up to date. We're going to do a demo of a recent feature and talk about the roadmap forward. And we've left plenty of time for questions, so if you're watching this live, type them into the box on your screen and we'll get to them at the end of the video.

OK, what is Cortex? It is a store for metrics, by which we mean some quantity that changes over time. Cortex is scalable to millions, tens of millions of metrics. It's highly available and durable: it'll store your data for a long time. It's a centralized system that can be used by many teams or users; we call this a multi-tenant system. And it does take quite a bit of work to set up and configure. So if you don't really need the scale and you don't need the multi-tenant setup, then it might be that something like Prometheus, which Cortex is based on, is a more suitable tool. But listen along and you'll see how it works, and you'll be able to see whether it might be good for you.

So this is the big picture. We have Prometheus here on the left, three of them, just illustrating that you could have three different teams or three different customers of a system. Prometheus is gathering data from different sources, so those are your application programs, or exporters for things like the OS. It gathers the data and sends it up to Cortex. Cortex then stores the data and makes it available for querying, for instance in dashboards.

Let's drill into that a little bit more. The data arrives at Cortex and goes through a process called the distributor. The function of that is so that we can scale: we distribute the incoming data across as many ingester processes as you need. So you can run a handful of these, you could run 20, you could run 100, and each one can store a few million series, depending on how big your machines are. What the ingester does is capture the metrics that are coming in, compress them, and eventually send them off to the store. From the store, we have a querier process, and again, that can be scaled. Every single part of Cortex can be scaled horizontally to cope with more data and more users. The querier is what makes the data available to dashboards and so forth.

Cortex is multi-tenant. By that, what we mean is the data all the way through the system is tagged with who it belongs to, where it came from. So if these three Prometheus servers on the left are three different customers, then we tag the data as it comes in. We spread it out across different ingesters, depending on where it came from, and we hold it tagged with that identity in the store. So it always has its identity, its tenant ID. And that means when you come to query it, you will get the data just for that one tenant. That's why we stress it's really good for those situations where the data is separate amongst different customers or different teams, and they just wanna see their own data. Okay, that's Cortex in a nutshell.
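To make the sharding idea concrete, here's a deliberately minimal Go sketch of how a distributor-style component might pick an ingester for an incoming series. This is an illustration only, not Cortex's actual code: real Cortex uses a consistent-hash ring with replication, not a plain modulo, and the names here are made up.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickIngester hashes the tenant ID together with the series labels and
// maps the result onto one of N ingesters. Including the tenant ID in
// the key is what spreads each tenant's series across the ingesters.
func pickIngester(tenantID, seriesLabels string, numIngesters int) int {
	h := fnv.New32a()
	h.Write([]byte(tenantID))
	h.Write([]byte(seriesLabels))
	return int(h.Sum32() % uint32(numIngesters))
}

func main() {
	// Two tenants sending the same series can land on different ingesters.
	fmt.Println(pickIngester("team-bryan", `up{job="cortex"}`, 20))
	fmt.Println(pickIngester("team-jacob", `up{job="cortex"}`, 20))
}
```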
I'm gonna hand over to Jacob now for an update.

Hey everybody. So let's go over some updates on the project, and just a basic timeline of the project since it started. It was originally founded in June 2016 at Weaveworks by Tom Wilkie and Julius Volz. It became a CNCF sandbox project in September of 2018. Then as of last April, we released our 1.0 version, and later, last August, we became a CNCF incubation project.

There are a number of organizations that are adopting or have adopted Cortex. You may recognize some of the names on this list of current or past Cortex users; the ones highlighted in yellow are new users since the previous KubeCon.

In terms of contributing to Cortex, we have a hundred people who have contributed in the past. Hopefully we can get that a bit higher, so we can say we have hundreds of contributors. We have seven maintainers across three different companies, and our workflow is mainly driven through GitHub, specifically GitHub issues and GitHub PRs. Pretty standard there. For larger features, we have a proposal system where you create a proposal as a Markdown document and then open a PR against the main Cortex GitHub repository. We do a community call once every three weeks, and on the same cadence we also do a bug scrub call, usually among the maintainers, to help keep the issue backlog manageable.

We are looking to grow our community. Please, if you are interested, join the CNCF Slack, go to the Cortex channel, say hi. If you want to contribute, you don't necessarily have to start with Go code; you can start with docs or the Helm chart, for instance. If you want to work on the Go code, also feel free, we're looking for contributors there as well. If you're interested in learning more about the Cortex community, I highly recommend you go see Goutham's talk this Thursday on the Cortex community story.

In terms of project news, we've implemented a number of items on our roadmap. Specifically: per-tenant retention, which is basically deciding how long you want to keep data points for each specific tenant. Queries over multiple tenants, which I'm going to show you a demo of a bit later. Bulk loading data from Prometheus, so taking the blocks from your Prometheus node and uploading them to an object storage system for Cortex to use. And a horizontally scalable Alertmanager: the Alertmanager component of Cortex is now scalable in a horizontal manner, whereas previously horizontally scaling it didn't actually have that much of an effect; it was mainly just for HA.

Some items that we implemented that weren't on the roadmap: I should note shuffle sharding. If you want to learn more about that, you should go see Tom's talk on Friday, but it's a really powerful way of distributing load in a distributed system and basically managing capacity and noisy neighbor problems; there's a simplified sketch of the idea below. The other thing we also implemented is per-tenant query statistics, so you have a better idea of which one of your tenants is driving those really heavy queries and putting a lot of pressure on the system.
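Coming back to shuffle sharding for a moment, here's a rough sketch of the core idea, assuming nothing about Cortex's actual implementation (which operates on the hash ring and is zone-aware): each tenant gets a small, deterministic subset of the available instances, so a badly behaved tenant can only affect its own shard.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// shuffleShard deterministically selects shardSize instances for a
// tenant. Seeding the permutation with a hash of the tenant ID means a
// tenant always gets the same shard, while different tenants get
// different, randomly overlapping shards.
func shuffleShard(tenantID string, instances []string, shardSize int) []string {
	h := fnv.New64a()
	h.Write([]byte(tenantID))
	r := rand.New(rand.NewSource(int64(h.Sum64())))

	shard := make([]string, 0, shardSize)
	for _, i := range r.Perm(len(instances))[:shardSize] {
		shard = append(shard, instances[i])
	}
	return shard
}

func main() {
	ingesters := []string{"ing-0", "ing-1", "ing-2", "ing-3", "ing-4", "ing-5"}
	fmt.Println(shuffleShard("team-bryan", ingesters, 3)) // always the same 3
	fmt.Println(shuffleShard("team-jacob", ingesters, 3)) // usually a different 3
}
```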
So, block storage is now live. This has been going on for a while now on the Cortex project, where we've basically migrated the way you store data from a NoSQL backend to an object storage backend. As a project, we now recommend block storage as the primary way of using Cortex; we no longer recommend the chunks backend. Why? Well, blocks use the native Prometheus TSDB format, so it standardizes across the ecosystem and allows us to reuse code from Prometheus and Thanos. Also, object storage is generally cheaper or easier to use than NoSQL. It's definitely cheaper among commodity offerings; for running on-prem you would have to run your own TCO calculations, but we've found it to be more cost effective broadly. The third thing I should mention is that it's a bit faster on the read path, especially for pretty standard rate queries. It's also a bit more expansive in terms of its PromQL compatibility, because you can now run queries that don't specify a metric name.

So, some next steps. There are some limits that we had in chunks that we haven't ported over to blocks yet. The main thing I'm thinking of is the number of actual discrete samples per query. Being able to limit that on a per-tenant basis was pretty useful in chunks; we can't do that currently in blocks, but we are looking to implement it in the near term. The other major limitation with blocks is that they aren't as good at ingesting data that's old. Blocks are cut on a two-hour timeframe, and if you go beyond that it's a bit wishy-washy, but after a certain point it's very difficult to actually write data to blocks. You have to upload it using a backfill mechanism if you don't get the data in within about an hour and a half of when it was first created.

So let's talk about multi-tenant queries. This feature landed in Cortex 1.7.0. It was implemented by a coworker of mine, Christian Simon, so if you are interested in this feature, be sure to thank him. We've talked about how tenants are basically isolated databases within Cortex: you can have a PromQL database for each of multiple tenants, and the data between those tenants is isolated from one another. This allows different users in your organization to have different sets of data that are relevant to them. However, there are times when you want a global view of the data across multiple tenants. That's what multi-tenant queries let you do: essentially, query multiple tenants as if they were a single PromQL data source.

So let's actually see what this looks like in practice. This is my Grafana instance. Let's take a look at some of the data sources currently configured in it. We have three data sources: Team Bryan, Team Jacob, and then Team Bryan plus Team Jacob. Underneath the hood, Team Bryan is basically just a Grafana data source pointing at a Cortex instance running on my local machine, at the /prometheus endpoint, which denotes that a Prometheus API sits underneath this endpoint, with a custom header set, X-Scope-OrgID. This is how a tenant is referred to in Cortex. The value for this is obfuscated by default in Grafana, but it's essentially set to Team Bryan. So if I copy and paste this, it should be the same value. Team Jacob is set up pretty much the same.

So let's take a look at what those data sources look like. Check out the up metric, which you may be familiar with if you use Prometheus. We have three series in the Team Bryan data source: an up metric for the Prometheus job, the Cortex job, and the Minio job. Looks like they're all doing fine. Let's check out Team Jacob for the same query. We have two series, one for Grafana and one for agent, which is the Grafana Cloud Agent, basically just a scrape-and-remote-write version of Prometheus without the time series database built in.
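If you wanted to run the same single-tenant query without Grafana, it would look something like this minimal Go sketch. The port and tenant name are assumptions matching this demo setup; the grounded parts are the /prometheus endpoint prefix and the X-Scope-OrgID tenant header.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// Instant query for `up`, scoped to a single tenant.
	q := url.Values{"query": {"up"}}
	req, err := http.NewRequest("GET",
		"http://localhost:9009/prometheus/api/v1/query?"+q.Encode(), nil)
	if err != nil {
		panic(err)
	}
	// The X-Scope-OrgID header selects the tenant; "team-bryan" is a
	// placeholder for whatever tenant ID your setup uses.
	req.Header.Set("X-Scope-OrgID", "team-bryan")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // JSON containing only this tenant's `up` series
}
```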
So let's take a look at this Team Bryan plus Team Jacob data source. Again, the header is obfuscated, but let me just show you what this would look like. It's the same as before, Team Bryan and Team Jacob, but they're separated with a pipe character. So we set that header value there, and let's go take a look at the Team Bryan plus Team Jacob instance. Same metric, up, as before. And now you can see all of the up series between the two instances: you see the Team Bryan series here and the Team Jacob series here. So these are separate tenants, but we're able to view them in one consolidated query.

One of the ways this works is that we basically take the tenant ID and turn it into its own label, and you can actually use this label in your query. So if we were to do count of up by tenant ID, we would see that Team Bryan has three series for up and Team Jacob has two. As you can see, this is just a really powerful way of querying across multiple tenants. If you want to use tenants as a way of delivering a Prometheus database, isolating data for your users, but then still have that global view, this lets you do that in a really seamless way.
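Here's a sketch of that cross-tenant query outside Grafana, under the same assumptions as the previous snippet (port and tenant names are placeholders): multiple tenant IDs go in the same X-Scope-OrgID header separated by a pipe, and the injected tenant label, which Cortex's tenant federation exposes as __tenant_id__, can be used in PromQL like any other label.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// Count `up` series per tenant across two tenants at once.
	q := url.Values{"query": {"count(up) by (__tenant_id__)"}}
	req, err := http.NewRequest("GET",
		"http://localhost:9009/prometheus/api/v1/query?"+q.Encode(), nil)
	if err != nil {
		panic(err)
	}
	// Pipe-separated tenant IDs query both tenants as one data source.
	req.Header.Set("X-Scope-OrgID", "team-bryan|team-jacob")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	// Expecting one result per tenant, e.g. 3 for team-bryan, 2 for team-jacob.
	fmt.Println(string(body))
}
```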
So that was a bit of a contrived example in the Explore tab. Let's see what this would look like in a dashboard. I have this Go processes dashboard. It has two variables up at the top: tenant, where you can see Team Bryan and Team Jacob, and job, where we can see all of the applications we saw on the previous screen. This is a bit of a contrived example; each of these graphs is an aggregation across all of the jobs, and if we change some of these dropdown values, we'll start to see some of the graphs change. So if we just select Prometheus, you can see the graphs changed a bit, and this is the memory usage of Prometheus, and the memstats of Prometheus generally. If we were to change it back to Cortex, we see just the Cortex values. So what happens if we change the tenant value in the dropdown to Team Jacob? Now we're left with a selection of agent and Grafana, and you can see here, if we just select Grafana, the variables change. You can already get a sense that what multi-tenant queries allow you to do is have dashboards and data sources that are specific and useful to particular teams, but then still get that global view across the board if you need it. Again, it's definitely an organizational tool that you can use if you're interested, with Cortex.

So at this point, let's jump back to our slides and see if there are a few more things we can touch on before we end the presentation. Now that we've seen this demo, let's talk about some challenges for the future of the project. Cortex has multiple interconnected services. It requires a significant amount of tuning and balancing, and the tooling is a bit early-stage or focused on specific use cases. For instance, the Jsonnet library is very mature, but it focuses on the microservices use case. We are looking to improve the usability of Cortex from this perspective and make it easier to start with a simpler deployment as a new user, and not have to move straight to a deployment that checks all the boxes and dots all the i's. Once you have that in place, there's nothing stopping you from scaling further and moving towards a more complex deployment; it's just getting to that initial easy deployment that's something we're gonna focus on in the near future.

In terms of feature development, some things on the roadmap include exemplars. This is being worked on in upstream Prometheus. It's essentially a way of taking observations, or data points recorded by your Prometheus instrumentation, and associating them with trace IDs. So if you have a high-latency request that's instrumented in Prometheus, you can later take that metric and then look up an actual trace ID of a high-latency request. The other thing we're working on is improved scalability. This is gonna be pretty much a constant with the project: we always wanna push more active series, higher ingestion rates, higher queries per second. It's just something we wanna do as a distributed time series database. Then in terms of tooling and packaging, we have a Helm chart and we have a Jsonnet library; fleshing those out, making them a bit more powerful and expansive, and encompassing the single-binary use case, for instance, is something we're looking to do. Just make those libraries a bit more consumable so that new users can get started with the project more easily.

So if anyone has any questions, feel free to ask; Bryan and I would be happy to answer. You may also be interested in joining the Cortex Office Hours tomorrow at 11:15, or the Community Story session by Goutham, or the Shuffle Sharding session by Tom. But I hope you guys enjoyed the talk, and I'm happy to hear any questions you might have.