Hello everyone. Good morning, or almost afternoon. Thank you for being here at ThanosCon today. I hope you've built up some appetite for Thanos in addition to lunch. Now that we have some understanding of how Thanos stands as a solution for scaling metrics and longer retention, it is also important to understand how to manage this data in an efficient manner as it grows. One way of doing this is through multi-tenancy, which is the topic of our presentation today, or maybe a better way of calling it: the multiverse of Thanos.

A bit of introduction to begin with. My name is Colleen, and here's my colleague Jacob. Jacob and I have been working as software engineers at Red Hat for about a year, primarily in the observability realm, and we contribute to monitoring projects such as Thanos and Prometheus. This is also our first time attending KubeCon.

To begin with, let's understand through a definition what multi-tenancy is. It is a concept in software architecture whereby a single software instance can serve multiple distinct user groups. These user groups can be as fine-grained as a single user, or groups of users that come together in the form of teams, services, clusters, or organizations. We can broadly classify tenancy into two types. One is soft tenancy, also known as logical tenancy. Taking the analogy of an apartment complex, each of the flats shares common resources such as water and electricity; in technical terms, that means segregating your data and resources at the application or database level, while tenants share the same resources or physical hardware. The other form is hard tenancy, also known as physical tenancy, where instead you have a set of houses that each have their own separate utilities. There is isolation at the physical level, which means each tenant's data is stored on a separate physical or virtual server.
So now that we know what multi-tenancy is, we need to understand why we need tenancy. A disclaimer first: any resemblance to real life is purely coincidental, and we hope we do not offend any DC or Marvel fans; otherwise, you can blame ChatGPT for that.

So we have an international superhero alliance operating across the globe, managing different facets of superhero activities such as mission deployment, threat assessment, superhero training, and emergency response. Each team within that alliance, like normal engineers, makes use of microservices to maintain their specific domains, and they rely heavily on observability to ensure that their operations run smoothly and to quickly diagnose and address any issues.

Here we have Dave. He's a site reliability engineer and a multiverse guardian, and he prides himself on keeping the digital cogs running smoothly. It's the start of the month, and he's ready to tackle whatever challenges come his way. Then one morning his routine is interrupted by the cloud infrastructure bills. Even superheroes are humbled by bills at the end of the day, and he's wondering where these costs come from. Is there a secret mission running, or some alien invasion? He quickly starts digging into the cost explorer, but the numbers don't make sense to him and look as cryptic as ever. This superhero alliance sprawls across numerous services and microservices, each innovating at a breakneck pace. He realizes that without tenancy implemented in their observability, he cannot attribute costs or hold teams accountable for the resource usage they are incurring. It's like trying to solve a jigsaw puzzle without the picture on the box; they're basically flying blind with the company's credit card.

Okay, so hopefully we've convinced you that you need tenancy. Let's have a look at some of the challenges you might face, or some of the requirements you have, in a multi-tenanted setup.
Obviously the first thing that you want is data isolation. If you're operating across the multiverse, you don't want to end up in a confusing situation like these guys. You need to make sure that the data from different tenants is kept isolated and separate, and in Thanos we can use labels to do that; we'll see how that works in a bit.

The other thing you want to think about is resource isolation. Now you have a bunch of different teams or tenants that are fighting for resources, and you want to make sure that you don't have one bad tenant start using a lot of resources and interrupting the service for all the other tenants that are behaving well. So resource isolation is quite a good thing to have in a multi-tenanted setup, and we have some ways of implementing this in Thanos as well, which we'll look into slightly later.

Then there's the cost aspect. If you have multiple tenants, it's quite good to know which ones are actually using the most resources and costing you a lot of money, because you don't want the infrastructure bill to just run wild. You want to be able to attribute cost to different teams. Maybe you're even billing different teams or organizations, or even parties outside your organization, and you want an understanding of which tenants are costing the most on the infrastructure side.

Then there's scale. We've had a couple of talks on scale already, and when you have a multi-tenanted setup, you just want to make sure it's scalable, and Thanos hopefully works well for that, so I won't dwell on too many details there. And then there are some things to think about around security and compliance. We talked about hard tenancy and soft tenancy and the difference between them.
If you have really strict security or compliance requirements, or quite sensitive data in your database, maybe you want to think about doing more of a hard tenancy, with stronger isolation, rather than a soft-tenancy setup where things are closer together. It's a good thing to keep in mind when setting up a system and choosing your tenancy model.

So now let's look at how multi-tenancy is adopted in Thanos. Here we have a simple Thanos setup with its various components. There are two ways of ingesting data into a Thanos setup: through the Sidecar or through Receive. For the purposes of tenancy, we specifically focus on the Receive component for now. Receive, on the ingestion path, has had tenancy capabilities for a while, and as of a recent release we also have tenancy awareness in the Query component.

Let's look at the ingestion path first. You basically don't need any special setup; Receive will handle tenants as they come in. All you have to do is set an HTTP header with your tenant name on the remote-write client request. Receive then adds a label to the ingested metrics which corresponds to that HTTP header. When a new value is detected in the tenant HTTP header, Receivers will provision, start, and manage an independent TSDB for that tenant. The TSDB blocks are uploaded to object storage, such as S3, with a unique tenant ID, which allows blocks to be compacted independently for each tenant.

Now that we're able to ingest metrics through Receive, we also want to limit or control the amount of metrics as services grow and emit more. This helps ensure that no single tenant overloads the system or causes outages or disturbances for the other tenants. Limits can be set at a global level on the Receive component, as a default limit for all tenants, and also per tenant. Looking at per-tenant limits, these are set at the request level.
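As a sketch of the client side of this ingestion flow, a Prometheus remote-write configuration might look like the following. The THANOS-TENANT header name is the Receive default (it is configurable on the Receive side), and the endpoint URL is purely illustrative:

```yaml
# prometheus.yml (client side) -- endpoint URL is illustrative
remote_write:
  - url: "http://thanos-receive.example.com:19291/api/v1/receive"
    headers:
      THANOS-TENANT: "team-a"  # Receive turns this header into a tenant label
```

Every series written through this client then carries the tenant label and lands in that tenant's own TSDB on the Receivers.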
We have a size-bytes limit, which is the maximum size of an incoming remote-write request, with zero meaning no limit. Then we have the series limit, the maximum number of series in a single remote-write request for that tenant, and a samples limit, the number of samples in the remote-write request, summed across all the series that are part of that request. There's also a limit you can set on the number of active series across all replicas, and again this is at the tenant level.

Through the Receive hashring, you can also route the metrics coming from a specific tenant to a specific set of Receivers. This helps ensure that if you have, say, a critical tenant whose metrics you don't want to lose, you can route them to a dedicated set of Receivers. Recently a capability was also added to match tenants in the hashring by pattern rather than exact name, which is an interesting thing to take a look at.

Right, okay, so let's look at enforcement on the query path, or tenancy on the query path. This is a pretty new thing; it's the last big piece, I think, and it landed in the last release. There's a new flag on the Querier to enable tenancy enforcement, and it's off by default, so everything works as normal unless you turn it on. It works pretty similarly to the Receive side: you send an HTTP header to identify the tenant, and what the Querier actually does is enforce a label, and as we've heard, labels are a bit of a topic today. We're using prom-label-proxy in the backend, importing it as a library, so we didn't need much Thanos-specific code, which is nice. And that's basically how it works; pretty simple.
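As a minimal sketch, the Receive limits described above could be expressed in a limits configuration file like the one below. The field names follow the Thanos Receive limits documentation at the time of writing, and the numbers are made up, so check both against your Thanos version:

```yaml
# passed to Receive, e.g. via --receive.limits-config-file
write:
  default:                        # applies to every tenant without an override
    request:
      size_bytes_limit: 1572864   # max size of one remote-write request; 0 = no limit
      series_limit: 5000          # max series in a single request
      samples_limit: 10000        # max samples in a single request (sum over all series)
    head_series_limit: 100000     # max active series per tenant across replicas
  tenants:
    team-a:                       # per-tenant override
      request:
        series_limit: 10000
```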
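Conceptually, the label enforcement the Querier does via prom-label-proxy amounts to injecting a tenant matcher into every selector of the incoming PromQL query. The real implementation works on the parsed PromQL expression; this is only a naive, regex-based Python sketch of the idea, and the `tenant_id` label name is an assumption (the enforced label is configurable):

```python
import re

TENANT_LABEL = "tenant_id"  # assumed label name; configurable in a real setup


def enforce_tenant(query: str, tenant: str) -> str:
    """Naive sketch of tenant-label enforcement: inject a tenant matcher
    into every vector selector of a PromQL query. prom-label-proxy does
    this on the parsed query; this regex version only handles simple cases."""
    matcher = f'{TENANT_LABEL}="{tenant}"'

    def add_to_selector(m: re.Match) -> str:
        inner = m.group(1).strip()
        return "{" + (inner + ", " if inner else "") + matcher + "}"

    if "{" in query:
        # rewrite every existing label-matcher block
        return re.sub(r"\{([^}]*)\}", add_to_selector, query)
    # bare metric name, e.g. `up`
    return query + "{" + matcher + "}"


print(enforce_tenant("up", "team-a"))
# -> up{tenant_id="team-a"}
print(enforce_tenant('http_requests_total{job="api"}', "team-a"))
# -> http_requests_total{job="api", tenant_id="team-a"}
```

The point is that the query that actually executes can only match the caller's own tenant, regardless of what the user typed.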
There's a quick example here where we have a bunch of metrics. With no tenancy enforced, you can see results coming in from all of the different tenants, and if we enable tenancy, you only see the ones from your specific tenant. There's also a box in the UI that will show up; it didn't quite make the last release, so it will come soon. It will allow you to enter the tenant name so you can play around with it in the Querier UI. And yeah, that's the enforcement part.

Now, one of the cool things, now that we have tenancy awareness, and probably one of the main reasons we wanted it in there, is to get a bit of an understanding of how different tenants behave on the query path. This is something we couldn't do before. Previously we had a proxy in front to do the tenancy enforcement, but now that we have it natively in Thanos, we can start getting some metrics. There's one example here which is quite interesting. It's actually meant for measuring query latency with respect to the result size of the queries, but we can also use it like this. There are a bunch of different buckets you can see on the left-hand side, and on the right side is basically how many times a tenant made a query where the result is of that size. This first tenant is making quite a lot of small queries: you can see that basically all of the queries it makes return fewer than 100 samples and 100 series, so quite a small result set. If you look at a different tenant, this one has a more varied set of queries coming in, including some really, really large ones. The top ones you can see are actually hitting the infinite bucket, which probably means you should change your bucket sizes to get a better understanding of those result sizes.
But even if these two tenants are making the same number of queries, we can probably say that this second tenant is causing a lot more load on the system, and it's super interesting to start digging into this kind of data. There are a bunch of other metrics we can look into, including some at the store level, because we propagate the tenant all the way to the store, so you can start looking at which tenants have a higher utilization of the cache, and things like that.

It's also good to know that these metrics exist even if you don't enable enforcement. So even if you don't need enforcement for some reason, you can still get the metrics, which is cool; you just need to send the HTTP header correctly.

Great. So, just a couple more things on the architecture when you're setting this up. If you have layered Queriers, it's important to know where we're actually enforcing tenancy, and the thing to remember is that we only enforce it on the first-level Querier, the one hit by the PromQL query. So in a setup like this, if we enforce it on that first Querier, it will work fine. You might think you could enable enforcement just on the inner Queriers on the right-hand side, but that won't actually work: requests coming in below the outermost layer won't have tenancy enforced. It's a good thing to keep in mind. I'm not sure if this is what we want, but this is how it is for now at least; something we can discuss, maybe.

The other thing is that Thanos doesn't do any authorization or authentication itself. It just looks at the HTTP headers coming in and assumes they are fine. So you're probably going to want something in front, some proxy, that actually sets these headers and makes sure the correct tenants get the correct labels. We used the Observatorium API, but I'm sure there are a bunch of others that can work.
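Putting the query-path pieces together, enabling enforcement on the outermost Querier might look roughly like this. The flag names reflect recent Thanos releases and should be verified against `thanos query --help` for your version:

```yaml
# Querier container args (sketch)
args:
  - query
  - --query.enforce-tenancy                # enforcement is off by default
  - --query.tenant-header=THANOS-TENANT    # header that identifies the tenant
  - --query.tenant-label-name=tenant_id    # label injected into every query
```

Remember that this only helps if something in front, such as an auth proxy, guarantees the header is set truthfully.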
Looking at the future, now that we understand how tenancy is implemented in Query and Receive, there is also some future work that could be taken up. Like the limits in the Receive component, we could also set up limits in the Query component, which would prevent any single tenant from monopolizing shared resources and ensure fair usage across all tenants. That way you can prevent scenarios where expensive queries degrade performance or cause noisy-neighbor issues. Also, for the purposes of analytics, monitoring, or billing, a capability could be added for cross-tenant queries, aggregating data from multiple tenants under controlled circumstances. This becomes useful for administrators or services that need to aggregate data across their systems. Lastly, at the end of the day, it's all about optimizing your resource usage and the metrics that you're sending, so that you're not overwhelming your systems, which boils down to infrastructure costs. So an important aspect is assessing the resources consumed, including storage, compute, and data transfer, so that you can allocate and budget your observability usage.

Going back to our use-case scenario: Dave has finally implemented multi-tenancy across his alliance. Having recognized the challenge of spiraling costs, he made sure that each team takes responsibility for its own metrics and resource usage. And yeah, thank you. Thank you for listening, and here are our social handles; please feel free to connect. I'd be happy to take any questions if there are some.

Question: Performance-wise, is there a difference between using the tenancy queries compared to just including, say, the tenant label in the query?

Could you repeat that?

Question: Fundamentally, is there a difference between a tenanted query that uses the interface with the header, versus just including the tenant label in the query yourself?

Hopefully not.
I mean, the Querier basically just adds the label in front of the query. We haven't done any measurements, but it should be pretty quick to just inject the label. Yeah, you shouldn't face any performance issues within Thanos itself, but it depends on what you're using as a proxy, since the proxy will authenticate, assign the tenant to the query, and then forward the request. So that's the only tenancy-related overhead.

Question: Is tenancy enforcement applicable to Prometheus instances or other store components, or is it only for Receive right now?

Yeah, so I guess any time you have a tenant label, it will be enforced. If you ingest data in some other way and you add the correct labels that the Querier will understand, it will just enforce the label.

Question: Are there any scenarios where you should create separate Receive instances instead of doing multi-tenancy in one Receive instance?

Not specifically for tenancy, I think. It could be that you want to have different tenants on different Receive hashrings, because some of the tenants need dedicated instances since they're ingesting a lot of data. But otherwise, I think you can run everything on a single instance.

Cool. Anyone else? Going once? Going twice? Okay, thanks, Colleen and Jacob.