Welcome everybody, welcome to the talk that we will present today: banking observability at scale. My name is Arjan Leike, I'm a father of two, and in the weekend I'm a volunteer at a dog school where I teach students about dog behaviour and training. And with me is... Yeah, I'm Salvatore Vitale. I'm Italian, living in the Netherlands for around 10 years. If you wonder if I speak any Dutch: just a little bit, enough to get some information at the kindergarten about my two-year-old daughter. I hope that she will help me with the Dutch in the future. I'll let Arjan continue the talk.

We are here on behalf of ING. ING is a bank with a global presence, but by far our biggest presence is in Europe. Our job within the organisation is quite an interesting one, because only about two years ago we were still using monitoring systems like Tivoli and HP OpenView, and we identified that open source was the way forward. So we started working, and what we want to provide is a self-service platform for our engineers, so that they can simply spin up the observability that they need.

In order to design such a system, we had to think about some concepts behind the design. If you're familiar with OpenTelemetry and the pipeline within the collector: what we basically did is take that pipeline out of a single process and spread it out over multiple processes. The first stream that we identified was metrics, and that's what this talk today is about.

If you talk about design drivers, everybody knows about scalability and those kinds of things. The thing that I really want to highlight today is the architecture building blocks, because everybody knows that in an enterprise it makes sense to reuse those building blocks, but even more so in a bank, because there's a thing called risk and compliance, and we are very concerned about risk and compliance. If you walk on our floor where we work, you hear things like risk appetite, risk controls, in-control statements, risk scores. That means that if you design a system within the bank, you basically have a choice: either reuse the architectural building blocks, where the risk is managed for you by a different team and the day-two operations are managed, or roll your own. So it's a trade-off between the best technical solution, or the best solution for the bank, and also the effort that you have to spend in order to support your system.

So today we're going to talk about how we handle metrics. We created a self-service platform for this, and we like to call that the Reliability Toolkit, together with Single View. Before we dive into how we solve metric observability, we have some numbers. These tend to change on a day-by-day basis, because we still have teams onboarding. Currently we are handling the load of around 5,000 Prometheus instances and around 1,700 Grafana instances.

If you look at how we handle scalability, we basically went the sharding route, because each computer system has limits, and they can come in many forms. An egress interface can have a bandwidth limit, a cluster might have limits, even a metric store might have a limit. So in order to overcome those limits, we shard. So basically we have the ING container hosting platform.
It's a Kubernetes environment, and what we do is create namespaces in that environment. There we also have a control plane, and the control plane is basically responsible for allocating the workloads. That can be done in different ways: it can be done with a placement policy, and the placement policy can either say, okay, this team should be pinned to a specific shard, or we're going to use the least-used shard that we have available to us.

The container hosting platform is stateless, so that means we don't have persistent volumes. That also means that we need to store our state somewhere else. So what we did is deploy Mimir in the ING private cloud. Mimir is deployed in monolithic mode on VMs, and we have a hardware appliance with an S3-compatible interface, and that is what we actually use to store the metrics.

So that handles scalability, but what about availability? The way that we handle availability is quite simple: each time you request a deployment, a logical deployment, you actually get two physical copies, one in data center one and one in data center two.

So let's have a look at the shards and what's in there. If we zoom in a bit on what resides in a shard, we first see the security proxy. The security proxy is something that we built in-house, and it's basically a reverse proxy in front of the Prometheus interface and the Grafana interface. The proxy handles a couple of things. First, we present the consumer with the deployments that they have access to, and it also handles the authentication. For that we delegate the authentication to Active Directory via OpenID Connect. In the access token that we get back, we have a lot of group memberships, and each deployment is also tagged with the different groups that can have access to it. By intersecting those two pieces of information, we know what to present to our consumers in the interface.

Then we have a couple of shared components within each shard. The thing that we use to load balance is Envoy. Envoy sits between Prometheus and Grafana, and it basically ensures that we balance over our Mimir cluster. Then we have a shared Alertmanager, and that Alertmanager will forward the alerts that it receives to a custom-built service, the MDPL service as we like to call it. That service basically has a webhook that understands the Prometheus Alertmanager payload, and we use that to forward the alert.

Then we have the federation service, and the federation service is what we use to move data from collection to processing and delivery. Basically we use labels. If the consumer has a desire to use a specific capability later in the pipeline, think about anomaly detection, error budget reporting, trend prediction, they can tag their data set with a specific label. It will be picked up by the federation service, and the federation service will use remote write to write to that same MDPL service. The MDPL service will forward all the data it receives to Kafka, and Kafka is then used to transport certain metrics to the different capabilities. For example, the long-term metric store, because we actually have multiple Mimir clusters. We want to keep tabs on the footprint that we have as a company on monitoring data, so what we basically expect our users to do is label the metrics that should be kept for a longer retention period, so that we move that data into a different store with a much higher retention. One way such opt-in labeling could look on the consumer side is sketched below.
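As a minimal illustration, here is a hedged sketch of what that labeling could look like in a consumer's Prometheus configuration. The `capability` label name, its value, and the matching rule are assumptions for illustration; the talk does not show ING's actual label schema or the federation service's routing rules:

```yaml
# Hypothetical sketch: the `capability` label and its values are invented
# for illustration and are not ING's actual schema.
global:
  external_labels:
    team: payments            # identifies the owning team

scrape_configs:
  - job_name: payments-api
    static_configs:
      - targets: ['payments-api:8080']
    metric_relabel_configs:
      # Opt these series into a downstream capability: attach a label
      # that the federation service can match on.
      - source_labels: [__name__]
        regex: 'payments_request_duration_seconds.*'
        target_label: capability
        replacement: long_term_store
```

The federation service would then pick up series carrying such a label and remote-write them on to the MDPL service, which forwards them to Kafka for transport to the matching capability.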
Then we have a custom alerting backend, and of course our analytics and reporting capabilities. So that concludes the overview. Now my colleague will show you something more about the orchestration.

Thanks, Arjan. So yeah, let's dive a bit into some technical aspects of what we want to focus on today, which is basically how we provide Prometheus and Grafana as a service to our engineers. As Arjan mentioned, we maintain almost 5,000 to 6,000 Prometheus instances and almost 2,000 Grafanas.

How does this work? On the left you see the engineer that is basically sending a request to what we call the resource manager. This resource manager, following the approach that Arjan highlighted at the beginning of this talk, is a component within ING that we want to reuse. Within ING it's called the Touchpoint Automation Framework, and if you want to know more, some ING engineers will explain a bit more about it in their talks from tomorrow on. Basically it works like a sort of Kubernetes API where you can send a manifest, and this manifest contains some resource definitions, so the resource manager is aware of what sort of resource the consumer, in this case the engineer, wants to create.

The resource manager abstracts away some logic for us. First of all, it does all the authorization and authentication, again against the Active Directory that Arjan showed in the previous slide. But it also runs Open Policy Agent policies: as the provider of a service, we can use OPA Rego files to specify what kind of constraints we want to put on the resource. For example, one of the constraints we have is that if the resource is for a production environment, we want the four-eyes principle, so another engineer or team member needs to approve the pipeline with which the resource will be created, updated, or deleted.

After the resource manager has done its job, so it has received the request and knows there is a Prometheus resource or a Grafana resource to create, it sends it to our queue, where in our control plane our controller picks up the message and does some extra validation. If everything is all right, it stores the state in our DB and sends a message to the orchestrator in the right shard, either the shard with the least utilization or a dedicated shard, as sketched below. We have platform teams, like the Kafka teams, that require a high number of pods, and for those kinds of teams we want a dedicated shard. The orchestrator then spins up a pod, which communicates with the Kubernetes API and creates either a Prometheus or a Grafana pod, depending on what the consumer is requesting.

At this point the user can use the resource manager CLI to request the status of the request. Once it's completed, you see that the status is completed, and you can log in to the security proxy, where you will see an overview of whether the Grafana or Prometheus instances are in a ready state, and you can basically visualize and access the UI of Grafana and Prometheus.
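As a minimal sketch of that shard selection, a placement policy could look something like the following. All field names and values here are invented for illustration; the talk only states that a team can be pinned to a specific shard or placed on the least-used one:

```yaml
# Hypothetical placement policy; field names and values are invented
# for illustration and are not the actual ING control-plane schema.
placementPolicies:
  # High-volume platform teams get pinned to a dedicated shard.
  - match:
      team: kafka-platform
    strategy: pinned
    shard: shard-13
  # Everyone else lands on the least-utilized shard.
  - match: {}
    strategy: least-utilized
```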
But this is nice on paper, so we have prepared a small demo, in which I will show exactly this process from the engineer's perspective, using the CLI. Before starting, I would like to say a bit about what the scope of the demo will be. In this demo, we will create a Prometheus instance using the resource manager CLI, and then we create a second Prometheus instance, just to highlight the fact that scraping is completely decoupled from visualizing. Then we will start a Grafana instance, where this Grafana will automatically have access to the data scraped by the Prometheus instances that were previously added. And finally, we will also apply a dashboard using the resource manager.

I would like to mention that internally we refer to Prometheus as Reliability Toolkit 2, and to Grafana as Single View, which are basically our internal names in the resource manager definitions. And finally, we will not use kubectl; as I said, we will use touch-ctl, which is the internal product that we use at ING.

Okay, to avoid issues with VPN and stuff, security in the bank, I just recorded the demo, but you can trust that this video has not been manipulated. I can show it to you live later, in case.

So let's start. This is a Reliability Toolkit 2 resource, so a Prometheus resource definition. And then, oh, sorry, it's not showing. Good catch, Arjan. Maybe I need to exit the presentation mode. Yes. Okay, I'll start again. Yeah, good catch, guys, we were losing a lot of nice information.

So I will start with a Prometheus resource definition. You can see here some similarity to a Kubernetes manifest, where most of it is just the resource definition that the resource manager needs. Here I'm giving a name, which is the demo for the conference. Then we have some specs regarding the deployment model, in which region our consumer wants to deploy. Then we have some configuration; you will recognize the Prometheus configuration. In this case, for the sake of the demo, we are just scraping Prometheus itself. Then we have some alerting: in this case, we want to be alerted if the Prometheus itself is down.

As you can see here, I'm using touch-ctl, and I'm applying the manifest. You can see that it returned that the state is pending. After that, I can retrieve the state of my request with a get command, and here I see that my request has been completed and has been deployed to shard 13. I can then access shard 13, where I have my overview page, and here I can see that the pod is spinning up, and I can monitor it until it is in a ready state.

Here I'm creating a second instance, which is exactly the same, just with a different name, so I'm not bothering you with the manifest again. Here we see that the first instance is already in a ready state, while the second is currently being spun up by Kubernetes. Now it's ready, so we can dive a bit into the Prometheus UI. Here we look at the configuration: you can see that, in fact, Prometheus is scraping exactly itself on localhost. Then, automatically, we configure the remote write to our Mimir cluster, so that's completely managed, and you will also see that the simple alert that I put in place is there.
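For readers following along without the recording, the manifest in the demo could look roughly like the sketch below. The apiVersion, kind, and field names are hypothetical reconstructions from the narration, not the actual Touchpoint Automation Framework schema; only the overall shape (resource boilerplate, deployment model, a scrape config, and an alert rule) is taken from the talk:

```yaml
# Hypothetical reconstruction of the demo manifest; apiVersion, kind,
# and field names are invented, only the overall shape follows the talk.
apiVersion: observability.ing.internal/v1
kind: ReliabilityToolkit          # internal name for a managed Prometheus
metadata:
  name: demo-for-the-conference
spec:
  deployment:
    region: eu-west-1             # which region the consumer wants to deploy to
  config:
    scrape_configs:               # plain Prometheus configuration
      - job_name: prometheus
        static_configs:
          - targets: ['localhost:9090']   # scraping Prometheus itself
  alerting:
    groups:
      - name: self-monitoring
        rules:
          - alert: PrometheusDown          # alert if the Prometheus itself is down
            expr: up{job="prometheus"} == 0
            for: 5m
```

As shown in the recording, applying such a manifest returns a pending state, and the get command then reports when the request has been completed and to which shard it was deployed.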
That concludes the first two steps, so we have two Prometheus instances, and now we can create the Grafana instance. Again, another manifest, where most of it is the boilerplate for the resource manager. In this case we call it single-view-for-the-demo, and the manifest is simple because the specs are just the deployment model; all the other logic is handled by us, so we keep it simple for our engineers. Again, we apply the manifest using the apply command. We first get the status pending, and if I now go to the proxy for Single View, I see that my pod is getting into a ready state.

Now the consumer can access the Grafana for the first time. Here it will have a default dashboard, in which you can see in its data sources which Prometheus instances it can access. You can recognize that this Grafana automatically has access to the two Prometheus instances that I previously created.

Now that I have the data, I want to create a simple dashboard to show that all this makes sense and works. So we have another resource definition: the engineer has already created a dashboard, exported it into JSON, and applies it via the CLI. We need to say in which Grafana instance we want to apply this dashboard, and then, yeah, the rest is just the JSON. Once the dashboard is applied, it will be shown in a few seconds in your Grafana. Now you have this dashboard that has been applied, and you can see that, in fact, the two Prometheus instances have already been producing data for a few minutes, and these data are available. Of course, engineers at ING have way more complicated targets to scrape and way more complex dashboards, but for the sake of the presentation, this basically explains the entire flow that our engineers follow. And of course, our engineers don't actually use the command line; they use the pipeline version of it that we run on our cloud. So this basically concludes the demo. I'll go back to presenting mode.

So yeah, next steps. For the coming future, short term, what we would like to do is migrate all our recording rules and alerting to Mimir. For that, we are waiting for the possibility to run Mimir in microservices mode; at the moment, as Arjan explained, we are currently running on VMs with the monolithic approach. And then, in line with the whole conference today, we would like to move towards OpenTelemetry Collectors for our managed solutions. That's basically the trend for us as well, so we are happy to see today that we are going in the right direction.

If you want to follow other talks from ING, tomorrow our colleagues will be pretty busy with very interesting talks, so if you are interested, I invite you to follow them. You can follow us on the various channels of ING, or contact us on LinkedIn; you can find our profiles on the website of the conference. And that's it. Let's conclude our talk.

Wow. That is an awesome thing. 5,800 Prometheus instances. That blows my mind. We have a few minutes for questions. Any questions for the ING folks? Oh, yeah. We have a question there.

Why do you need 5,300 Prometheus instances?

That's a good question. The thing is, we concern ourselves with least privilege, right?
So if we would collect all the data and put it in one big bucket, then basically everybody could see potentially sensitive data. Within our IAM model, we made it explicit that we want to separate what each team can see, so basically each team gets its own Prometheus instance. And maybe not only one, but multiple, right? If they have a large environment and they want to monitor a lot of it, one team might even have like five instances. Then we also have development, test, and acceptance environments, so they might spin up multiple containers for multiple purposes in multiple environments. That number adds up quite quickly. Yeah, to give you some numbers: for example, the Cassandra team, which runs a managed database for the entire bank, uses maybe 25 Prometheus instances for their own service, because they have maybe 25 million active time series.

Any other questions? Oh, yeah, we have one here.

Hi, why did you decide to wrap all the CRDs in your custom CRDs? What are the properties that they are giving you?

When we started to design the system, we had two choices: we could either use something that was already available within our organization, or we could roll our own. The idea at first was to just use operators, like you suggest, and then just apply the resource definition at the operator level. But due to this risk and compliance, teams tend to restrict the openness of a system, right, to ensure that we as a bank can stay in control of what happens with that system. And then, yeah, it's a trade-off, right? Like, okay, are you going to provision your own Kubernetes instance? Do you have the engineers to do the day-two ops for that instance? Do you have the engineers to do risk and compliance, run the pen tests at regular intervals, do the evidencing of all the risks? So it's basically, as I mentioned earlier, this trade-off between using something that's already within the bank or rolling your own solution.

Cool, we are out of time, but yeah, the speakers will be around.