Welcome, everyone. My name is Mark de Jong. I'm a platform architect at ING; I've worked there for about 14 years, in all kinds of different infrastructure positions.

I'm Rob. I'm an engineer on the support team. I joined ING about five years ago, so quite a bit more recently than Mark. We're here to tell you a bit about quality: the quality of our consumers' workloads at ING.

But first a short introduction. At ING we have our own private cloud. To support cloud-native workloads in our environment, we have our own container hosting environment: the ING Container Hosting Platform, ICHP for short. ICHP is a managed Kubernetes platform based on OpenShift, and we provide namespaces as a service. This way our consumers don't need to worry about the underlying infrastructure, and they don't need deep Kubernetes knowledge. To provide namespaces as a service, we build on the multi-tenancy capabilities of OpenShift, and we take them a bit to the extreme. Above all, we try to make sure there are no noisy-neighbour effects for anyone. We have also put a lot of network policies in place to disable connectivity between different namespaces, so each namespace is its own trust boundary in this model (see the sketch below). To implement all of this, we have our own controllers and our own auto-scalers for the quota environment. Those are actually being partially open-sourced this week at KubeCon, so everyone should be able to make use of them in the near future.

As ING is in a regulated industry, because we're a bank, we have a large and strict risk and compliance framework. To take away the burden of all these controls, we made this platform available for everybody to use. We reduced the cognitive load of our developers so they don't need to worry about running their own Kubernetes cluster. And we've had quite a bit of success in this area: we started the platform in 2018 and went live in 2019. We first shared our story a while back at OpenShift Commons a few years ago, and we gave an update last year at OpenShift Commons in Detroit.

To say something about the size: we're currently running over 2,500 namespaces with well over 25,000 deployments and over 45,000 pods, spread across about four clusters. Cluster sizes vary between 28 and 200 nodes, and all of it runs on bare metal. A really important requirement for running on ICHP is that your workload needs to be immutable and stateless; we don't support any state in our clusters. This is really important to keep in mind, especially when the next slides come up and Rob explains a bit more about that.

As ING is a bank, we're a large financial institution with locations all over the world, a lot of developers and many, many business domains. We need to make sure our environment is available, and as we're in a regulated industry, there's a lot of compliance and risk we need to take care of. For our consumers, like I said before, we need to make sure we're available: everybody needs to be able to use their mobile app or the website to transfer money or pay for something.
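Coming back to the namespace-as-a-trust-boundary model: below is a minimal sketch of the kind of default-deny isolation described above. It illustrates the generic Kubernetes NetworkPolicy technique with a made-up namespace name; it is not ING's actual policy code.

```yaml
# Minimal sketch: deny all ingress except from pods in the same
# namespace, so the namespace acts as its own trust boundary.
# The namespace name is a placeholder, not an ING convention.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace-only
  namespace: team-a
spec:
  podSelector: {}          # applies to every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}  # allow traffic only from this same namespace
```

A policy like this, stamped into every namespace by a controller, blocks cross-namespace traffic by default; any exception then has to be granted explicitly.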
Last year, however, during a normal workday, the mobile app was down. Many of the Dutch people here will probably know: when that happens, ING always pops up in the news. The ICHP team immediately dove into the clusters to check: is it our cluster? Is the cluster failing? But the weird thing was that the cluster was fine. Everything was working, and the uptime had actually been 100% for the entire year. There were no failures in the cluster. So we had to dive into what the actual cause was. I can't tell you the actual cause, but during our investigation we came across a lot of interesting findings in all the deployments that were running and in how people were actually using our platform. Those findings, and what we did with them, are what we're going to elaborate on.

So when ING is down, for whatever reason we can't tell you, it will be in the news and people will be talking about it. We had some incidents last year, and the question was raised by upper management: can we improve this quality? Because it seemed to come down to the quality of what our consumers were running; the platform itself was basically never down, it was just working. Could these incidents be prevented in the future? That was the big question. Are all our users, all the teams within ING, using the possibilities of Kubernetes to the maximum and building resilient containers? Are we doing the right thing? We had no idea as a team. Keep in mind, it was a big blind spot for us, and not only for us, but also for all the technical teams and the managers within ING.

So we decided to set a baseline for what all deployments should look like. And that's a bit difficult, because what about exceptions? Not every deployment is the same. We needed some basic checks, but keep in mind we're talking about 25,000 deployments. We started with the basic question: how do you scan all of that? What we found out is that bad practices on non-production were simply copied to production, because they were already there and that's easy. You build something in development and you just copy it. I see you laughing; that's standard practice for most of us. So we decided to scan everything, from development all the way up to production. If development is good, your production will at least be better too.

To get good insight into the whole topic, we scanned everything: 2,500 namespaces, 25,000 deployments and over 45,000 pods and images. That's a lot of information. We collected it all, processed it and made it available in a more readable format, because the raw output was a couple of gigabytes. What we found was a bit of a surprise. We know the developers at ING are quite capable, so that wasn't the issue, but there were a lot of gaps towards a containerized platform. They were not using all the options available within Kubernetes, and they were not building resilient containers. They simply forgot things. They were used to old-fashioned virtual machines and just copied their code over. But deploying something on a virtual machine differs from deploying it on a containerized platform. Some teams didn't even think about it and tried to solve issues themselves, which is the wrong approach, because all the options are already there. We had to make everybody aware of that. So we created a dashboard, and in this dashboard, like I said, it's still 2,500 namespaces and 25,000 deployments, spread across some 600 to 700 development teams.
To keep the dashboard simple, we only included the information we used in the baseline. We put simple things on it. For example, within ING you should never, ever touch your namespace by hand. If you access it without going through a pipeline, you get a bad score. As soon as you access your namespace like that, we call it tainted. We started a scoring system where the higher the score, the worse the quality you provide, and we put a really high score on tainting. That should shake people up: don't access your containers directly, use proper pipelines.

We also looked at simple things like host and zone anti-affinity. That's an easy fix; it's just three or four lines in your deployment. Teams were not using it, and we don't know why, but how do you make them aware? Other simple things are liveness and readiness probes: you need to think about them, and once teams started thinking about them, that alone solved a lot of issues. If you have a good readiness probe and a pod fails, traffic simply goes to the remaining replicas, provided you actually run replicas. So we also added a score for not running replicas. It's quite simple; we just kept adding more scores. As you can see, the output is really simple. That's it for now. Mark.

As Rob mentioned before, we did the investigation and started giving insights to our consumers, based on the dashboard you see here. In a previous version it was just an HTML table; it wasn't nice. We started explaining to our developers what they would need to fix. Like Rob said, there were some easy fixes: anti-affinity, replica counts, liveness and readiness probes. All of these were scored. For each of these parameters we wrote down the effect of setting it, what it would mean for your deployment. Next to that, we gave example code showing how it should look in your deployment, so teams could just copy-paste the more generic settings (a sketch of what such a snippet can look like follows below).

Earlier today we had a panel discussion here on platforms. Within ING we use a lot of platforms and have built a lot of platform teams, and we as a platform provide capabilities there as well. One of the bigger ones is a generic CI/CD capability called One Pipeline. That team provides Git, pipelines, artifact repositories and all kinds of other things, and it also provides compliance for your deployments. It's also the only way into our platform. But One Pipeline is itself a platform, so other teams can build on top of it. One large team within ING builds a generic build-and-deploy pipeline, which includes everything for your risk controls and an ING-specific implementation of a service mesh. We worked together with that team to implement the best practices, so everybody who did their next deploy automatically had all kinds of fixes applied. When that landed, I think about 60 to 70% of all workloads were automatically fixed on a lot of the settings, because everybody just uses that template for their deployments. That was a really, really easy win for us, so it really helps to have all those platform teams in this context.

Making things better is always fun, but also a challenge. The previous part was a wake-up call for most of the teams. They had high scores, and nobody wants a high score. So teams even started competing with each other not to have the highest score within their department.
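To give an idea of the kind of example code we handed out, here is an illustrative deployment snippet combining the scored settings mentioned above: replicas, host and zone anti-affinity, and liveness and readiness probes. Names, paths and values are placeholders for illustration; this is not ING's actual template.

```yaml
# Illustrative snippet of the scored best practices, with placeholder names.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 3                        # scored: a single replica costs points
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      affinity:
        podAntiAffinity:
          # host anti-affinity: never co-locate two replicas on one node
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: example-app
              topologyKey: kubernetes.io/hostname
          # zone anti-affinity: prefer spreading replicas across zones
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: example-app
                topologyKey: topology.kubernetes.io/zone
      containers:
        - name: app
          image: registry.example.com/example-app:1.0.0
          livenessProbe:             # scored: restart a hung container
            httpGet:
              path: /health
              port: 8080
          readinessProbe:            # scored: only receive traffic when ready
            httpGet:
              path: /ready
              port: 8080
```

With a readiness probe in place, a failing pod is taken out of the service endpoints and traffic shifts to the remaining replicas, which is exactly the failure mode described above.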
It was fun for us to see some teams struggling, but it also raised questions towards us, sometimes crazy ones. The users were now aware of all the features present within Kubernetes, and those were only the easy fixes, so there were still some challenges left. But the first goal was accomplished: we created awareness within the teams at ING.

We also got a lot of questions from teams saying that some of our rules, or the scoring system, did not apply to them. We had to look at those, and sometimes we created exceptions, for example for a monitoring tool. But those are all generic services provided within ING, so one team was providing that and we could easily rule them out. That left fewer false positives in the report.

The next version of the dashboard uses a database at the back end, where all the information is stored. We can now easily add a new scoring rule: just by pressing one button we add a new score, and the scores are updated automatically, in real time. The most important thing for us is that the quality is really improving. We made people aware of it, and I guess that was our main goal. I think we did it quite well within just a couple of months.

An important thing I actually forgot to mention: because these are best practices and most of them are generic, we also enforce them as policy as code. So if you deploy onto the platform without them, you first get audited, and in the end you get blocked, because you simply can't deploy anymore. That's always a fun thing to do as well, making sure you're securing your own platform. Simple things like an image being older than 120 days: we put a score on it. Failing pods were just forgotten by teams; they had pods that were failing and never looked at them again. Now they do look at them, because they get a score for it, and you want your score to be as low as possible.

So that's a bit of the background of what we did. We gained a lot of insight into our environment and into the maturity of our development teams, and we were happy to help bring everybody up to that level. With that, we'll take questions. Over there.

We don't provide any storage. In our design we have two distinct use cases, and the use case we're talking about here is completely immutable and stateless, so we don't provide any persistency for any workload.

The application was simply lifted into a container, and those usually like to write a lot of detailed logs and so on. That's definitely another one. We started out sending all our logs to the generic logging environment, which we stopped, for reasons: we filled up the entire logging environment within three days. We stopped forcing everybody to push logs to standard out, so everybody now logs to their own environment. But specifically for emptyDir and other types of ephemeral storage, we limited usage to five gigs (see the sketch below). So nobody can use more than five gigs to do anything locally, and then it's gone. Five gigs is a lot for a typical application. We run fully on bare metal with about 16 terabytes of storage per node, so with a limit of five gigs per pod and up to 160 pods on a node, you stay within the amount of local storage we have.
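For the ephemeral storage cap mentioned above, here is a minimal sketch of what a per-pod five-gig limit on emptyDir can look like. Names and paths are placeholders; the exact mechanism ING uses isn't described in the talk.

```yaml
# Minimal sketch: cap local scratch space at 5Gi per pod.
# If the pod writes more than the sizeLimit, the kubelet evicts it.
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  containers:
    - name: app
      image: registry.example.com/example-app:1.0.0
      volumeMounts:
        - name: scratch
          mountPath: /var/log/app    # placeholder path for local, ephemeral logs
  volumes:
    - name: scratch
      emptyDir:
        sizeLimit: 5Gi               # the five-gig cap from the talk
```

With up to 160 pods per node, this caps worst-case local usage at 800 gigabytes, comfortably within the 16 terabytes per bare-metal node mentioned above.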
We at least saved our platform from that side, but indeed, it's not a good use case to log locally in a pod. Of course. The reason I'm asking is that a lot of my customers like to run the minimum disk size for their nodes, because they think nothing is stored there at all, so why even bother putting in something bigger. Well, we catered for that in the design: we count the number of pods we want on a node, limit it to a specific number of pods, and make sure that at least doesn't break the node.

Is this published somewhere? No, I don't think so; I think it's just in my head. It's at least a good practice to keep in mind when designing a cluster. Thank you. Any more questions? Last question. There are two hands over there; you can decide between the two of you.

No, but we're actually thinking about publishing some of the parts we built. We are publishing some other things on Thursday, and this might be an extension of the code being published there. So it might happen in the future.

Yes? Yes. I was saying that the platform doesn't ship any logs. We require our consumers to log directly to something like a Kafka bus; the consumer itself needs to publish its logs somewhere directly, not via the platform. That was it. Thank you very much.