So, hello everyone, thanks for coming to my talk. A little bit of an agenda: I've made a few chapters so you have some idea of what I'll be talking about, and it helps me and everyone else keep track of where we are in the story. A little bit about me: my name is John, which is somewhat obvious because it's in the slides, and I've been doing various things in the hardware and software world. I've been around for a little bit, and at some point you go from hardware and operating systems to software. You go from software to, well, a little bit of service-oriented thinking and architecture, and before you know it, you're doing microservices and microservice architecture.

So, Wehkamp is probably not very well known. I should probably have put the company name on the first slide as well, but what can you do? We're a Dutch company that only operates in the Netherlands, so that's why many of you will probably have no idea who we are. We're a digital department store. I've stolen this fancy slide from our corporate deck to give you at least some idea of what we're doing. What isn't in the slide is that we have relatively large vertical integration for such a relatively small company. We have our own logistics centers, our own software engineering teams, our own infrastructure teams, and that is partially because it has grown out of historical habits. We've been around for a while, which means we have also done a lot of migrations. This is what the public usually sees, a bunch of catalogs, because, well, in the '50s there are no computers, so how do you tell people what you're selling? So it starts out with a bunch of non-technical things, just paper and Rolodexes and typewriters. At some point your business grows and you need these things, which means that you are now forever bound to legacy systems, because that was hot back in the day. Fast forward a few years and it's no longer the new cool thing. Fast forward a bunch more decades and we are in the here and now, which for us means mostly AWS, plus a little bit of on-prem, which is actually self-hosted in our own data centers. If you think about it, that is also kind of weird, but it has been very helpful and very beneficial for us over time, so, well, that's why we're still doing it.

So today it'll be mostly about the migration from Mesos to Kubernetes. We have had this platform for about nine or ten years, and it's getting a bit old and long in the tooth, and we have to maintain it ourselves. The Apache Software Foundation said at some point, well, we're going to move Mesos and Marathon into the Attic, which is not great: no more security updates, no more feature updates. And to be honest, there haven't been any real feature updates for, let's say, five years, so it's been problematic. There's also a lot of custom work in the old system, because back when we created it, there wasn't anyone to give us all the systems that we needed, so we had to build them ourselves. All the people that made that stuff no longer work at Wehkamp, so now we're doubly screwed: it's no longer supported, and we don't know who made it or how it's made. So we wanted to use something new, something that is well supported, and something that has the features that we want without having to develop them ourselves. How do we do that? Well, it's somewhat obvious: we use Argo and we use Kubernetes.
So to get from the old situation to the new situation, we had to figure out what we can do and what we must do. On one hand, we want to do all these migration things. On the other hand, we only have three people to do it with. And even if you're not a very big company, you very quickly end up with between three and four hundred microservices, with all sorts of development teams who rely on you to deliver them a platform that works and keeps working. So we went ahead and figured out we need to do some sort of requirements gathering. Now, mind you, the people in our team have been with the company almost as long as the furniture, so they tend to think more along the lines of: we go to this vendor, we buy their box, we put it in a rack, and that solves your problem. That's not how it works.

So when we first needed to gather the requirements, I figured it would be best if we just started talking to people, which can be very scary. When you talk to people, at least in our company — and that's product teams, developers, and, well, everyone who wants to have a say in things — everyone has their own variation of what they need and want, but a list of a thousand variations on the same theme is not very helpful. So we distilled them into four sort of golden rules.

First: when a developer builds their code, they deliver a container image to us, and we have to make sure that when that happens, we take that container image and put it on the platform. Second: when that happens, the developer expects that their image, their service, is available at a particular URL, and that needs to keep working as well. Those two things are in the old platform, they need to be in the new platform, they need to be there at the same time, and also during the migration. So, yeah, it's old system, new system, but everything has to be the same, which is somewhat awkward. Then there are two other rules that we made. Third: don't make anything worse, which is more like the inverse of what we actually mean, which is don't introduce scope creep. It's very easy to say, well, all this old stuff is bad, so we're going to destroy all the old stuff and make everything new. But when you do that, you are going to have a three-year project, and when it's done, it's already too old to be used. So we couldn't do that. It also means that if we have something bad in the old system, it's okay to put it in the new system, as long as it's not more bad. Or more badder.

The last one, equally important: don't make everyone rewrite their code. The old system has this very interesting traffic pattern, which isn't necessarily that new anymore, but back in 2012, 2014 it was very new. It means that the microservice that a developer wants to release or use expects that the traffic is handled for them. There are all sorts of steps, and at some point the traffic ends up at their service and they don't need to do any thinking: they get this nice HTTP header that says, well, it was customer X who wanted to do thing Y. All of that needs to keep working; there's a small sketch of it after this paragraph. Now, that is not the only thing we need to do with traffic — if you are interested in traffic, come to my Istio talk after this. There are also a couple of self-serve features that we needed to support.
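To make that identity-header rule a bit more concrete, here is a hedged sketch of the kind of request a service might receive once the platform has done all the traffic handling for it. The path, host, and header names here are hypothetical; the talk doesn't name the real ones:

```
GET /orders/123 HTTP/1.1
Host: orders.internal.example
X-Customer-Id: 42
X-Requested-Action: view
```

The point is that the service itself does no routing or identity logic; by the time the request arrives, the platform has already attached everything the service needs to know.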
So there's Marathon, which I mentioned before, which is essentially the GUI-like version of your task information and task scheduling. It's not very GitOps, because if you click around in the GUI, you can change things and they persist — unless, of course, your cluster goes down, and then your settings are lost. So we needed something like this, but not exactly like this. And we had all these other things that we needed to make sure are available and working.

So what we did is, of course, we made an iceberg, because everyone likes pictures of icebergs. We gave the picture to the business teams, because they like pictures, and within our own team we made this list of things that we wanted — because developers want things, but we also want things. I mean, we have a very small budget and a very small team working on this. The entire platform team is about seven people, and we could only spend about half of them on the new thing and half on the old thing. So, not a lot of manpower, which means that in this list we want to remove all the things that are problematic if you don't have enough resources to do everything by hand. Which also means you have to automate as much as possible.

Now, with these rules in mind, we were thinking: this is enough of a recipe. It's not a super detailed technical plan of what you're going to do, but it's enough. So we went ahead and built a platform out of it. How did we do that? We first looked at what was definitely bad in the old one. The old one was one repository to rule them all, which configured the entire platform — all of the service dependencies, everything — because back in the day, monorepos were a very cool thing, because Google did it and therefore we must do it too. That turned out to be a very bad plan. So we split it up into three elements; I'll show a rough sketch of that split in a moment. There's the stuff that's the same for everything: for example, everything needs a network, so we're not going to treat the network as something special, but we're also not going to pull it in with the stuff that changes frequently, like applications. And then there's the stuff that's not super special, but special enough, like: you need a Kubernetes cluster.

So how do we then — sorry, something is buzzing. All right. So, what's the same for everything? Well, that's one of the boxes in the sandwich on the slide just before this. The things that are the same for everything are the resources, or resource types, that we are going to have the most of. So we're thinking: what if we can reuse the knowledge we already have, and the things we already have that are not super bad — which excludes Mesos and Marathon. Well, we had a couple of things that are probably well known to at least some of you. The green icon maybe isn't: that's the spreadsheet. Because although lots of things are automated, we have an on-premise network team, and they want to be the kings of all the network subnets. And how do they do it? In a spreadsheet. We can't upgrade the spreadsheet to some fancy IP address management solution, so we're keeping that. But all the other things here allow us to automate everything from AWS account creation to network management.
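As a hedged sketch of that three-way split — the repository names here are made up, but the layering follows what's described above:

```
platform-base/        # the same for everything: networks, AWS accounts, DNS
platform-clusters/    # special enough to stand alone: the Kubernetes clusters themselves
service-deployments/  # changes frequently: per-application configuration
```

The idea is that a change in the frequently-changing layer never forces you to re-evaluate the slow-moving foundations underneath it.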
We can do releases, we can roll forward and backwards. The little island at the bottom, that's Atlantis, which allows you to do Terraform in an automated and collaborative way. So that's very nice, but that leaves us with the runtime facilities, which are the things you need to have your service actually work. So we had a look at what we already know and already have. That's Jenkins, because some things come into your life at some point and just stick to you like tape. And we also needed something to schedule our containers, which was not supposed to be Mesos and also not supposed to be Marathon.

We were already looking at Terraform, which is HashiCorp, and they were like: well, we have Nomad, look at Nomad. And we looked at Nomad, and we thought: Nomad looks really cool, but it needs all this custom work around it to make it work. So essentially we'd just be exchanging one problem for another; we didn't want that. Then we looked at Fargate, which is what AWS pushes as their super deluxe container solution, and at least when we started, about a year ago, it wasn't featureful or mature enough to do what we needed with our self-serve and our traffic management. So that wasn't really an option either. Of course, during this period where we were looking at products and systems, the obvious choice would be Kubernetes, but making the obvious choice without any good reasoning behind it is not necessarily the best way to make your choices. We ended up with it anyway.

We also took the somewhat obvious next step: we have these things, but we don't have a delivery system yet. Container scheduling is all nice, but if you look at Kubernetes, you are generally looking at a command line, or at a visualization that happens to be made by AWS, and in both cases that isn't necessarily what you want to present to your developers. You want something nice and easy to use. So what do you do? You go to the CNCF landscape, which is not necessarily nice or easy to use either. However, it is a whole lot better than all the other options out there for going through a list of products that are available. We just focused on these areas because, as the label says, it's about orchestration and stuff, and that's what we wanted to do.

Looking at orchestration, you have a bunch of projects. Within our team, we didn't know any of these products per se, but what we could do is mess around with them a bit, read the docs, run them locally, see how they work. Some of them have public demo dashboards, and you can use those to essentially — I thought that was my PagerDuty going off, but all right. You never know: production's down. Well, actually, I'm third-line on call today, so, great, excellent timing. But we messed around with the features a bit, and, also very important, looked at whether the documentation is any good. Because if you have something that is very cool and does all the things for you, but you have no idea how to use it, it's still problematic.

Now, Argo ticks all the boxes. I could make a very big story about how we tried Flux and all the other things, but it's Argo; of course we're using Argo. And we're using it in combination with Helm and GitHub. We don't allow kubectl — I think lots of people say "kube cuddle"; in the Netherlands we don't, I don't know why — no kubectl access for developers. And that's two-fold.
On one hand, we don't have the manpower to manually fix everything or help everyone with very detailed problems. On the other hand, we also want to make an experience that is so automated, so useful, that you do not need to manually intervene. So it's a trade-off between automation and self-serve. But Argo solves all of that, so, super great. One of the key components there is the ApplicationSet. I know a couple of people — probably a whole bunch — are using Argo without application sets, but application sets were very important for us; a little bit more on that later.

And then there is the layer cake in the middle, where we say: well, we want to roll all of this out. You need a whole bunch of components to make your clusters do what you need them to do, and you don't want to do it by hand. Because when you do it by hand, it takes a lot of time, something goes wrong, and mitigating any problems also takes a very large amount of time and work. So we automate all of that, which means cluster creation, Argo installation, all your resources, all your CRDs — everything needs to be automated. You push one button and everything needs to appear. And you don't need to do it just one time; you need to do it like 50 times.

So how do we do that? That's where Argo helps a lot. In our structure, we have a global namespace, and there's an environment called control, and the only thing that one does is spawn more controls. And what do the other controls do? They manage runtime environments. A runtime environment is where your actual workloads, your microservices, are hosted. Those are essentially controlled by namespaces — little buckets where one small Argo CD has its own runtimes and another small Argo CD also has its own runtimes. If one goes down, the other stays up. This is not necessarily the best-known pattern for Argo, but it works very well for us.

So what else did we put in here? We had this idea of AWS namespaces, which means that we have a group of accounts that together are responsible for a very specific set of features, a business domain. In our case that is somewhat of a holdover from the old Mesos infrastructure, because it tended to break a lot. If you have experienced lots of breakage, you are going to partition your stuff a bit and make sure you have different failure domains: if one thing goes down, it doesn't mean everything else also goes down.

Now, that's not the only thing we put in there. We also have dev and prod environments, as you do, but we don't have one dev and one prod. We have an unlimited number of devs and prods, which means that when you as a developer want to deploy — well, it used to be that you had to pick and choose your cluster. You'd say: I want to run on cluster one in production, and then I want to run on cluster two in development. That's not very developer friendly, because you have to give your developers a very long list of all the possible environments and then hope that they find the right one to deploy to. We didn't want that. So we added a bunch of labels and a bunch of GitHub repositories, and we said: if you at least know in which of the namespaces you want to deploy, then we'll handle the rollout to the correct cluster. And that is very easy to do with Argo CD. So we have a bunch of clusters, and they all have labels and parameters.
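As a hedged sketch of how that can look in Argo CD: cluster registrations are just Secrets, and any extra labels you put on them can later be matched by an ApplicationSet cluster generator. All names and label keys here are made up for illustration:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: cluster-hera-dev
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster  # marks this Secret as a cluster registration
    environment: dev                         # hypothetical label: dev vs prod
    aws-namespace: retail                    # hypothetical label: the business domain
stringData:
  name: hera-dev
  server: https://ABCDEF.gr7.eu-west-1.eks.amazonaws.com  # placeholder API endpoint
  config: |
    { "awsAuthConfig": { "clusterName": "hera-dev" } }
```

Because the labels live on the registration itself, a developer never has to know the cluster name; they only express which namespace and environment they want, and the labels do the matching.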
So this is a short list of some of the clusters that we have. It used to be that our technical products were directly linked to, say, a domain name or a product that we were selling. That's not a very smart idea, because your website might change, you might buy one, you might sell one. You might have products that come in, products that go away, and you would have to change your system names all the time. We didn't want to do that. I was of the opinion that we should use random hexadecimal numbers. Everyone else hated that. So we are using names from the Greek god family tree, which is somewhat of a classic example.

What you get is an environment, which is also an AWS account and also a Kubernetes cluster, and it has a couple of tags. So you know which environment it is, which namespace it is, and also which system namespace it is. That is very important, because we have Atlas, which is what we named our internal platform. And, well, if you think about it: when we serve a development platform to a bunch of developers, they are developing on it, but we have to make sure it works. So for them it's development; for us it's production. We need to make sure it always works, which means that when we want to develop something, we essentially have to make a mirror environment. Which is why, if you mirror the word Atlas, you get Salta, and, well, that's how we did that. On one hand it means you get a whole bunch of accounts and a whole bunch of environments, but since we automatically generate them and they are pretty cheap to make, it solves the development problem and the automation problem for us. We can destroy something, rebuild something, and nobody is going to have any problems with that, because it's automated and you won't even notice it. So, great.

Now, from the developer's perspective, you have a microservice and you have a bunch of components, and these components need to be supplied by us so the developers can consume them — even if the arrows say something else, let's just think about it that way. What it means is that when you release your microservice, you expect all these facilities to be there. So we just went back to the landscape, of course, because that's what you do, and you pick out all the components that you need. Some of them we were already familiar with: you have Thanos and Prometheus, which are very good for your metrics, you have Fluent Bit for your logs, and for your traffic you need ExternalDNS and Istio, so you can get your traffic in and out. And that works great, which means that the only key component missing is how to let the developer actually consume and use all of this.

Well, that's where this comes in. You have your application set, and this is a very short, very invalid bit of pseudo-YAML, but we essentially combine two generators: you have your matrix generator wrapping a list generator and a cluster generator. That means that if you serve this resource, which is an application set, Argo is going to create all sorts of applications based on that set, matching all of the possible combinations that we have here. So let's say we have two development clusters and two production clusters, and we say: for every dev and prod variation of the cluster that we know, we would like to set the replica count to one or two, depending on which environment you're in.
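As a hedged reconstruction of that pseudo-YAML — the repository URL, chart path, and label keys are made up, but the matrix-of-list-and-clusters shape is the pattern described:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: my-service
spec:
  generators:
    - matrix:
        generators:
          # The list generator supplies per-environment values...
          - list:
              elements:
                - environment: dev
                  replicas: "1"
                - environment: prod
                  replicas: "2"
          # ...and the cluster generator fans out over every registered
          # cluster whose labels match that environment.
          - clusters:
              selector:
                matchLabels:
                  environment: '{{environment}}'
  template:
    metadata:
      name: 'my-service-{{name}}'  # {{name}} comes from the cluster registration
    spec:
      project: default
      source:
        repoURL: https://github.com/example/my-service-deploy  # hypothetical repo
        targetRevision: main
        path: chart
        helm:
          values: |
            replicaCount: {{replicas}}
      destination:
        server: '{{server}}'
        namespace: my-service
```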
That means that with this little bit of code, you get four deployments that are all configured correctly. Now, of course, you don't have one value; we have like 20 values, so the file is a lot bigger. But the old way was: you go to one repository for the service configuration, another repository for the environment configuration, yet another one for your scaling configuration, and then you have your Dockerfile, which contains a very long label that you also need to edit. So it used to be that if you are a developer and you want to change something, there are like five different places you have to go. Not very great. This: very great. I mean, not everyone loves YAML, but it works. So, very cool.

So, what else was remaining? Well, there are of course things that we don't have in Argo yet. We actually tried Crossplane, but it ended up slowing us down a little bit. What we really wanted is that if you have an application set for the application, you can deploy your database with your application in the same application set. The application set is based on a Helm chart that we maintain, so in theory we could just add additional resources to the chart, and as soon as the developer updates the chart release they're using, they get access to these additional resources. For now, some of them are, well, not automated. For example, we have a couple of cron jobs. They used to be in, I think it was Chronos — yeah, Chronos, a project from the Mesos and Marathon era that you can use to schedule tasks. Very custom. Nobody knew how it worked, but if you stopped it, the company stopped working. So what do you do? Some of the jobs we could translate into Kubernetes CronJobs. And CronJobs are also visible in Argo, which is great. So although we have them, they're not in our normal system, which takes a little bit of extra work. But because you can actually see that there's a CronJob, and it spawns Jobs, and you can see the logs, you suddenly have this visibility.

Now, we are very close to time, but there are a couple of bonus options, because, well, what do we do with the remaining time? We can do one of these things, but not all of them. So, yeah, let me know what you want to do. Questions? Okay, questions it is. Raise your hand.

Hi, thank you first of all. I was interested why Crossplane slowed you down in your development process.

Well, currently when you use Crossplane, you install Crossplane and you install a provider of choice. Now, the provider will install a bunch of CRDs — in our case, 300. And those 300 CRDs, because we use application sets for everything, for all system facilities as well — when we install Crossplane, Argo has to sync 300 CRDs. And every time you sync something, you're actually doing two things: you're not only installing the CRD, you're installing the intent of installing a CRD, and then it also creates the CRD itself. So you essentially have two items that Argo is constantly going to try to reconcile, and if you do that with 600 resources, it's pretty slow. Right now, if we deploy an entire new AWS account, cluster, Argo CD, everything, it takes about 20 to 25 minutes, and that's very important for us to keep. (I think I'm doing something to the microphones. Yes? No? Anyway.) The way we used Crossplane was for bucket provisioning and RDS instances.
It worked, but because it was so slow, we couldn't reliably cycle resources in and out. So that's why we had to stop using it for a while — at least until someone patches in the ability to pick and choose which of the CRDs within a provider you want to use. If you, for example, only use buckets and databases, you'd also pull in, I think, security groups and policies, but that would be about it. So, five CRDs to pull in: much better. Yes?

And what was your solution to the Crossplane problem then? Anything else?

Well, we were already using Terraform, and it used to be one big bucket where everyone did everything, but we split that up. In the old days, if you had 300 provisioned databases and you wanted to change one, Terraform was going to check all 300 databases, which takes a very long time because they have lots of separate resources. It was automated using Atlantis, so that was kind of okay, but at some point everyone started complaining that it took too long. By splitting out the environments that are specific to only application details, every developer can essentially use Terraform to provision that resource. And then — this is something very dirty, I don't recommend doing this at home — you take all of the data that Terraform outputs and you put it in your cluster secret. You get a very big cluster secret, but it means that you can refer to all the information inside the cluster secret in your application set. Because we want to share information between Terraform and Argo, and there currently is no better way, as far as I know, to read external information into your Helm values during the rendering of an application set. So, yeah.

Thank you very much. More questions?

Thank you for your talk, it was very interesting. It's very similar to what we're doing, but I have a question for you: isn't Argo CD a bit of overkill for a dev environment? Does it mean that you push to master, your default branch, first, and then deploy to a dev cluster?

Yes, in a way it does mean that. We actually wanted to make — I wanted to make, I'll take the blame — separate repositories, but people were worried that we would end up with ten dev repositories and ten prod repositories, and that every developer would then have to go to 20 repositories to manage their files instead of five. So while it would technically be better — you could use branch-based deployments, or you could use separate repositories — we also wanted to make sure that a developer's life wouldn't get worse. It needed to get better. What we're planning to do: I think there's a plugin called the Argo CD Lovely Plugin, which allows you to use both Kustomize and Helm charts at the same time. That would allow us to have one repository where you do your deployment, but use Kustomize overlays to specifically modify only a dev environment or only a prod environment. So that is one way we're looking at it. But, yeah, it's still in development. We are running production — we've been running production for a couple of months, we've had zero alerts, and the old platform has been down twice in that time, so we're better. And yeah, that's good. Thank you.

Yes — so we don't really have time, but there's one very cool thing.
If you use the kube-prometheus-stack, it comes preconfigured with all sorts of very cool alerts and Prometheus dashboards and such, which might not all be useful for everyone by default. But there's one very cool thing that I would never have thought of: there's a dead man's switch built in. You have an alert that fires continually, and you send it to, for example, PagerDuty. And then, if PagerDuty stops receiving the signal, that's when you get the alert. So that's how you know that your monitoring stack is actually working. It's very cool.
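For context, a hedged sketch of how that pattern can be wired up: kube-prometheus-stack ships an always-firing alert (named Watchdog in recent versions), and the Alertmanager configuration routes it to a heartbeat receiver. The webhook URL here is a placeholder:

```yaml
# Alertmanager configuration fragment: route the always-firing alert
# to a dead man's switch service that pages you when the signal stops.
route:
  routes:
    - match:
        alertname: Watchdog
      receiver: deadman
      repeat_interval: 1m   # keep re-sending so the heartbeat stays alive
receivers:
  - name: deadman
    webhook_configs:
      - url: https://deadmansswitch.example.com/heartbeat  # placeholder endpoint
```

Excellent, let's give them a round of applause. Thank you.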