All right, hello everyone, and thank you for coming to my talk. I have a little agenda: I'll introduce myself and the company, and then I'll present our migration case, because how we got to Istio is part of how we architected our solution. Let's get right into it, because there are a bunch of slides and we want to make it in time.

A little bit about me. I've been at Wehkamp for quite a while, and in tech for quite a while. I've done pretty much everything from hardware to software engineering to architecture, mostly as a growth path. At some point I noticed that DevOps and microservice architecture interested me the most, so I stuck with that, and then started doing that at Wehkamp.

Wehkamp itself is probably not very well known outside of the Netherlands, because we are essentially a digital department store in the Netherlands. So unless you live there, you probably haven't bought anything from us. I stole this slide from the corporate deck just to give you an idea of the size of the company. We're not that big, but for the Netherlands we're doing okay. There are also things that aren't on the slide, such as our own data centers embedded in our warehouse. We're also in the cloud. We have our own warehouse, our own software engineering teams, and our own platform teams, which, as far as I've heard, is somewhat unusual for a company that's not super big.

We've been around for about 70 years, which means we also have a bunch of legacy. This is what you usually see as an outside customer looking in: a bunch of catalogs, and at some point a website. But if you cut this down to two phases, you have the phase where there's no technology, like the 50s. There's no Istio in the 50s, just paper and phones and Rolodexes. At some point your business grows and you end up with punch cards, which go into mainframes. And at that point you are essentially forever stuck upgrading from one legacy to another, because what's cool today is going to be legacy tomorrow.

The legacy we're dealing with is going from Mesos to Kubernetes. On Mesos we had a super custom traffic solution, because back when Mesos was created there wasn't really a do-everything traffic solution. In Kubernetes there is one, which is Istio.

Now, how did we decide to go for Istio? That's somewhat interesting. Looking at the age and the problems of the old platform, we didn't want those problems in the new one. That means the custom work for traffic, things like service discovery, routing, load balancing, and figuring out who should talk to whom, is something we didn't want to do custom anymore, because A, it's not necessary, and B, we have only three people working on this, including me. Right now there are two people working on it, which means the more capable the system is out of the box, the better for us.

There's also a slight Kubernetes tentacle that extends into the warehouse. Some of the warehouse systems are operational technology, like PLCs and robotic packing machines. They are very latency sensitive, so you can't run your robotics software in the cloud and your robotics machine in the building, because the cloud is in Ireland, so that's not great.
That means that at some point, luckily for us not right now, our traffic also needs to understand how to get things in the cloud to talk to things that are not in the cloud.

Now, where do we come from? Because that is a very strong driver for which features we definitely need. It's essentially a platform that allows a developer to do lots of things automatically, self-serve. For us to enable that, we have to make sure the facilities that do that exist, and those same facilities have to exist in the destination architecture.

For traffic, those facilities look like this. You have your browser, you go into Cloudflare, you then get into your load balancer at AWS, and that sends your traffic into OpenResty, which is mostly Nginx plus Lua. OpenResty does a bunch of thinking about which microservice a URL actually points to, and it does things like authentication. So that needs to happen at what in Istio is called your ingress gateway. For a developer, it doesn't necessarily matter what the implementation is. We can use Istio: as long as a developer can say "hey, I would like to receive some traffic", and that keeps happening, it doesn't really matter what we use. So if someone says "well, Istio, that sounds very complicated", maybe, but as long as you don't have any direct interaction with it, you shouldn't really have any problems with it.

The typical developer workflow is that you push some code to a repository, it builds an image, the image gets deployed, and at some point you have a URL where you can reach the microservice running from that Docker image.

Now, how do developers tell the system how their microservice should behave? They use a very long Docker label string. If you break it up into a few lines it looks very readable, but in reality it's one very big line that you have to custom-parse and then figure out how to convert those flags into actual settings for, say, NGINX or HAProxy (there's a sketch of what this looked like below). And because it lives in a Dockerfile, the only way to update the configuration is to build a new image and redeploy. If you want to make a configuration change, you now have version 1.1.1.1.1.1, because the only way to get that slight configuration change into the platform is to reconfigure the entire service and redeploy it. So, not great.

What we want is to keep these features. Some of them are somewhat weak, there's no mTLS in Mesos for example, so they're not very secure, but we at least want the ease of deployment and the ease of receiving traffic to remain the same. That means the part where your traffic gets in and somehow ends up at your microservice needs to be nearly identical in Istio. It also means the authentication part, which is tricky because it's entirely custom, needs to be transplanted to Istio, because what's the point of using Istio if you're using Istio plus all your legacy stuff? We really want to get rid of it; we don't have the manpower to support it. So that needs to be maintained and kept as a feature.
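To illustrate the shape of that old label-based configuration, here is a purely hypothetical reconstruction, written out as YAML for readability; the real label keys and format were internal and different:

```yaml
# Hypothetical reconstruction of the legacy configuration. In reality all of
# this was packed into one long Docker label string in the Dockerfile, which
# the platform custom-parsed and translated into NGINX/HAProxy settings.
service-id: basket          # ID used to generate internal URLs for the service
traffic:
  consumer: true            # public website and mobile app traffic
  thirdparty: false         # business-to-business endpoints
  trusted: false            # the (badly named) implicitly trusted flavor
```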
And then we also have the service discovery part. We're currently using HAProxy in conjunction with Consul, which kind of works, but it breaks more often than you would think. So that's not great. Luckily for us, that's a problem that is thoroughly solved: Kubernetes does your service catalog automatically, you don't actually need to do anything for it, it just happens. Even if you weren't using Istio, I think HAProxy would also do a decent job of at least getting traffic to your container. But this is nice, because by default you get Istio and it works. Great.

Now, this part: in your Dockerfile there is an ID, which is used to generate internal URLs for your service, which then get load-balanced across all your instances. One of the developers said, well, this is going to be a pain, because changing to Istio means all of your URLs change too, which is partially true. But compared to the pain of rewriting all your services because your traffic mesh is now completely different, this is a little bit of pain, and we can deal with that. And if everyone agrees that's something we're willing to deal with, we can continue and migrate. We did this whole analysis of the source architecture ahead of time, so that afterwards nobody could complain "oh, you switched to Istio and our service is broken". No, you knew this ahead of time. Informed people tend to be a lot happier with a migration.

Which brings us to the last part of the legacy system. In your Dockerfile we have three different flavors of traffic that you can receive. You have consumer traffic, which is your public website and your mobile app. You have third-party traffic, which is mostly business-to-business traffic: if we order a bunch of shoes, the supplier might say, hey, I've got a bunch of pictures to send you, and we do that in an automated fashion, so we need an endpoint to upload the pictures to. And we have trusted traffic, which is not actually trusted. Someone, I think 15 years ago, thought it was a great idea to call it trusted, which is essentially the opposite of zero trust, because it's implicitly trusted traffic. I didn't agree with that, but keeping things easy for everyone means we also keep using this label for trusted traffic. What you do is you essentially have a Boolean: you set it to true or false, and your service will receive that traffic based on what you configured.

Now, how do we do that in our destination architecture? We need to keep the same developer comfort, but we don't want to build it out of legacy systems. So we take the ingredients we have in our old system and translate them into new ingredients. I'm calling them ingredients because normally you might have a long track where you research the functional and technical requirements, try all of these things, and fly in an army of consultants. We don't have that money and we don't have the people. So we just asked: how can we make a list that we can at least match to something on the very familiar cloud-native landscape? Now, the first time I presented this landscape to people inside the company, they were very scared: what the hell am I supposed to do with this?
Which kind of makes sense, because if you look at it and you don't know what it is, you can't do anything with it. But the nice thing about the landscape is that it's categorized, so you can pick a small portion of it and only use that portion. We already knew it had to be traffic, so we took the traffic category, grabbed a few candidates, and tried them out. A lot of them, including Istio, have good documentation pages and demo applications you can try out immediately, locally; you don't even need to rent a cluster somewhere. You run through that try-and-evaluate cycle a couple of times until you're comfortable with one of them. That doesn't take additional manpower or additional money, only time, and time is something we had. So, you know, use what you have.

At that point we were comfortable with Istio, and we had some idea of what to build and what to use. So we came up with this very large diagram. Don't worry about the diagram; I'm not going to talk about everything it represents. We're just going to cut off the top half, because it's essentially replicated in the other halves. In the middle there are three streams, the three traffic flavors; that's why so many of the boxes are duplicated. And there's the egress gateway, which is there because I insisted on it, but everyone was very angry that services would no longer be able to talk to the outside world directly. So it's there, but it's not enforced.

If we cut off the top part and only look at the traffic path as we want to present it to developers, we have our ingress, which is an AWS ALB. Of course Cloudflare still sits in front of all that, but it's not really relevant here. And we tie the ALB to a Kubernetes Ingress. Why? Because of how the AWS Load Balancer Controller works: if you point it at a plain Service, it creates a network load balancer, and the downside of a network load balancer was that we could not use ACM, the AWS certificate service. We really like ACM, because it automates your certificate renewals and does all of that for you, very cool. But it only worked with application load balancers, and those only get created for Ingress resources. That was a bit of a bummer, because AWS didn't exactly advertise this, so we spent about a day trying to figure out why we kept getting the wrong load balancer. But once you sandwich an Ingress in between, everything works great.

Then we have the standard combination that you will find repeated many times in the docs: you have your Gateway definition and you have your ingress gateway. The ingress gateway is essentially an Envoy proxy configured to accept traffic from the outside world, and it makes some preliminary decisions: is this a domain name we actually know about, and do we want to pass this traffic on to someone else? You use Kubernetes label selectors to tie the Gateway to those Envoy pods. The only non-standard thing is that the ingress gateway Service has to be of type NodePort, because that's the only type that works behind an Ingress. So we chain three not-quite-standard things together, but at least the components themselves are standard. That was a win for us: we only write some configuration, which is fine compared to 20,000 lines of custom code.
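To make that chain concrete, here is a minimal sketch under the assumptions just described; all names, hosts, and the certificate ARN are placeholders rather than our actual configuration:

```yaml
# Minimal sketch of the chain: ALB (via Ingress) -> NodePort Service -> Istio Gateway.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: istio-alb
  namespace: istio-system
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    # TLS terminates on the ALB with an ACM-managed certificate (placeholder ARN):
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:eu-west-1:111111111111:certificate/placeholder
spec:
  ingressClassName: alb            # AWS Load Balancer Controller: an Ingress yields an ALB;
                                   # a plain LoadBalancer Service would have produced an NLB
  defaultBackend:
    service:
      name: istio-ingressgateway   # must be a NodePort Service to sit behind the ALB
      port:
        number: 80
---
# The Istio Gateway selects the ingress gateway pods and decides which hosts to accept.
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: consumer-gateway
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway          # Kubernetes label selector for the Envoy ingress pods
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*.example.com"              # placeholder for the domains we actually serve
```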
Then we tacked on a proxy-wasm module, which is not necessarily advertised as being production ready, but it worked great. I did a talk about that at Rejekts. Even though it has some rough edges, this is what allows us to migrate to Istio, because one of the problems from a few slides ago is our authentication mechanism: it's non-standard. It uses a JSON Web Token, which in theory can be handled by Istio, but in practice our tokens are broken, so Istio cannot handle them. So what do we do? We write a filter, the filter handles our broken token, and then the traffic goes through the normal Envoy path and on to your service. That solves that part for us.

Then you need your VirtualService, because you might have per-service preferences. For example: my service is only reachable on /service/basket, if you have the basket service, and you need to configure that somewhere. So with every service that a developer deploys, we also deploy a VirtualService definition, and using some flags in YAML, which I'll show in a bit, you as a developer can configure: do I want consumer traffic or not? Do I want third-party traffic or not? You're not exposed to the inner workings, but we're using all the components as they were designed to be used, which for us is super great, because it solves all of our problems without our having to write all of this ourselves. And at the very end there's the Service definition, which is used to find all the pods; that's just a standard Kubernetes resource.

These things are all deployed using GitOps, with Terraform and Argo CD, so this whole traffic flow is made up of standard components plus a little bit of custom configuration, and we essentially transplanted the entire old traffic pattern onto Kubernetes. For a developer it means, as the very big letters on the slide say, that you can configure whatever you want using just Booleans.

Except there's this little bit at the bottom. Some of our services are, let's say, not of the level of quality you would like them to be, so we want to prevent unsafe request methods. For example, a service that is supposed to only receive GET requests, because we know someone left some Java actuator open on the back door, and that would be bad, so we want to filter everything else out. How do you do that? We deploy an AuthorizationPolicy per application as well, which means we can decide: if you are a developer, your service accepts (in this case) consumer traffic, and your service is not super safe, we will simply deny every operation that is not a GET. This also immediately makes use of the SPIFFE identity: the principal listed in this element is the service account identity, which is also the SPIFFE identity, of the ingress gateway. As soon as a request comes in through the consumer traffic flow, we apply this authorization policy to it, and we deny it if it's not a GET request. That is also super great, because this used to be another 200 lines of custom Lua somewhere in a gateway that nobody maintains, and now it's a well-supported resource that does exactly what you need.
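As a rough sketch of both halves, here are the developer-facing flags and the kind of policy they generate; the flag names, service names, and the SPIFFE principal are hypothetical, not our actual chart values:

```yaml
# Hypothetical developer-facing flags in the service's Helm values
# (the real keys in our chart differ):
#
#   traffic:
#     consumer: true          # accept public website / app traffic
#     thirdParty: false
#     trusted: false
#     safeMethodsOnly: true   # only GET requests may reach this service
#
# From flags like these, the chart can emit a deny policy such as:
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: basket-safe-methods    # hypothetical service name
  namespace: basket
spec:
  selector:
    matchLabels:
      app: basket
  action: DENY
  rules:
  - from:
    - source:
        # SPIFFE identity (service account) of the consumer ingress gateway;
        # this principal is illustrative:
        principals: ["cluster.local/ns/istio-system/sa/istio-ingressgateway-service-account"]
    to:
    - operation:
        notMethods: ["GET"]    # deny every operation that is not a GET
```

Because this is a DENY policy, anything it does not match continues to be allowed, which is exactly the fail-open evaluation order described next.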
Now, if you are using this, you might wonder: how do you get to this stage? Well, if you look at the documentation (and this graph isn't the highest pixel density), when you work with these authorization policies you might ask: how do they work? How do I make sure I don't write a policy that accidentally blocks something else? There are guidelines in the documentation on the Istio website that say: if you want to make safe authorization policies, order your rules so they essentially fail safe. If you make a mistake in your policy, it shouldn't suddenly block everything; it should fail open. At least, that's how I interpreted it. The documentation is very clear on the suggestions, but it also works if you ignore them.

They also have this very useful diagram that shows you when something is denied or allowed. For us it means that if we only specify a deny policy, a request goes through the entire flow: is there a CUSTOM policy applied? No. Is there a DENY policy that matches? If it came in through trusted traffic, no. Is there an ALLOW policy applied? Also no. So the request is allowed. Instead of having twenty policies saying "we deny this and we allow that", we define only the one policy you actually need, and everything else keeps working. Super great.

Now, we do have a little bit of time left, so there might be something that's interesting for you. We can do a Q&A thing about how we package it; we can look at our Argo CD instance so you can see how the topology is represented visually; we can look at the system and workload repositories we use to segregate the Istio configuration that applies to the entire cluster from the configuration that only applies to your service; or we can look at the proxy-wasm stuff, because that's somewhat cutting edge, and maybe someone's interested in that. And if you want none of that, that's also fine, we can just sit around.

Sounds like we have a request for the Argo CD topic. All right. This is of course always dangerous, because doing a live demonstration means everything needs to work, but since we still have, I think, five minutes, that should be okay. So let me end this presentation and enable screen mirroring, because we're living on the edge, living dangerously. All right, is that big enough?
So, of course it's only exciting if you do it all in production. I actually have a picture that says "we don't test on animals, we test in production". So let's live on the edge: let's go into the VPN and pick the Argo CD thing. What we have is multiple environments; let me make this a little bigger. We essentially have two main projects in our Argo CD configuration: one project is responsible for the services that developers deploy, and another project is responsible for the things that we as a platform team deploy.

If we look at, for example, our Istio stuff: we have the base, which is essentially the Helm chart that Istio provides for a default installation. That lives in a GitHub repository, and Argo CD keeps it in sync, which means that if we want a new version, we bump it in Git and everything gets updated. Usually a minor version bump is painless, which is amazing. So we have a bunch of stuff in here that just gives you all the resources Istio needs; if you've used Istio you're somewhat familiar with them, and if you haven't, let's not go through all the details.

The service side, however, is especially interesting for developers. We have this service called the echo service, which we use to test things, and you have this traffic overview, which gives you a visualization. It's not real-time, so the moving arrows don't represent how fast your traffic is flowing, but when you are a developer wondering "hey, why isn't my service working?", you can very quickly click on the tab and see whether you maybe made an oopsie in your configuration and aren't accepting consumer traffic. That solves a whole bunch of problems, because it used to be that you had to come to the platform team and bug us on Slack, and maybe we wouldn't respond, so you'd come to our desk, and we wouldn't be at the desk because we were working from home. As a developer that would be a very sad situation. Now you just go here and you can see what's happening.

Another thing: we generate well-known names for every service. During your deployment, the Helm chart generates domain names. So when you're wondering what your domain name is, you just go into your desired manifest and it says it right there; it gives you a bunch of hosts, so you know exactly what to type into your browser to reach your service. It also gives you the URLs that we support, and there's a little catch in there. In the match line, I think that's line 26 (sorry, I keep moving away from the microphone), we match exactly on the URL with no trailing slash, and we also match on any URL that has a slash followed by zero or more characters. You have to do both: if you match on just the name and then a wildcard, you can essentially impersonate another service, because the character that comes next doesn't have to be a slash. But you also want the service to be reachable when there's no slash at all, so you need to specify both. This is automatically generated; a developer doesn't need to write it. We as a platform team did, so that they don't run into the same problem we did. You also get your port and your destination, your Kubernetes Service, in there. So as a developer, even though you didn't write all of the YAML yourself, just two flags, a true and a false, this is the result.
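As a sketch of what such a generated VirtualService boils down to, with the service name, hosts, and gateway reference as placeholders:

```yaml
# Hypothetical generated VirtualService; names and hosts are placeholders.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: basket
  namespace: basket
spec:
  hosts:
  - basket.internal.example.com        # generated well-known name
  gateways:
  - istio-system/consumer-gateway      # illustrative gateway reference
  http:
  - match:
    - uri:
        exact: /service/basket         # reachable without a trailing slash
    - uri:
        prefix: /service/basket/       # and anything under it; a bare wildcard on
                                       # /service/basket would also match, say,
                                       # /service/basketball, hence both rules
    route:
    - destination:
        host: basket.basket.svc.cluster.local
        port:
          number: 8080
```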
So if you're just coding away, just committing some stuff, this gets handled for you, and it also gets handled for us by Istio, which is pretty cool. Let's see, do we have more time? Probably not; there's usually someone with a sign. Yeah, three minutes. So if there are any questions, maybe there's time for one question. Otherwise, that's it.

Thank you for the talk. I just wanted to ask: how many developers are you? You said you're a platform team of two, and you're building this abstraction layer to help the developers, so I was wondering how big your development team is, how many developers you have.

We have in total about 120 people in the tech department, and a hundred of them are developers. We have a few support staff and a couple of product owners. Our structure, and this is something that works very well for us and probably also for other companies, is essentially that the developers in a team do the technical stuff, but we also have product members in the same team who are more on the business side, and together they own the entire product from a business perspective. That means you don't have one department telling another department to build something; you have a small group of people with a shared manager and a shared product owner. Usually it's something like two developers, one product person, and maybe someone in between, and they're responsible for maybe two or three services. Divide a hundred developers over that and you know how many teams we have. So that's our format.

All right, let's have a big round of applause for John.