So welcome, everyone, to Istio Day. This is very exciting. My name is Arthur, and this is Baudouin. We are both SREs at Pigment. For those of you who don't know us — we don't blame you, we're a young company. Pigment is a French software company, and we've been using Istio in production since the very beginning of Pigment, four years ago. This talk is not about explaining how Istio works, because we don't fully understand it, but more about how we use it at Pigment.

A brief aside — I'm not going to pitch you Pigment, but what you need to know is that Pigment is a web application that stores user data. Our users can import their data, transform it, visualize it, and make business decisions based on it. So there are zero Pigment users in this room. Well, two, because we dogfood a little bit. That's it.

The way this talk is going to go: we'll start with why we actually set up global request routing to achieve seamless data locality, then how we did it, which is the interesting part, and then we'll conclude with some next steps that we can see in the near future.

So why did we do this? Why do we care about data locality, about where data is? We're all tech people here — we care about where our data is. And the fact is that most organizations are starting to care as well. There's legislation in place, like GDPR and CCPA, that enforces rules on data and how it is managed depending on where it is geographically. So the companies we want to sell Pigment to are starting to care a lot about where their data is, and we need to have a response to that.

When we talked with the users and potential users we wanted to sell Pigment to, what we figured was that there are three things that are important to them. The first one is that they want their data to be stored and processed in a specific region — we call this a location in our implementation. European users want their data to stay in the EU. American users are sometimes OK with the EU, but sometimes they want their data to be in the US, et cetera. So they want to be able to choose. The second thing is that users want a single platform. Users don't want to have to choose EU, US, Canada, whatever, on our login page. They just want to click Login, be logged in, and have it work. So it needs to be seamless somehow. Organizations care about where the data is, but users do not — users just want to use the app; they don't care about what data center they're using. And finally, they care about their data, so we have to protect it. We are stewards of their data, and so security is important.

On top of that, we in Pigment R&D added some requirements. First, this multi-regional, global infrastructure has to be reliable — we're SREs, reliability matters to us. Also, when extending this infrastructure from a single-region application to a multi-region application, we don't want the extension to require downtime. We don't want to break our application even temporarily — "OK, hold on, Pigment is down for two hours because we're migrating." We don't want to do that. Third, we want our infrastructure to be cloud agnostic.
Today we're running on Google Cloud, but we want to be able to extend elsewhere — mostly because some potential users don't like giving us money that we then give to Google, because they're also working in advertising or something like that. So we want to be able to move or extend to other cloud providers later, which is why we use Kubernetes, and which is why we use Istio. It's really nice to manage your own stack. And finally, we don't want to compromise our productivity as engineers because we're managing multiple regions. We don't want to deploy more slowly, and we don't want more operational load. These are the things that are important to us. And now I'll pass to Baudouin, who's going to talk about how we did it.

So, at a very high level, Pigment's infrastructure is a quite common pattern. We're hosted on Kubernetes. We have a set of databases, a set of backend services, and gateways managed by Istio. That's pretty much it. With this project, what we wanted was the exact same infrastructure but in the US: a separate set of services, a separate set of databases.

And so came the question of the user experience. What we could have done, and didn't want to do, was put each infrastructure behind a different hostname. That's the quickest option, but as Arthur mentioned, we wanted something seamless for users, so we did not go that way. Instead, we decided to route users to the correct location at the gateway level, leveraging Istio. American users use the same hostname as European users, and the gateway redirects their requests to the correct location. Pretty simple, and this setup works as is. Perfect. One small detail, however: for American users who want to access the US location, that would mean an unneeded back and forth. Not a big issue — with GeoDNS, they resolve to the closest location, and that's fine. And in all the other scenarios, they still get routed to the correct location. So, no problem here.

So now comes the question: how do we do that with Istio? For that, we leverage virtual services. Our gateways serve two hostnames: one public, used by users, and one internal, used by the gateways to route to each other. The gateway first checks for the presence of an HTTP header or query parameter to determine whether the request it's processing is meant for the current location or another one. If it's meant for another location, it sends that traffic to the gateway of the other location. If the request is meant for this location, other matching rules are used, and the request is delegated to another Istio virtual service that handles the routing and all the details of reaching the correct endpoint of our services.

Overall, this looks great; it's simple. But there is a downside: it's a lot of virtual services to maintain, update, and keep track of. We have a lot of services and lots of environments, and this number is increasing every day. It's error-prone and tedious, especially for developers. To tackle that, we decided to generate the virtual services from a configuration file listing the endpoints for each service. So we don't have to modify hundreds of files by hand anymore — it's much quicker and less error-prone, and devs are much happier.
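To make that concrete, here's a minimal sketch of what one of these gateway-level routing virtual services could look like. This is a hedged reconstruction, not the actual config from the talk: the hostnames, gateway name, and delegate name are assumptions; the x-pigment-location header and the production-eu1 value are the ones mentioned later on.

```yaml
# Hypothetical sketch of a gateway-level routing virtual service (US gateway).
# Hostnames, gateway, and delegate names are assumptions.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: global-routing
  namespace: istio-system
spec:
  hosts:
    - pigment.app                 # public hostname, used by users
    - gw.us.pigment.internal      # internal hostname, used by the other gateways
  gateways:
    - istio-system/public-gateway
  http:
    # Request tagged for the EU location: forward it to the EU gateway
    # (reached through a ServiceEntry for the remote gateway, not shown).
    - match:
        - headers:
            x-pigment-location:
              exact: production-eu1
      route:
        - destination:
            host: gw.eu.pigment.internal
    # Otherwise the request is for this location: hand off to the local
    # virtual service that routes to the correct backend endpoints.
    - delegate:
        name: local-routing
        namespace: istio-system
```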
So everything looks good — but is it enough? Not exactly. There are a few edge cases we still have to address. The first one is global user authentication with that pattern: since an unauthenticated user can log in from anywhere, we have to be able to do this login and find their location from anywhere too. For that, we went for the simplest option: we share the database for authentication. It's hosted in Europe, with simple caching at the application layer for performance. So users can log in from any location, each company's data stays on its own infrastructure, and this is the only piece of data that is shared across locations. And I'll let Arthur talk about the second edge case.

Thanks. So now let's talk about requests from third parties, because this is where it gets exotic. As Baudouin mentioned, if we have a request coming in from our front end, the front end has authenticated the user, so it knows where that user's data is, and it can set a header that our virtual services use to route correctly. The issue is with requests that don't come from us. These are, for instance, import calls that come from our users' systems, or OAuth callbacks that come from whatever single sign-on provider they have, or webhooks — we have a Slack integration, for instance, so we get webhooks from Slack. Most of these third parties don't allow setting request headers. We can't configure them to say, "for this rule, if it's this Slack channel, please set this header to production-eu1." They're not going to do that. And requiring this header would break API compatibility for our own users, those who have already been using our import API. We don't want to send them an email saying, "OK, you have one month to start adding this header, or your imports are going to break." We don't want to do that.

So what we did is use an Envoy middleware. And this is where Istio comes in, because none of what we've been talking about until now is very Istio-specific, and now we're getting into it. Basically, when a request comes in, the middleware runs, and its job is to determine the target location of that request. It can do so by communicating with the back end to get information. Once it has determined the target location, it sets the header, our routing logic from before kicks in, and the request is proxied to the correct gateway.

To inject a middleware in Istio, we use an EnvoyFilter custom resource. I'm not going to go into the details, but basically you tell it where in Envoy's filter chain you want to inject the middleware, and then you just inline the code at the bottom. That's pretty much it.

So what does this middleware look like? First things first, it's written in Lua, which is a pretty high-level language and surprisingly fast — we were really impressed by the performance here. To write a middleware, you need to implement a handler; this is the envoy_on_request function. Our handler determines whether it actually needs to do anything, and if it doesn't, it returns early. Then it communicates with the back end to determine the target location for the request it's processing. Finally, once it has that target location, it sets the x-pigment-location header so that our virtual services can do their job.
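For illustration, here's roughly what such an EnvoyFilter could look like. Again, a hedged sketch rather than the actual filter: the cluster name, the resolver endpoint path, and the workload selector are assumptions, and the Lua body is trimmed down to the handler skeleton just described.

```yaml
# Hypothetical sketch of the EnvoyFilter; cluster name, endpoint path, and
# selector are assumptions, and the Lua is trimmed to the gist.
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: location-resolver
  namespace: istio-system
spec:
  workloadSelector:
    labels:
      istio: ingressgateway
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: GATEWAY
        listener:
          filterChain:
            filter:
              name: envoy.filters.network.http_connection_manager
              subFilter:
                name: envoy.filters.http.router
      patch:
        operation: INSERT_BEFORE
        value:
          name: envoy.filters.http.lua
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
            inlineCode: |
              function envoy_on_request(handle)
                -- Early return: the front end already told us where to go.
                if handle:headers():get("x-pigment-location") ~= nil then
                  return
                end
                -- Ask the backend endpoint with global knowledge where this
                -- request belongs (request metadata elided in this sketch).
                local resp_headers, resp_body = handle:httpCall(
                  "outbound|443||backend.pigment.internal",       -- assumed cluster
                  {
                    [":method"]    = "POST",
                    [":path"]      = "/internal/resolve-location", -- assumed path
                    [":authority"] = "backend.pigment.internal",
                  },
                  "", 1000)
                -- Tag the request so the virtual services can route it.
                handle:headers():add("x-pigment-location", resp_body)
              end
```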
So how does the middleware know the target location? It doesn't, on its own — if it gets a request with an API key in it, it doesn't have access to our database of API keys. The way we do it is we take all of the request's metadata and data — headers, query string, body — and send all of that to a specific endpoint in our back end, asking: hey, where is this request supposed to go? This special endpoint has global knowledge: it knows about all API keys in Pigment EU and Pigment US. The back end can look at the headers, parse them, look at the query string, parse the body, make database calls, and then return a target location.

Now, if there are any security people among you, you may have noticed that I'm sending the body to the back end, which means I'm buffering it: I'm reading the entire body from the caller, keeping it in memory, sending it to the back end, and only once I have a response from the back end can I actually forward that body to the correct location. So I'm keeping this body in memory — and if an attacker sends me a body that's one terabyte big, my gateway is not going to like that. So we actually need to put a cap on the size of the body we're willing to read (I'll come back to this with a sketch in a moment). This is where Envoy is really great: you can just read the Content-Length header to determine the size of the body, and if the caller lied about the size of the body in the header, Envoy will kill the connection the instant the caller sends too many bytes. This is amazing. It's actually really easy to do, but you've got to think about it. And if the body's too big, we just don't send it to the back end, which is enough for our use case, because the only time we need the body is for Slack callbacks, and that body is like 200 bytes.

So how do you test something like this? The fact is, this middleware integrates heavily with Envoy, with our virtual services logic, and with our back end, so unit testing is pretty hard. What we opted for was black-box, end-to-end testing. We have a few agents in different places in the world that, at regular intervals, send requests to our system, to a specific endpoint in our back end whose code basically just responds with that back end's location. That way we can send a battery of requests and check where they're landing: are they landing in Europe? Are they landing in the US? Are they landing in the correct place? For instance, in this example, the top API key belongs to an EU organization, so it should be routed to the EU, and the bottom API key belongs to a US organization, so it should be routed to the US. We run this continuously, and the idea is that if the middleware ever breaks, in production or before production, we get an alert, so we know. Basically, if we want to change the middleware, we deploy it to staging, wait for ten minutes, and if we didn't get a message on Slack, we're good.
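Going back to the body-size cap for a second, here's roughly what that guard could look like inside the Lua handler. Another hedged sketch: the 1 MB limit and the resolve_location helper are invented for illustration.

```lua
-- Hypothetical sketch of the body-size guard; the cap and the
-- resolve_location helper are invented for illustration.
local MAX_BODY_BYTES = 1024 * 1024

function envoy_on_request(handle)
  -- Content-Length is safe to trust as an upper bound: if the caller lies
  -- and sends more bytes, Envoy kills the connection.
  local declared = tonumber(handle:headers():get("content-length") or "0")
  if declared > MAX_BODY_BYTES then
    -- Too big: resolve the location from metadata only, without the body.
    -- (The only flow that needs the body, Slack callbacks, is ~200 bytes.)
    resolve_location(handle, nil)
    return
  end
  -- Small enough: buffer the whole body so it can be sent to the backend
  -- and, once the target location is known, forwarded to the right gateway.
  local body = handle:body()
  local bytes = body and body:getBytes(0, body:length()) or ""
  resolve_location(handle, bytes)
end
```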
So at this point, we have the basic logic with the virtual services, and we've handled the edge cases of authentication and third-party requests. And we've met our requirements. Our user data is stored and processed in specific locations: we have separate databases and separate back-end instances. We have a single platform with single authentication: users just go to pigment.app and it works. And there's seamless routing: our EU users didn't even know when we made the US platform go live. They didn't need to know — it just worked. Some of them actually saw decreased latency; they were like, "wait, things are faster now." And regarding security, we encrypt the data in transit between gateways, and that was important to us. We also have regular pen tests and a bug bounty program — so now we have 200 more people who know how this works. If you find a bug, please call us; there's money on the line if you want it.

Regarding our internal requirements: the extended infrastructure is very reliable — we've had no issues so far. We were able to add the US infrastructure piece by piece, so this actually required no downtime on the EU infrastructure: we first deployed the cluster, then the gateway, then configured DNS, et cetera. The extended infrastructure is cloud agnostic: we use no cloud-provider-specific functionality here, it's just pure Istio config. And deploying to multiple locations is seamless — I'll let Baudouin go into more detail about this seamlessness.

Yes. So Arthur mentioned some non-technical requirements that we had to address. One was that this had to be really seamless and not impact the rest of Pigment R&D — and it's the case. First, for back-end services: most of our back-end services are not aware of the concept of location; they don't need it to do their job. And if we want to add a new location, we don't need any code changes, and there is no downtime, as Arthur mentioned. So pretty simple on that front. Also, for developers, we did not want to decrease developer velocity — especially important in a growing startup. Our CI/CD tests and deploys to all locations in parallel, so this did not add any complexity to developers' jobs. We also made sure to add location tags to dashboards and alerts, so that developers, especially on call, have the necessary information for troubleshooting. And there are also the SREs to think about — after all, SREs put the system in place. The cross-location routing can be bypassed easily for testing, thanks to the internal hostnames. Thanks to the black-box testing that Arthur mentioned, we're alerted when the setup breaks, even partially. And we've had no major outage due to this system in production since the go-live. So, pretty exciting for us.

And even though everything turned out great, we still learned some lessons and made some mistakes along the way. A couple of them. First, we managed to self-DoS ourselves: we pushed a faulty manual configuration that created a network loop between our two gateways, and they just crashed. Thankfully, it was in staging. And since it was due to a manual config error, and we now generate this configuration, as long as the logic in the code is good, this should not happen again. Second, we used the virtual service delegation pattern, where a parent virtual service delegates the rest of the processing to a delegate virtual service, and we found out that for our use case it was perhaps not such a great fit. It's a pattern that can be error-prone, because it's easy to have a mismatch between the parent and the delegate virtual service. Say you're matching traffic on a certain prefix at the parent level, and a developer modifies just one character so that the delegate matches a slightly different prefix: traffic gets dropped. You have to keep that in mind when you use that pattern.
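To make the pitfall concrete, here's a hedged example of the mismatch — all names and prefixes are invented. The parent matches /api/ but the delegate only matches /apis/, so every request the parent hands off finds no rule in the delegate and is dropped.

```yaml
# Hypothetical example of the parent/delegate mismatch; names and
# prefixes are invented.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: parent
spec:
  hosts:
    - pigment.app
  gateways:
    - istio-system/public-gateway
  http:
    - match:
        - uri:
            prefix: /api/    # parent matches /api/ ...
      delegate:
        name: api-routes
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api-routes
spec:
  # Delegate virtual services have no hosts or gateways of their own.
  http:
    - match:
        - uri:
            prefix: /apis/   # ... but the delegate matches /apis/:
      route:                 # one character off, and traffic is dropped.
        - destination:
            host: backend.default.svc.cluster.local
```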
Something else: there is only one level of nesting — a delegate virtual service cannot delegate to a third virtual service. For our use case, and the way we picture our infrastructure, we would have liked to really separate the virtual services for location routing, for back-end service routing, et cetera. That's not possible, so it's a feature we'd have liked to be able to leverage, and we're thinking of going back to one big generated virtual service instead of a lot of them.

So we're getting close to the end of this talk. We're happy to be able to share this experience with peers like all of you. But the question now, looking to the future: this works, it's been in production for nine months, we haven't had any major issues, and we've kind of moved on to other stuff because the company keeps growing. But regarding this setup, what's next? What do we want to do in the near future?

The first thing is that we want to replace the authentication database. As Baudouin said, it's currently hosted in the EU, and the US authentication service uses that database directly. The issue here is one of reliability: if the EU region goes down, the US region doesn't work anymore. The opposite isn't true — if the US region goes down, the EU region keeps working. We'd like to split these failure domains, so we'd like a global, highly available database for authentication.

We'd also like more tests on the middleware. Not because we think it's buggy — for now it works, right? — but because as we add more third-party integrations into our system, we keep adding more endpoints to the middleware (I only showed the gist of the code), and it handles some endpoints slightly differently. The thing is, today the feedback loop is basically: deploy the change to staging, wait a bit, and if it works, you're fine. But if it doesn't work, the feedback loop is pretty slow. So we'd like some form of unit or integration testing in our CI so that we can quickly check whether something works.

We'd also like to extend this setup to other locations on the same cloud provider — a different country — or to different cloud providers. As we said, we made the conscious choice of making this setup cloud agnostic, and we'd like to put that to the test. That would be pretty exciting, and we're eager to do it. It's not in the immediate future, but we can see it coming in the next couple of years.

And finally, we'd like more cross-location features in the Pigment app itself. Today there are a few global features — the main one is authentication — but there are other features in Pigment that would make sense as global, and we're working in collaboration with our dear back-end engineers to actually make this work. The fact is, our network setup isn't perfect: in this setup, services can send each other HTTP calls via the gateways, but currently they can't send gRPC calls — and the back end mostly communicates using gRPC. So this is an issue, and we need to keep working on improving Pigment here.
So thank you for your attention. And if you want to join us, we are hiring. Thank you.

All righty. We've got a few minutes for questions as well. So if anybody has questions, go ahead and raise your hand and I'll bring the mic over. Sorry, I've crossed the room here.

So it seems like you're kind of duplicating — you can correct me if I'm wrong — what I understood is that you're duplicating the whole infrastructure for the US, and you're redirecting requests just at the gateway level. In the whole process, have you thought about running some microservices centrally? We do these things because we have PII data and these kinds of regulations. So have you thought of carving out a few microservices that have no relation to storing PII data, processing those centrally, and restricting the region-based setup to where the data is stored or processed?

That's a good question. We thought about it. In our case, it's not really PII data, it's business data — and it just so happens that sometimes there's PII data in it and we don't know about it. Sometimes users put PII data in our system, so we kind of need to assume that everything is sensitive. Today, I'd say more than half of the Pigment app itself is about processing this data — fundamentally, it's a data processing application. So I don't think it would make sense for us to extract, say, a third of our microservices and consider that these services aren't sensitive and can be separate. I think we kind of like having this sort of monolithic infrastructure. Now we have to, I guess. I mean, we considered it, and I think it's just a matter of preference that we went this way.

Is there a question over here? More questions?

I was wondering if you could clarify, with Istio for the gateway, where does it overlap, or to what degree are you dependent on Istio for what you're offering? And then, what was really cool about Solo that you opted to use it to extend what you're doing? Gateway-wise, I mean.

Yeah, a couple of things. The first thing is that we mostly use Istio for the service mesh; we just happened to already have the gateways and Envoy capabilities at our disposal when we started work on this project. And to answer your question about Solo: we don't use Solo.

Any other questions? Otherwise, I have one quick one while I'm walking over: for the middleware, you implemented that in Lua — did you look at Wasm, or was there any kind of debate or decision between those two?

There was a discussion. We opted for Lua initially because the middleware isn't that big — I only showed the gist of it, but overall it's maybe 300 lines of Lua. So it's not a big thing, and we figured a scripting language was fine. But we may consider Wasm for a more complex use case later.

Okay, thanks for sharing. If I understood it, you have two regions, the US and Europe. Have you thought about scaling this up? I mean, imagine you have 20 or 30 points of presence — for example, zones in Germany, France, the UK, several in Asia. Have you thought about scaling this infrastructure? It's expensive to keep the full set of your applications in each zone, yeah?
And then you have to create something, you know, to put some services in Germany, some services in France, some in the UK, and you have to split the traffic to route it across these regions, across these zones. Have you thought about this — when you have not two regions, but a lot of them?

We have thought about it — how does this scale, basically. What we're mostly worried about is the operational load of actually operating all these clusters. From a financial, cost standpoint, it's not really an issue: the core of our cost is our databases, and if we deploy a cluster and there's nobody on it, it doesn't cost very much relative to the others.

Okay, thanks. Hey, hello. You mentioned having issues with delegated virtual services. Have you considered using the Gateway API — and if so, what are the pros and cons?

We haven't, for a simple reason: we've been using Istio for four years now, so we already had the gateways managed through Istio, and when we started the project we didn't really want to take on that migration. It's something we may revisit sometime in the future, but as of now, it works. That said, we definitely see a lot of work being done there.

I'll just add one thing. As Baudouin said, we already had virtual services everywhere, and we only needed to add a little bit of functionality to them, so we didn't figure moving to the Gateway API was what we wanted to do — it seemed like a big thing for just this small feature. But we are definitely keeping an eye on it. It is probably something we'll do relatively soon.