Hello, everyone. Thank you so much for coming to our session. The room is so big it makes me nervous. Please go easy on us, okay? We took a very long flight, a 20-hour flight from Indonesia, all the way down here. We're super excited to be here. Today, we will share our service mesh migration story at our company, GoPay. At GoPay's scale, the business supports billions of dollars in transactions, and it's growing very rapidly right now in Southeast Asia. My name is Giri, and here's my colleague Imre. Both of us are infrastructure engineers at GoPay. Since today we're going to be talking about a GoPay case study, I'm going to share a little bit of the background and context: what GoPay is about, and what scale we are operating at. GoPay started as a startup back in 2015. Then we grew very rapidly, became one of the unicorns in Southeast Asia, and right now we are the largest digital payment platform in Southeast Asia. We operate in multiple countries in the region, and we have had the largest monthly active user base in Indonesia since late 2017. Fun fact: Indonesia is the fourth largest country in the world. We have 270 million people, and 66% of the population is unbanked, meaning 66% of the population does not have access to a bank account, debit cards, or credit cards. We are accepted at more than 700,000 online and offline merchants, including various shopping malls and street food vendors, as well as the Google Play Store and Netflix. We have integrations with more than 28 financial institutions, covering banking, loan services, and bills, and just last week we added an integration with Apple Pay. In GoPay Engineering, we have around 230 developers spread across 30 autonomous teams. The word autonomy here means it's kind of like having 30 different startups under one umbrella; this is how our engineering model operates.
The side effect is that we can maintain high delivery throughput for the product teams, but the resources are very scattered. We maintain around 30 Kubernetes clusters across testing, staging, and production environments, and we have more than 3,000 deployments per week to all these clusters. We started moving from virtual machine-based infrastructure to container-based infrastructure on top of Kubernetes in late 2018. It took us almost two years to migrate 100% of our stateless services to Kubernetes. That is around 1,000 applications that serve GoPay, including services, workers, and cron jobs. We shared our lessons from this Kubernetes migration back at KubeCon 2019; the link is attached here. After completing the Kubernetes migration, the next thing we wanted to do was adopt a service mesh. We had been running an Envoy proxy alongside each of our services since the virtual machine-based infrastructure, and we also built our own Envoy control plane in-house based on the xDS API. However, because of stability and maintainability issues in our in-house mesh solution as we grew and added more services to our system, especially on Kubernetes, this became one of the major tech debts in the organization. So we decided to migrate to a more mature solution with community support, and we chose Istio. We also shared the early migration strategy last year, virtually, at KubeCon EU, where we had to maintain a mixed state of infrastructure between Istio workloads and our in-house mesh solution. After one year of running this migration process, we were only able to onboard less than 2% of our services onto the Istio infrastructure. This was far below our expectations. So, what did we do wrong? We gathered the lessons learned from these past migration activities, and we wanted to craft a better migration plan for Istio. What are those lessons? First lesson: there was a very steep learning curve for our developers.
They had just gotten the hang of the Kubernetes concepts: pods, deployments, replica sets, and so on. And then we introduced another new infrastructure thing they had to learn, Istio. They had to learn what a VirtualService object is, what a DestinationRule object is, and so many other things. And developers are not used to touching the infrastructure layer; most of them just want to focus on their product feature delivery. There was also friction in keeping our standardized templates up to date. These templates are used by developers to help them deploy their applications to our infrastructure. We were also very new to Istio, right? We learned as we went, and as we found new things, we added more capabilities and more configuration to the standardized templates. So we had to ask a lot from the developers: bump the Helm chart version, introduce more configuration to the application, and things like that. And our mistake in the past was that we did not set Istio deployment as the new default approach; it was not the default way to deploy applications. Because of the friction and the steep learning curve, developers stuck to the existing deployment methods they were comfortable with, so the Istio migration progressed very, very slowly. We also realized there was a very high migration overhead in our process and strategy, because the business keeps growing very rapidly. There are exponentially more services and edges; new services appear every month, every week, and it's very hard to catch up with the product teams. And as we found more use cases during the migration, the new use cases increased the complexity of our migration process. As I mentioned, we also had to maintain a mixed state of infrastructure: some workloads were running on the Istio infrastructure, while some were still running on our in-house mesh solution. This added complexity.
Another thing I want to mention: there was very unclear ownership of services across those 30 different startups, the teams. When we started planning the migration, it took us a lot of time to figure out who owned what. As an infrastructure engineering team, we found it difficult to help them out; when we faced blockers in certain services, it was very hard to figure out who owned those services so we could help. Because of this, we got pushback from the developers, and we started losing their trust. What developers really wanted was to leverage the power of Istio and reduce the organization's tech debt through the migration we had started, while spending most of their time, 90% of it, progressing on feature delivery. But in reality, developers were only able to spend 10% of their time progressing on their features. Most of the time, we were asking them to do a lot of things, 15 different tasks, in order to migrate to the Istio infrastructure: switching the Helm chart version or chart type, injecting new configuration, bumping the client library version, and many other things. So in order to do a better migration and be successful with this Istio migration process, we realized we needed to fix all of this migration overhead. First, we should develop an abstraction over the migrating infrastructure, so the migration process becomes much simpler for developers, with fewer tasks to ask of them. This abstraction could make the current migration to the Istio infrastructure much easier. We also needed to avoid leaky abstractions, where developers still need to be aware of tiny details, like tuning the standardized deployment templates or injecting specific config values into the app, as we used to have.
The abstraction can also make future migrations easier, not just the Istio infrastructure migration but any future infrastructure migration we might have, so we don't repeat the same mistakes. Let me give you an example of the kind of abstraction we used to have at GoPay. When we want to deploy an application to Kubernetes and Istio, we have to create various manifests or config files for different objects, right? For example, there will be Deployment objects, Service objects, ConfigMap objects, and for Istio there will be VirtualService objects, DestinationRules, and so on. And these objects can vary a little bit depending on the environment: production, canary, or staging. Obviously, we won't let the developers write their own manifests, right? So as infrastructure engineers, we created a bunch of standardized templates, and we chose Helm for this use case to render the manifests. The Helm chart templates can differ per programming language in our organization and per type of service. So we have a Helm chart for Java, a Helm chart for Golang, Helm charts for services and for workers, and for Istio we created yet another new Helm chart template. These Helm templates are technically abstractions over the Kubernetes and Istio resources, because developers don't have to write the Kubernetes manifests themselves; the Helm templates render the manifests for them. However, these are leaky abstractions, because developers still need to be aware of which Helm chart to use and provide specific values for running their application using Helm's set commands and the like. And sometimes the arguments get very long, and developers can easily get confused about which arguments to provide. So what's a better abstraction? A better abstraction would be to provide a common developer interface, which is much simpler and easier to learn.
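To make this concrete, here is a rough sketch of the kind of manifest set those Helm templates have to render for a single service. This is a simplified, hypothetical example; the service name, image, and port are made up for illustration, not our actual configuration:

```yaml
# Hypothetical manifests for one service, "app-b", greatly simplified.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-b
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app-b
  template:
    metadata:
      labels:
        app: app-b
    spec:
      containers:
        - name: app-b
          image: registry.example.com/app-b:1.0.0   # made-up registry and tag
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: app-b
spec:
  selector:
    app: app-b
  ports:
    - port: 8080
---
# The Istio-specific objects developers would otherwise also have to learn.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: app-b
spec:
  hosts:
    - app-b
  http:
    - route:
        - destination:
            host: app-b
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: app-b
spec:
  host: app-b
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL   # mesh-internal mTLS between sidecars
```

Multiply this by per-language and per-environment variations, and the appeal of rendering it all from one standardized template set becomes obvious.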
Through this interface, which can be a developer portal or a common CLI tool, we can drive the Helm templates behind the scenes, in the back end of the system, to generate the required Kubernetes and Istio manifests. As time goes on, if we want to add capabilities to the Helm templates or introduce another big thing in our infrastructure, we no longer need to rely on developers to bump the Helm chart or provide specific arguments. Developers don't have to care, as they only need to interact with the common interface, which stays the same on a daily basis. We then used this abstraction to fix our migration overhead, by standardizing on the 90% of use cases first: we focus on the common use cases first and migrate to Istio under this new abstraction layer. Now, Imre will share how we built our internal developer platform to abstract the migrating Istio infrastructure.

Okay. Hello, everyone. So now let's talk about how we created an abstraction over Istio on top of our developer platform. The platform, which we call GoPaySH, tries to solve four main problems. The first is clear service ownership. The second is how we abstract the deployment from the developer's point of view. The third is how we make Istio enabled by default. And the last one is enabling third-party integration, so that other teams that want to integrate their services can also use our system. One of the old problems we had was that when you ran kubectl inside a cluster, you simply had no idea who owned what: who owns these deployments, services, config maps, and so on. We had been trying to solve this problem by separating teams by namespaces, but it didn't really work, because our legacy deployment tools gave developers the ability to deploy to any namespace. So at that point, we didn't really have control over where they deployed their applications.
So with GoPaySH, what we do is that when developers onboard an application, we immediately assign the application to a specific team; we will discuss this more later. The second strategy is that when the application is deployed, we deploy it into a namespace we have already prepared for them. This is a completely new namespace, and by doing this, developers have no way to change where the deployment goes, because the namespace is provided for them when they deploy. We also have a feature for when a team wants to transfer ownership of a service, which usually happens when we go through organizational changes; ownership might be transferred from one team to another.

Okay, as Giri mentioned, another headache we had was related to Helm charts. As the company scaled up, everything needed to move very fast, and as a side effect, a lot of Helm charts were created by developers in many different ways for every language. You can imagine how many charts we had. Unfortunately, the past Istio migration also added an unnecessary number of charts, and when Giri mentioned that the initial adoption of Istio was very low, this pile of unmaintainable charts was one of the reasons why. The next problem is that there was also no clear ownership of all these charts, because at that point developers could add anything to a chart. They just added whatever they needed and then left it. This caused a lot of issues, because the charts were not standardized: some Helm charts enforced liveness and readiness probes, while others made them optional. The bigger problem we had was the deprecation of one of the Kubernetes API versions for the Deployment object. This blocked our cluster upgrades for months, because at that point we really had to tell developers: hey, you need to update the version of the chart you are using.
And even after they upgraded, there were a lot of issues, because sometimes they didn't know the new chart version contained major breaking changes, and so on. So there were a lot of problems at this point. This is an example of our legacy deployment tooling. As you can see, there is a lot of infra-related information: the name of the chart, the version, the target cluster, the name of the application config from our configuration server, and a bunch of Helm overriding arguments, basically to set everything up, like setting the image, enabling the Kubernetes service, or even enabling Istio. What we do now with our portal is that developers can configure the deployment configuration, like CPU and memory requests and limits, configure the replication config, such as whether they want to enable autoscaling, and specify the run and migration commands to be executed when the container starts. This data is no longer stored locally in the repository; it is stored centrally in our database. By having this kind of system, we can change things from our end when we need to, without having to tell developers to update something; to apply the changes, we basically just need to re-trigger their deployment. And this is pretty much how we abstract the deployment away from the developer's point of view, using the information the developer gave us earlier. Now we only have a small number of charts to maintain, pretty much around three: one for gRPC, HTTP, and worker workloads, one for projects, and another one for Istio-related resources like VirtualService, DestinationRule, and so on. To trigger a deployment, developers now use our new simplified CLI, which I'm going to show you later. It tells us the deployment context: whether they want to deploy to canary, production, or staging, or even whether they want to roll back.
By knowing this context, we use all of this information, the Helm chart, the configuration, and the deployment context, to construct all of the Kubernetes resources we will deploy. We put them in our GitOps repo so that Argo CD can later sync them to the clusters. Now I'm going to talk about why we needed to adopt a service mesh. The majority of our internal services are gRPC, so client-side load balancing and service discovery are really important to us. We had been using the xDS API to build our in-house Envoy control plane for a while. However, this solution caused a lot of problems because of stability and maintainability issues. So we came to the point where we agreed we wanted to deprecate it and replace it with a mature technology like Istio. Before adopting Istio, we relied heavily on Kubernetes load balancing for canary deployments. Unfortunately, with this approach we didn't have much control over how much traffic should go to the canary, so we also wanted to utilize this kind of feature from Istio. And last but not least, we wanted to make the code base more lightweight by delegating some of the networking concerns to the proxy, like retries, setting up mTLS, and configuring timeouts, idle timeouts, and so on. So in order to adopt the service mesh, our strategy borrows a concept from Melanie Cebula's talk about migration at scale at Airbnb, where she said: make the new approach the default. So every service deployed with GoPaySH is deployed automatically into an Istio-enabled namespace. Also, when a developer creates a new application, the application is generated by our generator, and all of the code there, like the GitLab CI config and the starter HTTP or gRPC server code, is ready to be deployed into our Istio namespace.
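The fine-grained traffic control we wanted from Istio looks roughly like this: a weighted VirtualService plus a DestinationRule that defines stable and canary subsets. This is a hedged sketch with hypothetical names and labels, not our exact production config:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: app-b
spec:
  hosts:
    - app-b
  http:
    - route:
        - destination:
            host: app-b
            subset: stable
          weight: 95
        - destination:
            host: app-b
            subset: canary
          weight: 5          # an exact canary share, independent of replica counts
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: app-b
spec:
  host: app-b
  subsets:
    - name: stable
      labels:
        track: stable        # hypothetical pod label distinguishing the tracks
    - name: canary
      labels:
        track: canary
```

With plain Kubernetes Service load balancing, the canary share is roughly dictated by how many canary replicas you run; the Istio weights decouple traffic share from replica count.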
And this is a sample of the UI showing how developers create an application or onboard an existing one. They go through several steps, like defining the name, defining the owner, defining whether they want to create a service, and so on. After that, we generate the code base for them so they can use it directly going forward. And this is how we do deployments now. I mentioned the simplified CLI; this is it. It now accepts only a few arguments, like the release name, the cluster, the environment, and the image, plus some optional arguments, like a canary rollback if they want one. And for every stage of the deployment, we configure several things differently. For example, we configure the virtual service programmatically when they deploy to canary: we set the canary weight to five percent, and when they roll out to production, we turn the canary weight down to zero and set stable to 100. All of this is abstracted away from the developer's point of view, so they don't really know what happens after they trigger the CLI. So now I'm going to talk about third-party integration. The reason we need this is that all of the infrastructure teams are currently trying to provide their own solutions. For instance, we have a platform named Burrito just for logging; the management of this logging is done in a separate UI. There is also a team which provides public and private DNS generation, and they have yet another way of doing that. There is a team that provides databases, stateful components, and they use Terraform in a different way, something like that. So the question is: how do we integrate with these other systems? We split the integrations into two different types. The first one we call add-on integration; this type of integration is meant to enrich a service's capabilities.
Let's say you want to create a database or something like that. We enable this by using the Open Service Broker spec to integrate with add-on providers. The add-on provider needs to comply with an API contract covering provisioning, updating, and deprovisioning. When we receive a call via our portal, we propagate it to the relevant add-on provider so they can start their operation. Not only that, the add-on provider can also modify the service's behavior if needed, by returning some information to us. Let's say that in order to connect to the database that was created, we need secrets, right? We will then expose that secret as an environment variable, or even create a Kubernetes secret, so the service can use it at runtime. The other type of integration is inspired by Backstage, and we call it UI plugin integration. The problem is that not every integration enriches a single service's capabilities; some integrations affect global configuration, like setting up mTLS for a domain, or even showing infrastructure cost. For these use cases, we use UI plugin integration, so other teams can provide their own UI components to be plugged into our system. So now we are reaching the last section of this talk, where we explain how we re-rolled out Istio to developers. Okay, this is interesting. Based on our experience, it is impossible to ask developers to create Kubernetes resources by themselves, even with Helm; we had a bad experience with that. So we needed automation to simplify the migration for them, and we decided to do it programmatically. How do we do that? For an existing service, we give developers a form. They have to specify the release name, the namespace where they deploy the application, the location of the cluster, and whether it is gRPC or HTTP.
After that, we call the Kubernetes API to gather all of the infra-related information, like CPU and memory requests, the run command, the migration command, and so on. We then show it to them, the developers review it, and once they say okay, we save it as our metadata. Once the metadata is onboarded, we generate the GitLab CI script they can use to do deployments. And once a deployment is triggered, we create several things, like the Istio resources, using the new charts we maintain ourselves, and so on. And indeed, at this point there are actually two deployments running: the first one is the old one, and the new one is the one deployed by GoPaySH, which is in an Istio-enabled namespace. After this, Giri will tell you how we migrate clients from the old service to the new one.

Okay, all right. So during the Istio rollout process, there are three use cases for communication with a service that has been migrated to Istio, because it still has to be able to receive calls from clients that can be anywhere, right? The first use case, the most straightforward one, is where the client is calling the service from within the Istio mesh; the client has already been migrated to Istio. In this case, client application A is calling server application B inside the Istio mesh. This is pod-to-pod communication. Application A simply calls application B through its Kubernetes service endpoint, for example app-b.namespace.svc.cluster.local, which is registered in the hosts field of virtual service B, so they can talk to each other. The second use case is when the client is coming from outside of the Istio mesh. As Imre mentioned, we have a lot of gRPC services. Technically, a client application A from outside of the Istio mesh can also talk via the Kubernetes service endpoint.
However, because of the nature of HTTP/2 persistent connections, gRPC load balancing will not happen properly, so we had to do something about it. What we do is provision an Istio ingress gateway to front the virtual service. We still register the Kubernetes service endpoint in the virtual service, but we also provide our own custom, internally resolvable domain to be used by client application A. Because of the developer platform and the abstraction, developers can easily provision this Istio ingress gateway with a single click, and after the provisioning completes, they get the domain their clients can simply talk to. The third use case is when the client is calling from the public internet. This is similar to the previous use case, except that the ingress gateway is now publicly resolvable. After the developers provision the public ingress gateway, we provide them with a publicly resolvable domain, for example bar.gopay.com, to be used by their clients. We also set up mTLS connectivity by default, because enabling mTLS has become much easier with Istio. Now, we executed the Istio migration in three stages. The first stage is what we call the alpha rollout. During this stage, we found several teams who were eager to help us and become the alpha users of the new developer platform, the new abstraction tools. We picked the teams that owned the least critical services. At this stage, we assisted developers in onboarding their services to Istio through our portal and in updating their clients with the new domains, depending on where the clients were located: inside the mesh, outside the mesh, or on the public internet. During this alpha rollout phase, we took the chance to understand developer behavior during the migration process.
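Going back to the second use case above, fronting a gRPC service with an Istio ingress gateway on a custom internal domain could be sketched like this. The domain, namespace, and port here are hypothetical, for illustration only:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: app-b-gateway
spec:
  selector:
    istio: ingressgateway            # bind to the default ingress gateway pods
  servers:
    - port:
        number: 80
        name: grpc
        protocol: GRPC
      hosts:
        - app-b.internal.example.com # hypothetical internally resolvable domain
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: app-b-ingress
spec:
  hosts:
    - app-b.internal.example.com
  gateways:
    - app-b-gateway
  http:
    - route:
        - destination:
            host: app-b.app-b-ns.svc.cluster.local  # the in-mesh service
            port:
              number: 8080
```

The gateway's Envoy terminates the client's long-lived HTTP/2 connection and load-balances each gRPC request across the backend pods, which is what makes the persistent-connection stickiness problem go away.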
We gathered a lot of feedback and bug reports, and we also identified missing features in our abstractions and developer platform. We used all of that feedback to improve the UI and UX and our automation process. The second stage is the beta rollout. After we had iterated on our process and abstraction a few times, this time we picked several teams who were eager to help us as beta users and let them onboard at least one service by themselves, without our help. But we still supported them by observing how they interacted with our platform, how they performed the onboarding and migration process. Here, we made sure there were no onboarding blockers, no migration blockers, and that most of the Istio onboarding use cases were covered by our abstraction. During this stage we also found some very specific behavior in our Istio network. For example, our cluster was using Istio 1.7, and a service inside this mesh was trying to call other clusters in our system that were still on Istio 1.6. The gRPC traffic between those services got downgraded to HTTP/1.1, and it broke the gRPC connectivity between them. We also did some specific tuning in the Istio proxy, like the idle timeout and other settings, because of the behavior of the client libraries that developers were using. The third stage is the wide rollout, where we created several phases of the migration program. Through this program, the developers own the migration of all the services they own, end to end, until completion. We also partnered with the senior leadership team and the executives.
They endorsed our migration program and our developer portal. We follow up on the progress weekly and assist the developers if they face any difficulties. At this stage, we were able to speed up the migration process and monitor the migration completeness and progress. We have a weekly catch-up with the engineering managers from all the teams to monitor the progress; if progress is slow, we follow up and ask the engineering managers to help manage it. What's the result? Only four months after executing the alpha rollout stage, we were able to onboard 28% of our services to Istio and more than 50% of the teams in the GoPay organization, and so far we are receiving good testimonials from the developers; developers are happy. The impact of abstracting over the migrating infrastructure is very clear. So apart from continuing the wide rollout of this migration program, what is next on our roadmap? We plan to add more capabilities to our abstraction and developer platform and introduce more of Istio's power to developers, because the abstraction makes it easier and easier to introduce new infrastructure features to them. For example, we want to introduce the traffic mirroring feature, so developers can test in production with very low risk. We also want to visualize our system better with the service graph the service mesh offers. We want to decouple the network logic from the application code base, for example rate limiting and circuit breaking; right now developers implement all of this network logic in their libraries, in the code base, and we want to make the application code base more lightweight by delegating this network logic to the Istio proxy. And lastly, we want to explore the multi-cluster capabilities for failover and high availability, because that is also becoming easier with Istio.
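For the traffic mirroring item on the roadmap, Istio exposes this directly in the VirtualService API. A minimal sketch, again with hypothetical names:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: app-b
spec:
  hosts:
    - app-b
  http:
    - route:
        - destination:
            host: app-b
            subset: stable       # live traffic is still served from here
      mirror:
        host: app-b
        subset: canary           # receives fire-and-forget copies of requests
      mirrorPercentage:
        value: 10.0              # mirror 10% of live requests
```

Responses from the mirrored subset are discarded by Envoy, which is why this lets you exercise new code against production traffic with very little risk.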
So to recap, there are at least three key takeaways from our talk. First, for a successful and effective Istio migration, or any other major infrastructure or system migration, develop abstractions over the migrating infrastructure. Second, make the desired approach the new default; in our case, we had to make Istio the default way of deploying applications. Finally, iterate on the migration process and tools to ensure everything is fully validated, enabled, and finished before performing the wide rollout to the organization. We were heavily inspired by Melanie's approach to infrastructure migration at Airbnb. We met her at KubeCon China, I think back in 2019; we had a lot of productive discussions, and those discussions became the foundation of our migration approach to the Istio infrastructure. We were also inspired by Spotify's Backstage, their developer portal; from Backstage we borrowed the adoption metrics and the way to roll out an internal developer platform to the organization, and it has been really good so far. We would like to thank our developer platform team, who contributed to the success of this project. Thank you so much for listening to our talk. You can reach out to us with questions or for further discussion. Thank you.