Hello everyone. Wow, there are so many of you. This is our very first time in Chicago. We took a more-than-20-hour flight from our hometown of Jakarta, Indonesia to be here. A bit of jet lag, but we are very excited to share today what we learned growing an internal developer platform across companies, inside companies.

My name is Giri, and here is my colleague Joshua. Both of us are infrastructure engineers from GoTo Financial. GoTo Financial is the financial arm of GoTo Group, the leading digital ecosystem in Indonesia. Our ecosystem provides various service offerings: motorcycle ride-hailing (bike taxi), food delivery, package delivery, e-commerce, and many others. Today we are focusing mostly on the GoTo Financial case study.

When the company was just starting, we created a product called GoPay from the ground up. GoPay is a digital payment product that serves our customers' use case of paying cashlessly when they order services in our ecosystem. GoPay is now one of the largest payment companies in Southeast Asia. As the business grew, the GoPay engineering organization grew to more than 230 developers spread across 30 different teams, and we maintain 30 different Kubernetes clusters across multiple cloud providers and data centers. These clusters became the foundation of our cloud-native infrastructure.

As our team grows, the scale grows, and the cloud-native community moves really, really fast. As an infrastructure team, we always want to introduce the new things we learn from the community into our infrastructure. This makes our infrastructure more and more complex every day, and developers suffer from the things we ask them to do, let's say bumping their Helm charts. There was a time when we asked them to bump Helm charts three times a day to prepare for, let's say, the Kubernetes 1.16 upgrade and so on. It took us a few years to realize that we had to create a platform to abstract this complex infrastructure.

We started building this abstraction three years back with two purposes. The first was to shield developers from the learning curve of Kubernetes and its ecosystem. On the other side, we as an infrastructure team wanted to maintain and govern infrastructure easily across our fleet of Kubernetes clusters.

So what does our platform abstraction look like? We expose an interface that developers, the product engineers, use on a daily basis. This is nothing but a UI portal, a simple interface, and we expose only what the product engineers need instead of exposing entire Kubernetes YAMLs for them to configure and apply in our clusters. On the other side, we, the infra or platform engineers, maintain a set of standardized Kubernetes templates for our use cases: a couple of Helm charts that are standard across our infrastructure. This creates a separation of concerns between what matters to product engineers and what matters to infrastructure engineers. The platform then orchestrates the inputs that product engineers provide through the UI portal, combined with the standard Helm charts that we have: we generate the Kubernetes manifests, push them to our central GitOps repository, and leverage our Argo CD to sync applications across the clusters that we have.
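To make that orchestration step a bit more concrete, here is a minimal sketch, not our actual implementation: take the inputs a product engineer submits in the portal, render one of the standardized Helm charts with them, and commit the resulting manifest to the central GitOps repository for Argo CD to sync. The chart name, repository layout, and paths are all hypothetical.

```python
import subprocess
from pathlib import Path

def render_and_push(service_name: str, inputs: dict, gitops_repo: Path) -> None:
    """Render a standardized Helm chart with product-engineer inputs and commit
    the manifest to the central GitOps repo; Argo CD picks up the change later.
    Chart and repo paths are illustrative only."""
    # Inputs collected by the UI portal, e.g. {"replicas": 3, "cpu": "500m", "memory": "512Mi"}
    set_flags = [f"--set={key}={value}" for key, value in inputs.items()]

    # Render the platform-maintained chart instead of hand-written YAML.
    manifest = subprocess.run(
        ["helm", "template", service_name, "charts/standard-web-service", *set_flags],
        check=True, capture_output=True, text=True,
    ).stdout

    # Write the generated manifest into the GitOps repository and push it.
    target = gitops_repo / "apps" / service_name / "manifest.yaml"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(manifest)
    for cmd in (["add", "."], ["commit", "-m", f"deploy {service_name}"], ["push"]):
        subprocess.run(["git", "-C", str(gitops_repo), *cmd], check=True)
```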
Through this abstraction, this platform, product engineers are able to run their applications, scale the replicas of their workloads, route canary traffic to their workloads, and adjust CPU and memory sizes without writing or modifying a single YAML file. We only expose what matters to product engineers through this UI portal.

The impact of this abstraction over the past two to three years: we were able to roll out Istio, which we picked up, I think, two years back, to 100% of the services in our infrastructure, around 1,000 applications, within only a year. We were able to remove deprecated Kubernetes APIs to prepare for upgrades when there is a breaking change, for example when the community deprecates beta APIs, easily and without developers even knowing, so we are more disciplined in upgrading our clusters regularly. We were able to schedule moving workloads from on-demand instances to our pool of spot instances; we made this self-service through our platform, so it becomes easy for developers to run their applications more efficiently. And finally, we were able to improve the utilization of our clusters by automating regular resource adjustments across our infrastructure.

This abstraction, this platform, grew into a developer platform inside our company. We call it internally the GoPaySH developer platform. The platform consists of five different planes in our architecture. The first plane is what we call the developer control plane. This is the main interface, the gateway for product engineers to interact with our infrastructure and set of toolings; it is nothing but a common UI portal where developers can configure their application and its associated components. The second plane is the integration and delivery plane. This plane is responsible for building the application code into a container image artifact, pushing it to the image registry we maintain, and then instructing our Argo CD to roll out those images across the environments and clusters we maintain in our infrastructure. The third plane is the monitoring and logging plane. This helps developers debug, troubleshoot, and monitor their running applications across environments. And the fourth plane is the security plane, which is responsible for helping us govern resource access across the infrastructure we maintain and for managing our secrets. These four planes work and collaborate closely together, forming a management layer, which eventually creates, edits, and sometimes deletes the real resources in our infrastructure, across Kubernetes pods and virtual machines.

We make our platform extensible through an implementation of the Open Service Broker API specification. We have a concept called the add-on marketplace. Each add-on is maintained by a different infrastructure team across the organizations, and sometimes product engineers also contribute add-ons that they want to make available across the organizations. With this concept, they don't need to make changes to our front-end code base: because there is a JSON schema provided by the backend, we can generate the UI form in the portal without asking them to contribute to our common code base. Through this add-on marketplace, product engineers are able to do anything beyond Kubernetes deployments: things like provisioning Redis, provisioning databases, resizing them, day-2 operations, configuring logging, and exposing domains to the public, sometimes exposing domains to our third-party partners.
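As an illustration of that add-on mechanism, here is a hedged sketch of what a schema-driven add-on definition could look like. The Redis add-on, its field names, and the plans are hypothetical; the idea is simply that the backend publishes a JSON schema and the portal generates the form from it, in the spirit of the Open Service Broker API's schema support.

```python
import jsonschema

# Hypothetical schema an add-on team would publish from its service broker;
# the portal renders an input form from this instead of needing front-end changes.
redis_addon_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "Redis add-on",
    "type": "object",
    "properties": {
        "plan": {"type": "string", "enum": ["cache-small", "cache-large"]},
        "memory_gb": {"type": "integer", "minimum": 1, "maximum": 64},
        "high_availability": {"type": "boolean", "default": False},
    },
    "required": ["plan", "memory_gb"],
}

# What the generated form would submit back to the broker for provisioning.
provision_request = {"plan": "cache-small", "memory_gb": 4, "high_availability": True}
jsonschema.validate(provision_request, redis_addon_schema)  # raises ValidationError on bad input
```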
They are able to do all of this from the UI portal without knowing too many details. And because of this extension mechanism, we are able to separate the lifecycle of the core platform from the lifecycle of each of the add-ons.

As the business grew, and to support the business expansion, we acquired multiple companies. We acquired a company called Midtrans, an online payment gateway company, to support our online payment use cases. We acquired Moka, a point-of-sale system company, to support our offline integrations with merchants across the whole of Indonesia. We acquired multiple lending companies to support GoPay Later, the lending and installment features in our app, and we acquired a few other companies and made tight strategic partnerships with other institutions. Together these became the GoTo Financial family.

While these acquired companies make critical contributions to our business, they gave us a challenge: each of them has its own independent infrastructure team, or platform team. What are the problems with these independent infrastructure teams? There are a couple. First, there is very high variance in the infrastructure stacks and tools used across GoTo Financial, even though we work closely together and collaborate to maintain the products. For instance, some companies use Helm and Kustomize to deploy their Kubernetes manifests, some use Flux CD, and we ourselves lean heavily on Argo CD as a standard. This creates unnecessary friction when collaborating across engineering organizations. Imagine how fragmented the monitoring dashboards are: company A has its own monitoring dashboards, the other companies have theirs, so it is very hard to correlate during an incident, and very hard to correlate one service to another when the services are spread across engineering organizations. We identified that these infra teams are redundant; they are all trying to achieve the same thing, to serve product engineers and make their lives better. And these redundant tools gave us cost-saving opportunities.

So, to solve these problems, we decided to consolidate the infrastructure teams. From the independent infra teams, we consolidated into one infra or platform team, a common shared team that serves not just one product or one engineering organization but the entire engineering organization across GoTo Financial. We tried to do more with less.

However, consolidation again gave us challenges. When we talked to many engineers in each of the organizations, we learned the differences in their practices, the gaps, and the use cases we had not yet covered in our platform. We had a dilemma: whether to extend our developer platform to support their new use cases, or to negotiate and ask them to deprecate their existing tools and practices and adopt what we have at the moment. Another challenge is that we run in a very highly regulated industry, the financial industry, and each company must comply with a different set of regulations. For instance, for our payments business we need to comply with the Central Bank of Indonesia's regulations; for our lending business we need to comply with the Financial Services Authority, which sometimes gives a different set of compliance requirements from the Central Bank's; and for our payment gateway we need to comply with the PCI DSS standard.
We got lucky: all of these acquired companies were already running on Kubernetes. They have completely containerized their applications. This makes them good candidates to onboard to our developer platform. Their Kubernetes clusters run across GCP, AWS, and a private data center. Next, Joshua is going to share our consolidation journey and how we onboarded these acquired companies onto our developer platform.

Okay, thank you, Giri, for the great introduction. So what do we mean when we say that we are going to onboard the workloads of the other teams to our platform? Giri has just mentioned that, fortunately, all of the companies we are going to onboard are already using Kubernetes. This means that the respective companies and teams have made sure their workloads can run on top of Kubernetes and, most importantly, are containerized. We had an experience back then where we had to onboard workloads from VMs to Kubernetes, and that was far more painful indeed. It also means there is already a running Kubernetes cluster with live workloads and incoming production traffic. So we were thinking: to make the process faster and more efficient, why not use these existing clusters rather than creating new ones from scratch?

The first thing we do to enable and unlock the features of GoPaySH on a particular Kubernetes cluster is to install the operators, controllers, and agents that are needed on the cluster and correspond to our product offerings. Each of these clusters is managed as its own Argo CD application. We then define several namespaces that we label as GoPaySH-managed; workloads deployed to these GoPaySH namespaces are auto-injected with the Istio sidecar. Namespaces, in our case, are mapped to the team structure for visibility of cost, versioning, resource tagging, and so on. And if we see a potential for clusters to be merged, we merge them: let's say there is no compliance blocker and it turns out that services from different companies serve the same purpose, then that is one less cluster to manage.

The second thing is the team itself. We define particular Google groups that the infra team and developers are part of, and then we add them to the larger Google group of users of our platform. We proceed with the creation of the namespaces, and then we define RBAC. Product engineers get limited, view-only access to the Kubernetes cluster: no edit, no delete. In the default scenario, they can see their running workloads on the Argo CD dashboard we expose, and they can also file just-in-time access requests for emergencies, while the infra team has privileged access to the clusters.

Currently, the way product engineers use GoPaySH as their deployment tool is that they copy a deployment script, available on the UI portal, into their repository's pipeline configuration. This triggers a GoPaySH client binary that calls our deployment API. The call retrieves and passes several pieces of metadata, such as the name of the service, which cluster and namespace to deploy to, and, most importantly, the URL of the container image in an artifact registry. And since, as mentioned, all the companies are already using Kubernetes, they already have their own way of building and containerizing their applications.
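Going back to the cluster preparation step for a moment, here is a rough, hedged sketch of what marking a namespace as platform-managed and giving product engineers view-only access could look like with the Kubernetes Python client. The gopaysh.io/managed label, the namespace name, the group name, and the role names are hypothetical; istio-injection: enabled is the standard Istio label for sidecar auto-injection.

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster

TEAM_NS = "payments-team"           # hypothetical team namespace
VIEWER_GROUP = "gtf-payments-devs"  # hypothetical Google group mapped to a Kubernetes group

# A platform-managed namespace; the istio-injection label means workloads
# deployed here get the Envoy sidecar injected automatically.
client.CoreV1Api().create_namespace({
    "metadata": {
        "name": TEAM_NS,
        "labels": {"gopaysh.io/managed": "true", "istio-injection": "enabled"},
    }
})

rbac = client.RbacAuthorizationV1Api()

# Product engineers get read-only access by default: get/list/watch, no edit, no delete.
rbac.create_namespaced_role(TEAM_NS, {
    "metadata": {"name": "workload-viewer"},
    "rules": [{
        "apiGroups": ["", "apps"],
        "resources": ["pods", "services", "deployments", "replicasets"],
        "verbs": ["get", "list", "watch"],
    }],
})
rbac.create_namespaced_role_binding(TEAM_NS, {
    "metadata": {"name": "workload-viewer-binding"},
    "subjects": [{"kind": "Group", "name": VIEWER_GROUP,
                  "apiGroup": "rbac.authorization.k8s.io"}],
    "roleRef": {"kind": "Role", "name": "workload-viewer",
                "apiGroup": "rbac.authorization.k8s.io"},
})
```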
As mentioned, their repositories already have this container image information, which we can simply reuse. If they want to create a new application, they can also create one from scratch, and we provide a uniform way to build it, let's say with a predefined base image.

The gap usually lies in the structure of the Helm chart, because that reflects the nuts and bolts of the running services, and it can differ from the charts centrally managed by our infra team. For example, we found that in other companies, some workloads are deployed as StatefulSets for their corresponding use cases, or their services need a file mount for a Java application's startup configuration. We knew there would be no workaround or alternative approach for these needs, so we had to extend our platform's support to accommodate these scenarios.

Monitoring, logging, and alerting are essential for product engineers to analyze the health of their running services and to debug any kind of production issue. On our side, our infrastructure team hosts our in-house ELK stack. For custom metrics, we already support both the push method, via a StatsD exporter into VictoriaMetrics, and the pull method, where developers expose data on, let's say, a /metrics endpoint that is then scraped by an agent. So developers from the other teams can just use whichever of these methods they find most comfortable.

One of the interesting things we encountered in this onboarding journey is that some of the companies need to comply with certain regulations, as mentioned, which require all data, including metrics data, to be located locally in Indonesia. Previously our agent pushed metrics data to our central infrastructure cluster; we reversed this so the metrics data stays on the respective local cluster, while the monitoring dashboards remain intact and accessible. For logging, we found some companies writing their application logs to files, so we encouraged them to write to stdout instead, as favored by the twelve-factor app standard. Some have their own log parser running as a sidecar; in that case, rather than extending our support to add another sidecar besides the Istio-injected one, we ported those log parsers to our existing td-agent DaemonSet.

There are some considerations we abide by when we encounter these kinds of gaps and differences in the onboarding journey. The first: if there is already a guideline that most, if not all, members of the developer community agree has real upsides and benefits, there seems to be no reason not to follow that existing, so-called golden path. The second is definitely compliance. FinTech is highly regulated, as mentioned, so if there is a requirement, we can't say no to putting in the effort to comply, because otherwise our company can't operate.
The third: if we find a discrepancy, meaning changes need to be made either on our side as a platform or on their side, whether in their workloads or their infrastructure, we need to keep those changes to a minimum, because besides the consolidation they are already busy developing and supporting their business-as-usual processes. And the last one, I think, is the big theme applicable to most companies in recent times, one that most of us infrastructure folks in the room can relate to, whether or not this kind of consolidation event is currently going on in your company: if your company has scaled to a point where silos are likely to form across teams, it might be a good opportunity to audit and recap once in a while the technologies and the engineering culture across teams. Even if autonomy is a good thing, and every team has its own flavor and way of doing things, it has to support a bigger goal and be backed by a strong reason; hurting productivity, creating redundancy, or adding context switching is not one of them. I want to mention that the decision to support or deprecate these technologies is, especially in our case, highly contextual. It is definitely not a judgment on the quality of the services and products themselves; more often than not, it is about the capacity, the bandwidth, the familiarity, and the experience the teams already have.

So how are we making this consolidation happen? The first thing we did was create a task force comprising one to two members from each of the smaller teams in our infrastructure department; they are responsible for the whole consolidation journey and hold regular cadences with the engineering representatives of their respective companies. We knew that if we want to go far, we need to go together, so finding allies and getting their buy-in plays a big part in making this consolidation successful. One thing we learned is that buy-in needs to work both ways: with leadership and with the engineering teams. We want to make sure that this long consolidation process ultimately benefits all parties, and for that to be understood, we need to talk in their language. Even the most technically proficient managers, who understand how it all works, are unlikely to be able to convince upper leadership or the C-level unless we talk about, say, cost or better productivity and how these will benefit the company going forward. For product engineers and infra teams, it means making sure they are convinced their lives will be made easier, both during the onboarding process itself and afterwards. We achieve this in several ways, like finding the right balance in how we cover the gaps and making sure the product offerings of our infra team are genuinely useful to them. For example, we found that product engineers from the onboarded teams are highly interested in our canary deployment feature, with traffic weighting and routing, which every user of our platform gets out of the box once onboarded. And the infra teams themselves admit that creating and managing certificates manually is painful.
They are drawn to our centralized certificate management. The vertical cooperation with leadership is what we call mandate, and the horizontal one with the engineers is what we call movement. Sometimes we tend to focus on just one, thinking that winning one side will help us win the other, which might not be the case; in our scenario, we found that winning both is equally important. Along the journey, we found several people that, to borrow a term, we could call allies: people convinced by the vision and the goal of this consolidation, who show enthusiasm and are willing to help. We found them in leadership when we presented to them, and later they helped us by giving guidance and alignment to the engineering teams they manage. We also found them among the engineers and infra teams who could see the benefit of this consolidation. We do this over impromptu lunches, sharing sessions, and whatnot.

Usually the onboarding of these companies is done in several steps. We start with a discovery analysis of the gaps between what our infrastructure can serve and what they expect. We pick two to three services and try to onboard them to our platform as a proof of concept. Then we proceed with the preparation of the platform and the cluster, as described earlier. Planning is done by picking the least critical services, educating the service owners and the infra team, and estimating how long the onboarding process might take. When we onboard a workload, we create an exact copy of the running workload in the GoPaySH-managed namespace, which can easily be done through our GoPaySH platform UI portal. We then check with the service owner and the infra team whether the onboarded service meets their expectations: whether the connectivity is there, whether one service can already talk to another, whether logging and monitoring show up on the dashboards. Then we cut over by rerouting the clients' call destination, removing the old workload, and deprecating the old pipeline. And last, we make sure the handover of all processes and managed components, and all related information, is well documented so it can be passed to the respective team later.

We make workload onboarding very easy, just a few steps and clicks through our UI portal. They input the running workload's name, and the portal calls the Kubernetes API to retrieve the object's data and specification, let's say the exposed application port, the probe configuration, and the CPU and memory resources. We store that on our end, generate the deployment script in the UI portal, and they can just copy-paste this script to be triggered later from their repository to deploy their workload.

An interesting story here: as mentioned, the initial motivation for building the GoPaySH platform was to onboard the existing GoPay workloads, which were already on Kubernetes, onto a service mesh with Istio. We were trying to onboard product engineers' live production workloads, serving live traffic, with zero downtime, which is also why we made the onboarding really easy, as shown on the previous slide. Giri and my other colleague, Imre, talked about this in depth at the previous KubeCon in LA.
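To illustrate the workload discovery step just described, here is a hedged sketch, using the Kubernetes Python client, of how a portal could read an already-running Deployment and pull out the fields needed to pre-fill the onboarding form. The function, the service name, and the namespace are hypothetical; the client calls themselves are standard.

```python
from kubernetes import client, config

def discover_workload(name: str, namespace: str) -> dict:
    """Read a running Deployment and extract the fields an onboarding form needs
    (image, ports, probes, resources, replicas). A sketch only, not the real portal logic."""
    config.load_kube_config()
    deployment = client.AppsV1Api().read_namespaced_deployment(name, namespace)
    container = deployment.spec.template.spec.containers[0]
    return {
        "image": container.image,
        "ports": [p.container_port for p in (container.ports or [])],
        "liveness_probe": container.liveness_probe.to_dict() if container.liveness_probe else None,
        "readiness_probe": container.readiness_probe.to_dict() if container.readiness_probe else None,
        "resources": container.resources.to_dict() if container.resources else None,
        "replicas": deployment.spec.replicas,
    }

# Example (hypothetical names): discover_workload("checkout-api", "payments-team")
```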
Back then, we had tried the manual way: tracking all of the services in a spreadsheet and circulating info and guidance to the product engineers. After a painful 12 months, it failed because it stalled, and it turned out that was bound to happen again. Now, none of the companies we are currently onboarding use Istio, and the initial reaction was hesitation. They were asking: what will happen to our workloads? Will there be any downtime? Will there be more cost? And if we think from their perspective and put ourselves in their shoes, that is very understandable: we are changing how a running workload that must be up 24/7 and serves millions of users is operated. Our three years of running Istio in production might not be long compared to others, but we embodied that experience in how we onboard them to our service-mesh-enabled platform. We pick a service and make it a proof of concept, so they can see the benefits and upsides they can get. We pair with them through a regular cadence, and we use data-backed arguments in response to their inquiries.

The two questions we get the most are: if we introduce more components, won't that mean more infra cost? And if we introduce a proxy in front of their workloads, doesn't that mean more latency? From our experience, both are negligible, and the benefits of using Istio far outweigh these two concerns. For instance, they get leaner code, because the application code no longer has to handle everything related to networking and security. They can have canary deployments that route a small percentage of their production traffic. They get the service graph of Kiali and Jaeger to visualize service calls, plus the wealth of metrics generated by Istio and Envoy about their application. None of this existed in their earlier setups, but they get it out of the box just by getting onto our platform.

Now, Giri will continue the story with the current results of the consolidation, which is still in progress.

So how have we done with the consolidation so far? We are still in the middle of it. We have migrated more than 700 services to our platform, covering 11 clusters and eight teams across the acquired companies. We started this consolidation activity in the middle of 2023, this year. We have deprecated 10 redundant tools and identified 50 more that can be deprecated. We have deprecated more than 20 Helm charts, out of the more than 100 charts that existed before. We have freed up 30 developers across these companies, so they have more bandwidth to prioritize working on their product features. And because we were able to deprecate redundant tools, we have seen a reduction in our license spending collectively across the GoTo Financial family.

This consolidation activity adds more workload and responsibility to our infra team. Right now we maintain around 50 Kubernetes clusters across AWS, GCP, and a private data center in the Singapore and Indonesia regions. This amounts to more than 700 compute nodes and 3,000 pods. It also adds extra load to our Argo CD. Currently our Argo CD is centralized, a single instance, which obviously hits scalability issues, and we have had to tune its performance along the way.
Right now, our central Argo CD is responsible for 11,000 applications and 6,000 repositories across around 60 projects, and it watches more than 380,000 objects. I'll be talking more about this, how we tune our central Argo CD instance, with my other colleague tomorrow. If you are interested, please come to our talk at 11 a.m. tomorrow.

To recap, there are three takeaways from our talk today. First, consolidating tools, whether across teams or acquired companies, definitely brings cost benefits. We were able to recover productivity by freeing up developers to work on what matters to them, and to improve collaboration through shared tooling and shared dashboards across engineering organizations. Second, finding allies across the companies and engineering organizations was very, very helpful. We were able to address the gaps early, gather the requirements before starting the project, and decide whether to add support to our platform or ask them to deprecate. These allies also helped us advocate for adoption within their respective engineering organizations. Finally, we favor rolling out this consolidation through movements, as we saw when we first rolled out the platform organically within the organization because of the simplicity it offered. But sometimes there are teams or organizations that are a bit hesitant, a bit unconvinced; then we had to ask leadership for sponsorship and mandates to at least speed up alignment during the discussions. We were heavily inspired by a similar journey at Expedia, and we referred a lot to guidance from Elastic, JFrog, and Google. Thank you for having us. We are happy to take questions; we still have five minutes. Please scan the QR code to leave feedback for our session. Thank you.

Yeah, so I was wondering if you can talk a little bit more about how you manage app developers' code versus the configuration you're managing for all the infrastructure that backs it. I think you mentioned a single repository or something. Is all the configuration actually centralized across all those applications, or how are you organizing that? Does the question make sense?

Yeah, good question. I'll repeat the question: how do we manage the configuration for each application, whether it's one big monorepo or separate repositories? We use a mix of both. Through this portal, let's say users want to create an application on our platform; we usually generate a repository for that one application. This per-application repository is nothing but a GitOps repository, which can contain multiple sub-directories for the multiple target clusters they want to deploy to, let's say a staging cluster, a production cluster, production Singapore, production Indonesia, as smaller directory structures within the application repository. We generate the manifests and push them to this repository, and these are what our Argo CD applications get synced from. So if we have, let's say, 1,000 applications, there will be 1,000 GitOps repositories to manage. On the other side, in an earlier slide there was the cluster runtime standardization; for that we use a monorepo, one repo for the base layer of components the cluster runtime needs in order to work. Tomorrow we will cover how we tune Argo CD performance for the monorepo and also for the many repositories; each gives us a different set of problems. Hope that answers your question.
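For a rough picture of that per-application layout, here is a hedged sketch; the cluster names, service name, and file layout are hypothetical, but the idea is one sub-directory per target cluster inside the application's GitOps repository, each watched by its own Argo CD application.

```python
from pathlib import Path

# Hypothetical per-application GitOps repository layout: one sub-directory per
# target cluster, each synced by its own Argo CD application.
TARGET_CLUSTERS = ["staging", "production-indonesia", "production-singapore"]

def write_manifests(repo_root: Path, service: str, rendered: dict) -> None:
    """`rendered` maps a cluster name to the manifest generated for that cluster."""
    for cluster in TARGET_CLUSTERS:
        cluster_dir = repo_root / cluster / service
        cluster_dir.mkdir(parents=True, exist_ok=True)
        (cluster_dir / "manifest.yaml").write_text(rendered[cluster])

# Resulting tree, roughly:
#   <repo>/staging/checkout-api/manifest.yaml
#   <repo>/production-indonesia/checkout-api/manifest.yaml
#   <repo>/production-singapore/checkout-api/manifest.yaml
```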
Any other questions? You can use the mic.

Great talk. How are you managing environment promotion? Do you visualize that for developers in your UI portal? And maybe a little more information around that?

The question is how we manage promotion across environments. So far, we haven't abstracted our pipelines yet, so we lean a lot on the existing pipelines. Some teams use GitLab runners, some use GitHub Actions. As long as they have a job that fetches our GoPaySH binary, which runs and calls our APIs, that's good enough for now. But on the roadmap we plan to bring that into our platform, because it's getting harder to control and standardize the flows. That's our next plan. Thank you.

Have you run into issues with operating multiple networks with the acquisitions? For instance, with PCI, network segmentation between on-prem and cloud was a big issue we had to deal with. How are you dealing with that?

That's a great question: how do we manage network connectivity across clouds? For us, one of the companies requires PCI DSS, and for the ones that require PCI DSS, for now we don't touch the network; it stays as is. We haven't consolidated networking yet. So how does our single-instance Argo CD connect to those target clusters? Obviously, if we create tunnels, a tunnel mesh, to each of the target clusters, it's a headache: there is no guarantee their network ranges are unique relative to what we already have. So for now we prefer going over the public network with mTLS, because then we don't need to ensure unique network ranges to connect them. mTLS makes it very easy for us to connect to the target clusters, and because we are already running Istio, it's easy to manage the mTLS connectivity, the certificates, and so on. I hope that answers your question.

Next question. I'm just curious: on one of your slides you mentioned saving 30 developers by migrating onto your developer platform. I'm curious how you quantified that.

Well, it's quite easy. Each of the acquired companies has its own infrastructure team, and there are around 30 infra engineers spread across the acquired companies. We don't take them into our common team, so they get to focus on the product work in their group. That's basically how we quantified it.

Alright, I think the time is up. We'll continue questions in the hallway track. Thank you so much.