Hi, welcome to the Linux Foundation Open Source Summit 2021. My name is Venkat, I work as a senior architect at Razorpay, and I have with me my colleague Srinidhi, who works as a senior software engineer at Razorpay. Today we would love to talk about our journey into improving our developer experience, and how we built a cloud native dev stack for hundreds of engineers.

A quick preview of what Razorpay does: Razorpay enables frictionless payment, banking and lending experiences for merchants of various scales and sizes. Today we process billions of transactions for millions of merchants across the country. In the last five years of our existence, Razorpay has been at the forefront of a lot of payment and technology innovation in the country.

Here is a quick overview of what our growth metrics look like, and this is a motivating factor for why we embarked on this journey in the first place. Over the last four years, our engineering headcount has grown by about 10x. We have scaled out to full-fledged pods and business units, four business units to be precise and about 30-plus pods, and we are still growing. We have about 100 microservices in production; over the last two years alone we have added about 50-plus microservices, and that number is still growing. We have acquired three companies in the last three years, which has led to a polyglot stack with multiple languages in operation today. It also means we run about 1,500 deployments per month.

Next, a quick overview of what our CI/CD practice looks like, which again is part of the motivation for this talk. In our traditional development process, developers raise their PRs or commit code on GitHub. We use GitHub Actions as our continuous integration tool for a lot of our build operations. This is something that has evolved over the last five years; we have gone through a variety of CI/CD stacks, and the picture you are seeing here is what we have today. GitHub Actions is generally used for creating all our build images, running basic unit tests and a variety of other things, with integrations to ReportPortal and a bunch of other reporting tools.

Once the images are available, a developer can deploy their code to a pre-prod environment. We use Spinnaker as our continuous deployment tool; Spinnaker, for those of you who do not know, is an open source platform created by Netflix. In Spinnaker we run a bunch of tests, whether integration tests, performance tests or any other tests. Once the developer is satisfied with the test runs, they go on to deploy their code onto a canary infrastructure and then to production. We rely heavily on canary infrastructure for many use cases, primarily because our work deals with one of the most sensitive aspects of people's lives, which is their money. So we need to be really, really careful about how we roll out code to production. Developers deploy their code to canary; Spinnaker provides a mechanism called Kayenta for this. We run a lot of threshold tests on canary against our monitoring infrastructure, from Prometheus to distributed tracing to whatnot. Once the canary thresholds pass, the deployment is automatically promoted to production; if the thresholds fail, the deployment is rolled back. Pretty much all our infrastructure runs on AWS.
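To make the CI part of this concrete, here is a minimal sketch of the kind of GitHub Actions workflow described: build the Docker image, run unit tests, and push the image to a central registry so Spinnaker can pick it up. The workflow name, registry URL and test command are illustrative assumptions, not Razorpay's actual pipeline.

```yaml
# Illustrative sketch only, not the actual Razorpay workflow.
name: build-and-test
on:
  pull_request:
  push:
    branches: [master]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      # Tag the image with the commit SHA, matching the convention mentioned
      # later of using the branch's commit hash as the image tag.
      - name: Build image
        run: docker build -t registry.example.com/api:${{ github.sha }} .
      # Run the unit tests inside the freshly built image.
      - name: Unit tests
        run: docker run --rm registry.example.com/api:${{ github.sha }} make test
      # Push to the central registry (registry login omitted for brevity).
      - name: Push image
        run: docker push registry.example.com/api:${{ github.sha }}
```

Once an image like this lands in the registry, the Spinnaker pipelines described above take over for pre-prod, canary and production.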
By the way, we were probably the first in the country, in India at least, to go live with a full-blown, production-grade Kubernetes infrastructure, somewhere in late 2016. Today we have rewritten all of our infrastructure to run on EKS, which is AWS's managed version of Kubernetes. Since we are on Kubernetes, we have a heavy dependency on Helm charts: all our deployments are packaged as Helm charts. Spinnaker is something we already spoke about. The entire infrastructure is encoded in Terraform, and specifically to make this self-serve, we are using an open source framework called Atlantis that allows developers to freely provision infrastructure in a self-serve fashion. And GitHub Actions, which we already spoke about, is the CI part of the story.

Now, one thing that might be quite obvious is that this development process happens for every single feature release, in more ways than one. A lot of time goes into it, primarily because of the way the CI/CD process is constructed. Our CI/CD process is not very different from many other places; independent of whether it is continuous deployment or otherwise, there are always going to be challenges in how you roll out code to production, and what that primarily means is that there are development challenges which hinder agility and productivity.

Here is a quick overview of what some of those challenges look like. On the one side, we have general sequential development challenges. Today, because of the number of developers we have and also because of our microservices journey, the development process is largely sequential. What we mean by that is that there could be tons of developers working on a particular repository. Each one has a dependency on how features are getting rolled out, and they have to coordinate amongst themselves, whether it is a two member team, a five member team or a 15 member team, on how development can happen in a systematic fashion. The second part of the problem is that we have a lot of dependency on AWS-specific cloud services like SQS, SNS, et cetera. Most of these are today mocked in the local dev environment, which reduces confidence to begin with. The other part of the problem is that even a single change, a single commit, requires this entire process of CI plus Spinnaker plus deployment plus a bunch of other things.

On the right hand side are our problems specifically with shared environments. What do we mean by a shared environment? As a developer, if I am working on a service, I am only going to be working on that one service, and I probably do not need multiple copies of dependent services running in a pre-prod environment. But because I have a dependency on another service, it is quite possible that another developer working on that dependent service has made a change which breaks my tests. So how do you scale out shared environments for hundreds of developers and hundreds of services, and in a cost effective fashion? That is one side of the story. The other part of the problem is, because these environments are shared, how do you meaningfully demo feature changes to product and business teams without worrying about another developer overriding your changes?
The third part of the problem is that a lot of our work depends on integrations with third-party partners, gateways, banking providers and a bunch of other things. Many of these integrations can run from days to weeks, which means that every developer integrating with these partners or gateways needs an environment that stays exposed for several weeks at a time. How do you keep that up and running in a seamless fashion?

The other set of problems is around infrastructure provisioning itself. Like I mentioned, all our infrastructure is encoded via Terraform. Even though we have Atlantis, there is a burden on developers to learn the Terraform DSL to be able to provision most of the infrastructure that is needed. Then there are the debugging challenges. Because every code commit goes through the entire development and deployment process, even to debug a simple code break developers go through a lot of context switching, between the actual application that is running and going off to look at something like a trace or logs, et cetera. The other part of the problem is the time it takes to build and deploy the code itself. Once the code is committed, there is an entire pipeline of building a Docker image, running the unit tests and pushing the image to the central Docker registry, and this takes a significant amount of time. Testing small features just adds to that burden, when developers need to move faster with quality.

Going through all of this, what we realized was that a lot of developer time was being spent purely on deployment challenges, time that could have gone into far more productive work. That means we need a way to simplify the developer workflow, and we need to reduce the time to roll out features while letting each developer operate independently, without depending on other developers. With that, I'd like to hand it over to Srinidhi, who will walk you through how we tried to solve this problem.

Thanks Venkat for the introduction. Talking about the goals of the project, we are primarily focusing on reducing the cognitive load on developers. That is achieved by, one, having a streamlined workflow; two, having environmental consistency; and three, providing a faster feedback loop from local development. The following are the key decision factors on which the solution was built. The first is to rely completely on open source and have zero vendor lock-in. The next is to make sure the solution is Kubernetes native, as our environments are on Kubernetes. The third, and an important one, is to have hassle-free onboarding, which means we wanted to onboard applications with minimal changes and without drastically changing the development or deployment lifecycle. And last but not least, we wanted a cost-effective solution, so that we do not burn cash in order to get this ephemeral infrastructure. Our solution is lightly opinionated, it is what fits our use cases best, and it is not a complete PaaS solution. The code name of this particular solution is DevStack, and this is Razorpay's journey towards a better development lifecycle.
So let's try to understand the solution by asking a series of questions and answers, and coming to a logical conclusion before the demo. The first question is: how do we bypass the CI/CD loop for iterative development? The need is to directly expose the local code onto the cloud environment. As the V1 version of the solution, we went ahead with Telepresence, which follows a proxy-based approach. Telepresence is a client that sits on both the local system and the remote cluster, and makes sure the connection happens through a tunnel. It took care of DNS resolution, volume mounting and also networking. But there was a major drawback due to the whole tunneling approach: responses were slow, there were connectivity issues with the database and sometimes with the cluster itself, and with VPN there were a lot of bottlenecks.

So how do we avoid the network limitations, and how can we ship code to the remote container without relying on the network? We had to shift towards a file-sync based approach, where we used a tool called DevSpace. DevSpace syncs the local code into the remote container in a very efficient manner, and it also handles reloading via container restarts. DevSpace also has port forwarding and log streaming, which provided scope for further features. While it all looked good, there was a hiccup, a limitation per se, due to the container restarts. The container restarts, which are governed by Kubernetes, were flaky at times and not completely reliable. We wanted something better and faster.

So the next question is: how do we avoid these container restarts? To avoid them, we had to put a library inside the container that takes care of hot reloading. There are such libraries for every language, like CompileDaemon for Golang, nodemon for Node.js and so on, which rely on watching for changed files. These are the files synced by DevSpace; the library then rebuilds the binary and runs the new one, which is made available to the server. This effectively avoids the container restart and thus does not break the flow. We could build a generic Docker container per language this way, and use that container with the DevSpace command to make onboarding seamless. These three questions helped us solve the problem of local sync: we have now short-circuited the feedback loop to just running a command.

The next question is: how do we orchestrate multiple services? How can developers declaratively define and apply service dependencies? The solution for that is Helmfile, which is a wrapper on top of Helm. Helmfile helps us compose several charts together to create a comprehensive deployment artifact, for anything from a single application to the entire infrastructure stack. We define a term called a service fleet here, which is the collection of microservices required by a developer for his or her workflows. Helmfile works seamlessly with our existing Helm packages, as all of our applications already had Helm charts, so we did not need to write anything extra; we just had to wrap all of them in a single YAML file, as shown on the right side of the screen.
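As a rough illustration of such a service-fleet file (a hypothetical sketch, not the actual file shown on the slide; chart paths, tags and the label key are made up), a helmfile.yaml wrapping two existing charts could look like this:

```yaml
# Hypothetical service fleet: two releases of existing charts, suffixed with a
# per-developer DevStack label so they don't collide with anyone else's copies.
releases:
  - name: api-srini-demo
    namespace: api
    chart: charts/api
    values:
      - image:
          tag: 1a2b3c4d            # commit hash of the branch being worked on
        devstackLabel: srini-demo  # suffix applied to all ephemeral resources
  - name: dashboard-srini-demo
    namespace: dashboard
    chart: charts/dashboard
    values:
      - image:
          tag: 1a2b3c4d
        devstackLabel: srini-demo
```

Running `helmfile sync` against a file like this deploys, or updates, the whole fleet in one shot.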
So Helmfile took care of orchestrating all the Kubernetes resources, and that solved a major part of the problem of providing Kubernetes resources like your deployment, service, ingress, job, cron job and whatnot. But the application is not complete without its other requirements, things like SQS queues, databases or other AWS resources which are not on Kubernetes. So the next question is: how do we provision ephemeral infrastructure resources? We relied on Helm hooks for this provisioning. Helm hooks provide a plug-and-play model to maintain the dynamics of an application's auxiliary requirements. What we mean by that is that we have written two or three Helm hooks per requirement, which can be plugged into an application based on its needs. For example, an application using SQS queues can plug in an SQS configurator as a hook, whereas an application using Kafka queues could plug in a Kafka configurator. Similarly for all the other resources, the plug-and-play model fits in. And to avoid maintaining real AWS resources, which was done via Terraform, we used LocalStack, which provides a framework for mocking AWS components.

This is how a Helmfile workflow looks when you run the command. The chart is verified, and it runs a bunch of pre-install hooks which make sure that the auxiliary requirements of the application are up; then it loads the charts, the Kubernetes resources, and deploys them into the Kubernetes cluster; and then it runs some post-install hooks as well, which can include things like the ingress route configuration or validation of the generated manifests. That makes sure the ephemeral service, the ephemeral infrastructure per developer, is available by running just one command. This is how we solved the problem of having a streamlined workflow for developers to bring up their ephemeral service fleet.

Now, with the ephemeral infrastructure using Helmfile and local sync using DevSpace, there is one piece in the whole puzzle that still needs to be sorted: how do we make sure the traffic is routed to the right user's service? For that we use header-based routing. Our ecosystem, our Kubernetes clusters, had been using Traefik for ingresses, and we upgraded to Traefik 2.0, which supports header-based routing out of the box. The Traefik ingress route configuration has multiple rules to drive the dynamic routing. For example, on the right-hand side of the presentation we have two services, api-web and api-web-srinidhi, and based on the header, Traefik makes sure that a request carrying that header is propagated to the api-web-srinidhi service, which in turn puts the request into the api-web-srinidhi deployment, or pod. This is how we have enabled header-based routing to solve the routing problem.

The next question is: how do we make sure requests to upstream services are routed properly as well? We use header propagation there, and we have piggybacked on OpenTracing, which propagates headers by default; at every service, Traefik reads the header and routes the request to the appropriate instance. Let's walk through a use case of how that routing works, a routing overview. For example, take a use case where we have two applications, app one and app two, and assume app one is a gateway service and app two is the actual service that processes a request and returns a response.
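Coming back to the hooks for a moment before the routing walkthrough: as a sketch of what such a plug-and-play configurator could look like (the queue name, namespace suffix, image and the in-cluster LocalStack endpoint are all assumptions, not our actual chart), a pre-install hook Job that creates an SQS queue against LocalStack might be written roughly like this:

```yaml
# Illustrative pre-install hook: provisions an SQS queue on LocalStack before
# the application's Kubernetes resources come up.
apiVersion: batch/v1
kind: Job
metadata:
  name: "{{ .Release.Name }}-sqs-configurator"
  annotations:
    "helm.sh/hook": pre-install               # run before the app resources
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: create-queue
          image: amazon/aws-cli:2.2.0
          env:                                 # LocalStack accepts dummy credentials
            - { name: AWS_ACCESS_KEY_ID, value: test }
            - { name: AWS_SECRET_ACCESS_KEY, value: test }
            - { name: AWS_DEFAULT_REGION, value: ap-south-1 }
          args:
            - sqs
            - create-queue
            - --queue-name
            - "capital-cards-{{ .Values.devstackLabel }}"
            - --endpoint-url
            - http://localstack:4566           # assumed LocalStack edge endpoint in-cluster
```

Because the hook only talks to LocalStack, no real AWS resources, and no Terraform, are involved in bringing up the ephemeral copy. Now, on to the routing use case.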
So we have three developers working on these microservices: developer one is working only on app one, developer two is working on app two, and developer three is working on a feature that spans app one and app two. By running the helmfile command, each developer has configured their infrastructure. Now let's see how the request routing happens. When developer one passes the header dev-one in the request, the ingress route in front of app one makes sure the request is routed to the dev-one instance of app one; the request then propagates to app two, and since there is no dev-one label present in that configuration, it is routed to the default shared infrastructure. Taking the case of the second developer, on passing the header dev-two, the request flows into the shared stage instance of app one, as there is no dev-two instance running there, and it is then propagated to the dev-two instance of app two, because that configuration exists. And in the last case, when the header feature-one is passed, the request is routed to the feature-one instance of app one and then to the feature-one instance of app two, since both are present. This is how request routing now happens across the microservices, and it lets us run only the subset of microservices required for the feature at hand; all other routing happens transparently to the stage infrastructure that is already up and running, driven by the header-based configuration.

So let's move on to a practical implementation of this solution and see how it works in the real world: the demo. Let's open our terminal and run the command that creates the ephemeral infrastructure. helmfile sync is the command, and it has now started creating ephemeral resources for a couple of services; let's look at what they are while the deployment is happening. We have a file called helmfile.yaml which defines the service fleet. In this case we have used two services: dashboard, which is the front end of Razorpay, and api, which is the backend of Razorpay and is written in PHP. The image value here is the commit ID of the branch I'm working on, which is also the Docker image tag; the image tag we use at Razorpay is the commit hash of the branch being worked on. Also take note of one thing called the DevStack label. This particular label holds the key to the ephemeral infrastructure: all of our ephemeral resources are created with this label as a suffix. Now that helmfile has completed its job, it has created ephemeral resources for api and dashboard, and it tells us how we can access them: either by passing a header, relying on header propagation, or by using the preview URL directly. The same thing works for the other microservices as well. Let's quickly check which Kubernetes resources it has created. When I run kubectl get pods -n api, we can see there is a srini-demo pod running, and if we check the dashboard resources, there is a srini-demo pod running there as well.
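Under the hood, the header-based access just described relies on Traefik 2.x IngressRoute rules. Here is a rough sketch of what such a rule pair could look like; the hostname, the header name X-Devstack-Label and the service names are illustrative assumptions, not the actual configuration:

```yaml
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: api-web
spec:
  entryPoints:
    - web
  routes:
    # Requests carrying the developer's DevStack label go to the ephemeral copy.
    - match: Host(`api.example.com`) && Headers(`X-Devstack-Label`, `srini-demo`)
      kind: Rule
      services:
        - name: api-srini-demo
          port: 80
    # Everything else falls through to the shared stage deployment.
    - match: Host(`api.example.com`)
      kind: Rule
      services:
        - name: api
          port: 80
```

Traefik prefers the more specific rule when the header is present, so a plain request without the header keeps hitting the shared service.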
Also take note of all the other labels running alongside; this is the live infrastructure we are running on, and all the other developers are working in parallel while we are demoing. This is the standard dashboard of Razorpay. Let's see how we can access our specific resources. We'll use a browser plugin called ModHeader, which injects the header into the requests that go through your browser; you can also add the header in Postman, or add it to a curl request. Let me now refresh the page to see whether the changes from my branch are reflected. As you can see, the color has changed to green, which is the feature we are demoing; if we look at the code of that same branch, the color is green. So at this point we have created ephemeral infrastructure, separate from all the other developers, and we can access it just by passing a header.

Now let's see how we can iterate on the feature from the local code. We just change the content in the file, go to our terminal, and run another command, devspace dev. devspace dev takes care of syncing the local code into the remote cluster. In this case the code in my local api directory is being synced into /app inside the container, and the service PHP file that I just changed shows up as synced. We can quickly validate that by refreshing the page and checking whether the color has changed, and here you go, the color has now changed to blue. Now say for some reason I don't like this color; let's change it to another one and see how it reflects. I change the code in my IDE, go to my browser and refresh; DevSpace in the background has synced the file automatically, and the change is reflected in my browser. This is how simple the workflow becomes with the adoption of DevStack. This particular example was for an interpreted language, PHP; let's jump into another example and see how a compiled language like Go works.

For the demo we have already created another service called capital cards, which deals with the cards product at Razorpay, and we'll walk through the Helm templates before going into the demo. There are multiple deployments present, which are your Kubernetes resources. These are the hooks we mentioned, which run as either post-install or pre-install: there is a configurator which updates the ingress route, there is a preview URL hook, there is a secret updater hook which updates the secrets, and there is also an SQS configurator. This particular application is an asynchronous one, with a web worker and SQS queues, and we are configuring the SQS queues dynamically; there is also a service, which is a Kubernetes resource. Now that we have created the deployments, let's validate that: I run kubectl get pods, and the web and worker pods are there. You can also go and see the SQS resources that were created; this is the LocalStack UI, this is the particular SQS queue created for capital cards, and this is the URL at which it can be accessed. So we have created the Kubernetes resources as well as the ephemeral infrastructure resources. Let's run the devspace command to sync our local code into the remote cluster, and while the sync is happening, let's walk through the devspace.yaml file.
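As a point of reference for the walkthrough that follows, a minimal devspace.yaml for this kind of sync could look roughly like the sketch below. The config version, label selector and paths are assumptions and the exact schema depends on the DevSpace release in use; the part that swaps the pod's image for the generic dev container is omitted here.

```yaml
version: v1beta11
dev:
  sync:
    # Sync the local source tree into the running dev container.
    - labelSelector:
        app: capital-cards
      localSubPath: .
      containerPath: /app
      excludePaths:        # skip paths the binary doesn't need, to speed up syncing
        - .git/
        - vendor/
        - docs/
```

Running devspace dev with a config like this starts the file sync; port forwarding and log streaming can be configured in the same file.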
So it has multiple sections: it replaces the pods with the DevStack Docker container that we built, which is the CompileDaemon one; it syncs our local code into the remote container; it excludes a few paths, which optimizes the syncing, since those files are not required for the binary to work; and it has a logs section, so we can see the logs of the container we have synced with. Let's check the status of the devspace command, and yes, it has started running. It has started to sync, and at this point it has synced the code, it is running the build command, which is simply go build -o, and then it runs the binary that was built; these are the logs it prints to make debugging easier. Let's access this by making a curl request, and as you can see, the request is getting through because I have used a preview URL. Now let's make a change to the code, which is Golang code. What we'll do is add loggers to print the request logging: there are two loggers I had commented out, I'll just uncomment them and save the file. The workflow now is that DevSpace, watching for file changes, has synced the helpers file we see here, the same process of building and running the new binary has happened, and we can validate that by hitting a request. And here you go, the changes I made are reflected dynamically in a compiled language as well.

Moving on to the additional features supported by DevStack. The first one is the preview URL; as seen in the demo, this creates a specific URL per DevStack label for an application, using the ingress route, and the preview URL configuration is shown on the slide. The next one is ephemeral databases. We currently support three different ways of configuring the database. One is a local database, where developers run their DB instance locally and we reverse proxy the requests into the local system. The second is the ephemeral database, where developers can spin up ephemeral databases based on pre-seeded data, make sure the schema is in sync with the pre-prod environment, and also have the dev migrations run. The last one is the persistent database, where developers can opt to use a pre-existing database from stage or beta or pre-prod, which is controlled by the regular data ops flow. The workflow for ephemeral databases is that we copy the stable stage data into S3, which acts as the base; a data container is generated from it, migrations are run on top, and that gives an ephemeral database isolated from everyone else's.

So, are we really looking at the cost? A primary goal for us was a cost optimal solution. A janitor-style cleanup tool takes care of all the cleanups: all of our ephemeral resources are tagged with a TTL of six hours by default, developers can override it based on their requirements, and the janitor makes sure the resources are cleaned up when the TTL expires. To handle the ups and downs, the upscaling and downscaling, we use the cluster autoscaler and spot nodes. We also monitor all of these resources via Prometheus, where we attach labels to every deployment, and all of them are queried in a Grafana dashboard.
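To show how that six-hour TTL could be expressed, here is an illustrative fragment, assuming a kube-janitor-style cleanup tool that honors a TTL annotation; the deployment name, the label key and even the annotation key are assumptions for the sketch, not the convention actually used internally:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-srini-demo
  labels:
    devstack-label: srini-demo   # lets Prometheus/Grafana group ephemeral resources
  annotations:
    janitor/ttl: "6h"            # cleaned up automatically once the TTL expires
spec:
  replicas: 1
  selector:
    matchLabels:
      app: api-srini-demo
  template:
    metadata:
      labels:
        app: api-srini-demo
    spec:
      containers:
        - name: app
          image: registry.example.com/api:1a2b3c4d
```

A developer who needs the environment for a longer partner integration can simply set a larger TTL on their own releases.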
So this is how the solution looks from the angle of our tools, with different components: the dev orchestrator, the cluster manager, infra monitoring and routing. And this particular diagram explains what development looks like pre- and post-DevStack, where a sequential development workflow is replaced by multiple parallel ecosystems working with local sync. What is the impact on productivity? We have seen a 10 to 15x reduction in the time to take a feature live: per feature, the time has come down from five hours to 30 minutes, and per iteration it has come down to two minutes from the earlier 20 to 30 minutes spent on container builds and the rest of the regular process. These are some of the features we are planning to build next, and given that we emphasize open source, we wanted to give back to the community as well, which is why we are recording all of the details and a reference implementation in this open source repo. Feel free to raise issues or contribute back in order to make developers' lives easier. Thank you, and any questions?