Hi, my name is Venkat and I work at Razorpay as a senior architect, and I have with me my colleague Srinidhi, who works as a senior software engineer at Razorpay. Today we would love to talk about improving the developer experience and how we built a cloud-native dev stack for hundreds of engineers.

A quick overview of what Razorpay does. Razorpay enables frictionless payment, banking and lending experiences for different classes of merchants of various scales and sizes. Today, we process billions of transactions for millions of merchants across the country. Razorpay has been at the forefront of innovation over the last several years in transforming the financial ecosystem of the country.

A quick motivation for why we embarked upon this journey: over the last four years, our engineering headcount has grown by 10x. In that process, we have scaled up our teams into full-fledged pods and BUs, 4 BUs and 30-plus pods. We embarked upon a microservices journey a couple of years back, and today we have about 100-plus microservices; in the last two years alone, we added roughly 50 of them. We've had three company acquisitions along the way, and that has led to a polyglot stack of languages like PHP, Go, Python, Java, etc. Overall, we do roughly 1,500 deployments per month.

Now, a quick look at what our CI/CD practice looks like. This is not really very different from the way many companies operate, but it gives perspective and builds up the motivation for the kind of problems we're dealing with. At a very high level, developers commit code into GitHub. We heavily use GitHub Actions for CI. The GitHub Actions workflow builds Docker images, runs a variety of unit tests and pushes the images into our private Docker registry. As soon as these images are available, the developers typically start deploying that code with Spinnaker, an open-source CD platform created by Netflix. The deployment process is very similar across environments. In a pre-prod environment, there are a variety of integration tests, load tests if needed, and a bunch of other tests. Once the developer is satisfied with the test reports, they go on to deploy the code in a canary environment. Via Spinnaker's Kayenta, we evaluate a variety of canary threshold metrics; what that means is that we leverage our monitoring and tracing infrastructure to make sure the canary checks pass. Once the canary thresholds are met, the deployment is moved forward into production; in case they do not pass, the deployment is rolled back.

In general, we use AWS as the infrastructure where most of our code runs. We are heavy proponents of Kubernetes: we started our Kubernetes journey in late 2016, and we can probably say we were among the first companies in the country to run a full production-grade Kubernetes ecosystem. Today we run managed Kubernetes via EKS. Like I mentioned, because we are on Kubernetes, all of our packaging is done via Helm. Spinnaker handles CD, as mentioned. Our entire infrastructure is expressed as code via Terraform, and specifically for self-serve infrastructure provisioning, we have enabled Atlantis, where developers can come and provision any cloud resources they need.
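To make the CI half of this concrete, here is a minimal sketch of a GitHub Actions workflow of the kind described above; the job layout, test command, registry host and secret names are illustrative assumptions, not Razorpay's actual pipeline.

```yaml
# .github/workflows/ci.yml -- illustrative sketch only
name: ci
on:
  push:
    branches: ["**"]

jobs:
  build-test-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      # Run the unit tests before building the image (test command is an assumption)
      - name: Unit tests
        run: make test

      # Build the Docker image, tagged with the commit SHA, as mentioned later in the talk
      - name: Build image
        run: docker build -t registry.example.internal/api:${{ github.sha }} .

      # Push to the private registry; host and credentials are placeholders
      - name: Push image
        run: |
          echo "${{ secrets.REGISTRY_PASSWORD }}" | \
            docker login registry.example.internal -u "${{ secrets.REGISTRY_USER }}" --password-stdin
          docker push registry.example.internal/api:${{ github.sha }}
```

Once an image like this lands in the registry, Spinnaker picks it up for the pre-prod, canary and production stages described above.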
Again, GitHub Actions is what we use for all the continuous integration and Docker image building. With this in mind, you can see there is a tremendous amount of time developers spend on deployment just to roll out their features into production. Specifically, on the development-dependency side, because of the number of engineers and the number of microservices we're dealing with, development becomes a sequential process. What that means is: imagine there's a single service and a bunch of developers working on features for it. Each developer has to take time to deploy their code into a pre-prod environment, so the development process is largely sequential, and they have to coordinate with other developers so that their changes are not overwritten. The other part of the problem is the heavy dependency on AWS resources: most of the AWS resources are mocked in the local development environment, which leads to a lot of confidence issues when pushing the code to production. Testing, again: even a single line of code change requires going through the entire deployment process before the developer can start testing.

On the right-hand side is another set of problems with shared environments. What I mean by a shared environment is this: I, as a developer, am working on service A, while service A has a dependency on service B. Because my change only affects service A, I don't need to build service B. But while I'm testing my service, imagine there's a developer working on service B who changes the deployment of service B in the pre-prod environment; my tests start breaking. And how do you cost-effectively scale not just my own individual services, but also shared environments? Today that's quite difficult, unless a lot of cost is thrown at it. Another consequence of the shared environment is that it imposes a lot of constraints on being able to demo the product or features to the business or product counterparts: there can be demo restrictions, or things in the shared environment can change underneath you, and this leads to a lot of lost time. The third part of the problem is that, given the nature of the environment we operate in, we engage in a variety of integrations with third-party gateways, payment gateways, partners, etc. Many of these integrations can run for weeks, sometimes a month. Every such integration requires a separate service fleet of sorts, dedicated to that particular integration. How do you build these kinds of environments in a cost-effective fashion?

On top of all of this, there are also challenges in infra provisioning. Like I mentioned, all our infrastructure is encoded in Terraform, which requires developers to go through the cognitive overload of learning the Terraform DSL for provisioning even simple things like an SQL endpoint or an SNS endpoint. More specifically, because of the nature of the development process, debugging even a trivial one-line change sometimes requires developers to context-switch between the actual application, a logging platform and a distributed tracing platform.
More specifically, a lot of time is spent just building and deploying the images themselves, and depending on the complexity and nature of the unit tests, that time is significant. This means feature development is largely sequential and iterative; it cannot be done independently today. What all of this actually means is that while developer time could be used for much more productive and constructive things, today a lot of it is wasted just setting up deployments of the corresponding infrastructure, debugging code, and only then moving forward with feature development. Quite simply, we need to simplify the developer workflow and reduce the rollout time, the cycle time as they say, in an independent and simple fashion. With that, I would like to hand over to Srinidhi, my colleague, who will walk you through the journey of how we are solving these problems.

Thanks Venkat for the brief introduction. Talking about the goals of the project, we are primarily focused on reducing the cognitive load on developers, which is achieved by: one, having a streamlined workflow; two, having environmental consistency; and three, providing a faster feedback loop from local development. The following are the key decision factors on which the solution was built. The first is to rely completely on open source and have zero vendor lock-in. The next is to make sure the solution is Kubernetes-native, as our environments are on Kubernetes. The third important one is hassle-free onboarding, which means we wanted to onboard applications with minimal changes and without drastically changing the development or deployment lifecycle. And last but not least, we wanted a cost-effective solution, so that we don't burn cash to get this ephemeral infrastructure. Our solution is slightly opinionated, it is what fits our use cases best, and it is not a complete PaaS solution. The code name of this solution is DevStack, and this is Razorpay's journey towards a better development lifecycle.

So let's try to understand the solution by asking a series of questions and answers and coming to a logical conclusion before the demo. The first question is: how do we bypass the CI/CD loop for iterative development? The need is to directly expose the local code onto the cloud environment. As the first version of the solution, we went ahead with Telepresence, which follows a proxy-based approach. Telepresence has a client sitting on both the local system and the remote cluster and makes the connection happen through a tunnel. It took care of DNS resolution, volume mounting and networking. But there was a major drawback due to the tunneling approach: responses were slow, there were connectivity issues with the database and sometimes with the cluster itself, and along with VPN there were a lot of bottlenecks. So how do we avoid these network limitations, and how can we ship code to the remote container without relying on the network? We had to shift to a file-sync-based approach, where we used a tool called DevSpace. DevSpace syncs the local code into the remote container very efficiently, and it also does live reloading by restarting the container.
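As a flavour of what that file-sync configuration can look like, here is a minimal DevSpace-style sketch for a single service; the config version, label selector and paths are assumptions, and field names can differ between DevSpace releases.

```yaml
# devspace.yaml -- minimal sync-only sketch, assuming a DevSpace 5.x style schema
version: v1beta9
dev:
  sync:
    # Select the developer's ephemeral pod and keep local code mirrored into it
    - labelSelector:
        app: api
        devstack: srinidhi-demo
      localSubPath: ./            # local working copy
      containerPath: /app         # where the code lives inside the container
      excludePaths:               # skip files the running service does not need
        - .git/
        - vendor/
```

With something like this in place, `devspace dev` keeps the remote container in step with every local save instead of going through the full image build and deploy cycle.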
DevSpace also has port forwarding and log streaming, which provided scope for further features. While it all looked good, there was a hiccup, a limitation of sorts, due to the container restarts. The container restarts, which are governed by Kubernetes, were flaky at times and not completely reliable. We wanted something better and faster. So the next question is: how do we avoid these container restarts? To avoid them, we had to put a library inside the container that takes care of hot reloading. There are such libraries per language, like CompileDaemon for Golang, nodemon for Node.js and so on, which rely on watching for changed files. These are exactly the files that are synced by DevSpace; the library then rebuilds the binary and runs the new one, which is served by the container. This effectively avoids the container restart and thus doesn't break the flow. We could build a generic Docker container per language and use that container with the DevSpace command, to make onboarding seamless. These three questions helped us solve the problem of local sync: we have now short-circuited the feedback loop to just running a command.

The next question is: how do we orchestrate multiple services, and how can developers declaratively define and apply service dependencies? The solution for that is Helmfile, which is a wrapper on top of Helm. Helmfile helps us compose several charts together to create a comprehensive deployment artifact for anything from a single application to the entire infrastructure stack. We define a term called a service fleet here, which is the collection of microservices required by a developer for his or her workflow. Helmfile works seamlessly with the existing Helm packages; all of our applications already had Helm charts, so we didn't need to write anything extra. We just had to wrap all of them in a single YAML file, as shown on the right side of the screen. Helmfile then took care of orchestrating all the Kubernetes resources, which solved a major part of the problem for resources like deployments, services, ingresses, jobs, cron jobs and whatnot. But the application is not whole without its other requirements, for example SQS queues or databases, AWS resources that are not on Kubernetes at all.

So the next question is: how do we provision ephemeral infrastructure resources? We relied on Helm hooks for this provisioning. Helm hooks provide a plug-and-play model to handle the dynamics of an application's auxiliary requirements. What we mean by that is we have written two or three Helm hooks per requirement, which can be plugged into an application based on its needs. For example, an application using SQS queues can plug in an SQS configurator as a hook, whereas an application using Kafka can plug in a Kafka configurator. Similarly, the plug-and-play model fits all the other resources. And in order to avoid maintaining real AWS resources, which was done via Terraform, we used LocalStack, which provides a framework for mocking AWS components.
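To make the service fleet idea concrete, here is a minimal helmfile.yaml sketch of the kind of wrapper described above; the chart paths, release names, devstack label value and image tag are illustrative assumptions rather than the exact Razorpay manifest.

```yaml
# helmfile.yaml -- illustrative sketch of a two-service fleet
releases:
  - name: api-srinidhi-demo          # one release per developer, suffixed with the devstack label
    namespace: api
    chart: ./charts/api              # the service's existing Helm chart (path is an assumption)
    values:
      - image:
          tag: "3f9c2ab"             # image tag = commit hash of the branch being worked on
        devstackLabel: srinidhi-demo # label that keys every ephemeral resource

  - name: dashboard-srinidhi-demo
    namespace: dashboard
    chart: ./charts/dashboard
    values:
      - image:
          tag: "3f9c2ab"
        devstackLabel: srinidhi-demo
```

A single `helmfile sync` against a file like this brings up the whole fleet, which is exactly what the demo later runs.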
So this is what a Helmfile workflow looks like: on running the command, the charts are verified, a bunch of pre-install hooks run to make sure the auxiliary requirements of the application are up, then the charts, the Kubernetes resources, are loaded and deployed into the Kubernetes cluster, and finally some post-install hooks run, which can include the ingress route configuration or validation of the generated manifests. This makes the ephemeral service, the ephemeral infrastructure per developer, available by running just one command. That is how we solved the problem of a streamlined workflow for developers to bring up their ephemeral service fleet.

Now, with the ephemeral infrastructure via Helmfile and local sync via DevSpace, there is one piece in the whole puzzle left to be sorted: how do we make sure traffic is routed to the right user's service? For that we use header-based routing. Our ecosystem, our Kubernetes clusters, had been using Traefik for ingress, and we upgraded to Traefik 2.0, which supports header-based routing out of the box. The Traefik IngressRoute configuration has multiple rules to drive the dynamic routing. For example, on the right-hand side of the presentation, we have two services, api-web and api-web-srinidhi; based on the header, Traefik makes sure that a request carrying the header is sent to the api-web-srinidhi service, which in turn routes the request into the api-web-srinidhi deployment, or pod. This is how we enabled header-based routing to solve the routing problem.

The next set of questions is: how do we make sure the upstream services are routed properly as well? We use header propagation there, piggybacking on OpenTracing, which propagates headers by default. We rely on that, so at every hop Traefik reads the header and routes the request to the appropriate service.

Let's walk through a use case of how that routing works. Take two applications, app1 and app2, and assume app1 is a gateway service and app2 is the actual service that processes a request and returns a response. We have three developers working on these microservices: developer one is working only on app1, developer two is working on app2, and developer three is working on a feature that spans both app1 and app2. By running the helmfile command, the developers have provisioned their infrastructure; now let's see how request routing happens. When developer one passes the header dev1 in the request, the ingress route in front of app1 makes sure the request is routed to the dev1 instance of app1; the request then propagates to app2, and since the dev1 label is not present in app2's configuration, it is routed to the default shared infrastructure.
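Before continuing with the other two developers, here is a sketch of what the header-based IngressRoute behind this could look like in Traefik 2.x; the header name (X-Devstack), hostname and service names are illustrative assumptions.

```yaml
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: api-web
  namespace: api
spec:
  entryPoints:
    - web
  routes:
    # Requests carrying the developer's header go to the ephemeral instance
    - match: Host(`api.dev.example.internal`) && Headers(`X-Devstack`, `srinidhi-demo`)
      kind: Rule
      services:
        - name: api-web-srinidhi-demo
          port: 80
    # Everything else falls back to the shared stage deployment
    - match: Host(`api.dev.example.internal`)
      kind: Rule
      services:
        - name: api-web
          port: 80
```

Traefik gives the longer, more specific rule the higher priority, so tagged requests peel off to the developer's pod while untagged traffic keeps hitting the shared service.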
Taking the use case of the second developer: on passing the header dev2 in the request, the request flows into the shared stage instance of application one, since there is no dev2 instance running there, and then propagates to the dev2 instance of app2, because that configuration is present. In the last case, when the header feature1 is passed, the request is routed to the feature1 instance of app1 and then to the feature1 instance of app2, as both are present. This is how request routing now happens across the microservices, and it enables us to run only the subset of microservices required for the feature; all the other routing can happen smartly to the shared stage infrastructure that is already running, purely through the header-based configuration.

Let's move on to the practical implementation of this solution and see how it works in the real world: the demo. Let's open our terminal and run the command that creates the ephemeral infrastructure; helmfile sync is the command. The command has now started creating ephemeral resources for a couple of services; let's look at what they are while the deployment is happening. We have a file called helmfile.yaml which defines the service fleet. In this case we have two services: dashboard, which is the front end of Razorpay, and api, which is the back end of Razorpay, written in PHP. The image tag here is the commit id of the branch I am working on; the image tag we use at Razorpay is the commit hash of the branch we are working on. Also take note of one thing called the devstack label. This particular label holds the key to the ephemeral infrastructure: all of our ephemeral resources are created with this label as a suffix.

Now that helmfile has completed its job, it has created the ephemeral resources for api and dashboard, and it prints how we can access them: either by passing a header, relying on header propagation, or by using the preview URL directly. The same works for the other microservices as well. Let's quickly check what Kubernetes resources it has created. When I do kubectl get pods -n api, we can see there is a srinidhi-demo pod running, and if we check the dashboard resources, there is a srinidhi-demo pod there as well. Also take note of all the other developers' labels next to it; this is the live infrastructure we are running on, and all the other developers are working in parallel while we are demoing.

This is the standard dashboard of Razorpay. Let's see how we can access our specific resources. We'll use a browser plugin called ModHeader, which injects the header into the requests that go through the browser; you could also use Postman or a curl request to add the header. Let me now refresh the page to see whether the changes from my branch are reflected, and as you can see, the color has changed to green, which is the feature we are demoing. Let's go and look at the code of the same branch, and indeed the color is green.
So at this point we have created ephemeral infrastructure separate from all the other developers, and we can access it just by sending a header. Now let's see how we can iterate on the feature from local code. We change the content in the file, go to our terminal, and run another command, devspace dev. devspace dev takes care of syncing the local code into the remote cluster. In this case the local code of the api application is being synced into the app directory in the remote container, and the service PHP file which I just changed is shown as synced. We can quickly validate that by refreshing the page and checking whether the color has changed, and here you go, the color has now changed to blue. Say for some reason I don't like this color; let's go ahead and change it to another one and see how it reflects. I change the code in my IDE, go to my browser and refresh; DevSpace in the background has synced the file automatically and the change is reflected in my browser. This is how simple the workflow becomes with the adoption of DevStack.

This particular example was for an interpreted language like PHP; let's jump into another example to see how a compiled language like Go works. For the demo we have already created another service called capital-cards, which deals with the cards product of Razorpay, and we'll walk through the Helm templates before going into the demo. There are multiple deployments present, which are plain Kubernetes resources. Then there are the hooks we mentioned, which run as either pre-install or post-install: a configurator which updates the ingress route, a preview URL hook, a secret updater hook which updates the secrets, and an SQS configurator. This particular application is an asynchronous one with a web worker and SQS queues, so we configure the SQS queues dynamically, and there is also a Kubernetes service resource.

Now that the deployments have been created, let's validate that. On doing kubectl get pods, we can see the web and the worker pods have been created. We can also go ahead and see the SQS resources that were created; this is the LocalStack UI, this is the SQS queue created for capital-cards, and this is the URL at which it can be accessed. So we have created both the Kubernetes resources and the ephemeral infrastructure resources. Let's run the devspace command to sync our local code into the remote cluster, and while the sync is happening, let's walk through the devspace.yaml file. It has multiple sections: it replaces the pod's container with the DevStack Docker image we built, which is the CompileDaemon-based one; it syncs our local code into the remote container; it excludes a few paths to optimize the syncing, the files that are not required for the binary to work; and it has a logs section where we can see the logs of the container we are synced with.
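To give a flavour of what such a hook can look like, here is a minimal sketch of a pre-install Helm hook Job that creates an SQS queue against LocalStack; the image, queue name, LocalStack address and region are illustrative assumptions, and this is not necessarily how the actual configurator is implemented.

```yaml
# templates/hooks/sqs-configurator.yaml -- illustrative sketch of an SQS configurator hook
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ .Release.Name }}-sqs-configurator
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade       # run before the application resources
    "helm.sh/hook-delete-policy": hook-succeeded  # remove the Job once it has finished
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: create-queue
          image: amazon/aws-cli:2.13.0
          env:
            # LocalStack accepts any credentials; these are placeholders
            - name: AWS_ACCESS_KEY_ID
              value: test
            - name: AWS_SECRET_ACCESS_KEY
              value: test
            - name: AWS_DEFAULT_REGION
              value: ap-south-1
          # Point the CLI at the LocalStack edge endpoint instead of real AWS
          args:
            - --endpoint-url=http://localstack.localstack.svc.cluster.local:4566
            - sqs
            - create-queue
            - --queue-name={{ .Release.Name }}-cards-jobs
```

Because the queue lives in LocalStack, nothing has to be provisioned in real AWS via Terraform for this ephemeral copy of the service.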
Let's look at the status of the devspace command: yes, it has started running, it has started to sync, which is the point where it syncs the code, then it runs the build command, which is simply go build -o, then it runs the binary that has been built, and these are the logs it prints to make debugging easier. Let's access this by making a curl request, and as you can see, the request is reaching the service because I have used a preview URL. Now let's make a change to the Go code: we'll add loggers to print the incoming request. There are two loggers I had commented out; I'll just uncomment them and save the file. The workflow now is that DevSpace, watching these file changes, has synced the helpers file, the same process of building and running the new binary has happened, and let's validate that by hitting a request. And here you go, the changes are reflected dynamically for a compiled language as well.

Moving on to the additional features supported by DevStack, the first one is the preview URL. As seen in the demo, this creates a specific URL per DevStack label for an application by using the ingress route, and the preview URL configuration is shown on the slide. The next one is ephemeral databases. We currently support three different ways of configuring the database: one, a local database, where developers run their DB instance locally and we reverse-proxy the requests into the local system; two, an ephemeral database, where developers can spawn databases based on seeded data, with the schema synced from the pre-prod environment and the dev migrations run on top; and three, a persistent database, where developers can opt to use a pre-existing stage, beta or pre-prod database, which is governed by the regular data-ops flow. The workflow for an ephemeral database is that we copy the stable stage data into S3, which acts as a base; a data container is generated from it, migrations are run on top, and that gives an ephemeral database isolated from everyone else's.

So are we really looking into the cost? A primary goal for us is a cost-optimal solution. kube-janitor is the tool that takes care of all the cleanups: all of our ephemeral resources are tagged with a TTL of six hours by default, which developers can override based on their requirements, and the janitor makes sure the resources are cleaned up when the TTL expires. To handle the ups and downs, the upscaling and downscaling, we use the cluster autoscaler and spot nodes. We also monitor all of these resources via Prometheus, where we attach labels to every deployment, and everything is queried in a Grafana dashboard. This is what the solution looks like from the angle of tools, with the different components of dev orchestrator, cluster manager, infra monitoring and routing, and this particular diagram shows the before and after of DevStack, where a sequential development workflow is replaced by many developers working in parallel with local sync. So what is the impact on developer productivity?
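As a sketch of how the TTL-based cleanup and the Prometheus labelling can fit together, here is an illustrative fragment of an ephemeral Deployment; the janitor/ttl annotation assumes a kube-janitor-style cleaner, and the label names and image are assumptions rather than the exact ones used in DevStack.

```yaml
# Deployment fragment for an ephemeral instance -- illustrative only
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-web-srinidhi-demo
  labels:
    app: api-web
    devstack: srinidhi-demo     # per-developer label used for routing and Grafana queries
  annotations:
    janitor/ttl: "6h"           # default 6h TTL, overridable by the developer
spec:
  replicas: 1
  selector:
    matchLabels:
      app: api-web
      devstack: srinidhi-demo
  template:
    metadata:
      labels:
        app: api-web
        devstack: srinidhi-demo
    spec:
      containers:
        - name: api-web
          image: registry.example.internal/api-web:3f9c2ab
```

When the TTL lapses the janitor removes the Deployment, and the cluster autoscaler scales the spot node groups back down.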
We have seen a 10 to 15x reduction in the time to take a feature live: per feature, the time has gone down from about 5 hours to 30 minutes, and per iteration it has come down to around 2 minutes from the earlier 20 to 30 minutes spent on container builds and the rest of the regular process. These are some of the features we are planning to build next, and given that we emphasise open source, we want to give back to the community as well, which is why we are recording all of the details and a reference implementation in this open-source repository; feel free to raise issues or contribute back to make developers' lives easier. Thank you, and any questions?