Hello all, thank you for coming. I'm really excited to be here, and we both are. Today we're going to share our personal journey of testing in production, and how it led us to API testing without writing any test cases or data mocks, which used to be a really painful task.

First, a quick introduction. We are both maintainers of Keploy. We've been involved in open source for a while, starting with programs like GSoC, GCI, and Outreachy as students and later as mentors, and we really love the ecosystem. Previously we led data engineering and Office of the CDO teams at Indian startups, Pari and Lenskart, which are logistics and e-commerce companies.

In those roles, the key challenge was strict timelines. We needed to experiment with a lot of things: build and launch within two weeks, iterating on code and releasing every day. Because of those timelines, we had very limited time to test. We would cover maybe a couple of happy developer flows, and as you might have guessed, that led to regressions. Fixing those regressions then changed the deliverables we had committed to, and still the testing was never enough.

All we needed to reduce the bugs was three things. One, functional testing; the work was experimental, so non-functional testing was not really important. Two, tooling that could create and update test cases so easily that we didn't have to spend time on them. And three, something that could automatically orchestrate the testing infrastructure. That's all we needed, nothing much.

So we explored and tried a few solutions. I'll walk through each of them and the limitations we ran into, and then show what we ended up building, along with a demo.

As usual, we started by writing automation test suites, like everybody does. The challenge was that you have to write all the test cases and automation scripts, and they are very brittle: we were changing the application code base so fast that the existing test cases constantly needed maintenance. That led to the frustration of spending more than twice the development time just writing or maintaining tests. It also meant writing synthetic data mocks, which were never really close to real-world scenarios. And our team shared test environments, so if somebody changed anything in a test environment, a database, a configuration, anything, the rest of the team's test automation suites started breaking. So again, that was a brittle approach that didn't work for us.

Then somebody told us, "Hey, why don't you test in production?" And our reaction was, "Dude, come on, you're crazy. Why would you do that?" But when we researched it more, it made sense: your application is ultimately going to be deployed and released in production, and ideally you want all your test environments to be just like production, or the best-effort approximation of production you can create. So if you can test in production without any side effects, that's the best-case scenario for your application. With that in mind, we explored some ideas.
Let's take an example: shadow testing. You have an application serving user traffic, and a new version, say application v2. You mirror the traffic to it, and in shadow testing you compare the responses of the currently deployed version and the new version to check that everything matches, works fine, and is compatible. That sounds like a good, easy approach, but it really only works for stateless applications, something like audio streaming, not for stateful ones. Ours was a stateful application, so we couldn't take this approach as-is.

In a stateful application, your service is talking to multiple dependencies: Twilio, Stripe, and, especially in a microservice architecture, all the internal and external calls your application makes. So the question was: what does application v2 connect to as its dependencies? We couldn't just point it at the production dependencies.

Then we found that some companies actually do exactly that: they connect application v2 to the production database. But the catch is that they are aware of it and they guarantee that their application behaves idempotently. Idempotency means that if you perform the same operation multiple times, the application's behavior does not change; it keeps responding the same way as the first time. That's fine if you can guarantee it, but our application was not idempotent, and we knew there would be side effects.

So we moved on to introducing a proxy in between that filters the read APIs. If there's a GET or any other read API, it goes through to the production database, you can compare the responses easily, and you can test those calls. But you cannot test writes. That was the limitation of this approach: no way to test the write APIs or mutations.

The next approach was to introduce a replica of the production database that the stable application talks to, and point v2 at that. This also sounded like a very good approach initially, but when we implemented it, the real challenges showed up. First, it was a huge operational effort to set up the whole pipeline. Beyond that, a complete replica of the database is expensive. And even beyond that, there was replication lag. What I mean is: say a write API goes through v1 and writes to the production database, and while the database is still syncing that change to the replica, the same API call is being replayed against v2. A few things can happen. In the happy case the timing works out and the test goes through. But if the replica is out of sync with the replayed call, you can end up with a corrupted replica database and wrong test results. So this also did not work well for write calls or mutations.

At that point we asked: why do we need to mirror traffic in real time at all? How about testing later? So we recorded the traffic from production and replayed it in a non-production environment. Instead of maintaining a continuously syncing replica, we created a snapshot database in the non-prod environment and replayed the traffic there. First we captured the user traffic from production, then we replayed it after setting up that shadow, or snapshot, database.
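To make the capture side of that concrete, here's a minimal sketch of an HTTP middleware that records every request and response to a file for later replay. This is our own illustration, with assumed file names and fields, not how any particular tool implements it.

```go
package main

import (
	"bytes"
	"encoding/json"
	"io"
	"log"
	"net/http"
	"os"
)

// capturedCall is one recorded request/response pair that can be
// replayed later against a non-production environment.
type capturedCall struct {
	Method   string `json:"method"`
	Path     string `json:"path"`
	ReqBody  string `json:"req_body"`
	Status   int    `json:"status"`
	RespBody string `json:"resp_body"`
}

// respRecorder tees the response so we can store what was sent back.
type respRecorder struct {
	http.ResponseWriter
	status int
	body   bytes.Buffer
}

func (r *respRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

func (r *respRecorder) Write(b []byte) (int, error) {
	r.body.Write(b)
	return r.ResponseWriter.Write(b)
}

// recorder wraps a handler and appends every call to a JSON-lines file.
func recorder(next http.Handler, out *os.File) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		reqBody, _ := io.ReadAll(r.Body)
		r.Body = io.NopCloser(bytes.NewReader(reqBody)) // restore the body for the real handler

		rec := &respRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)

		_ = json.NewEncoder(out).Encode(capturedCall{
			Method:   r.Method,
			Path:     r.URL.Path,
			ReqBody:  string(reqBody),
			Status:   rec.status,
			RespBody: rec.body.String(),
		})
	})
}

func main() {
	out, err := os.Create("traffic.jsonl") // assumed capture file
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte(`{"ok":true}`)) // stand-in for the real application
	})
	log.Fatal(http.ListenAndServe(":8080", recorder(mux, out)))
}
```

A replay harness would then read traffic.jsonl, re-issue each call against the non-prod environment, and compare the responses.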
This too was a huge operational effort: you first have to record, then set up those snapshot DBs, which are expensive, and then keep updating the snapshots on a schedule, so it stays brittle. And on top of that you need to write the logic to compare the responses. It was a sound approach, but expensive and a lot of effort.

Let me summarize where we were. The upsides of record-replay and shadow testing are that it's a low-code approach: you don't really have to write test cases. You can easily achieve high coverage, because you're capturing real user traffic, so there are many different flows being captured and replayed, which increases the coverage of the code base your API tests exercise. And sometimes you discover unexpected user flows purely through real traffic, flows the developer never explicitly coded for but that exist in the real world. The downsides, as you may have figured out, are that dependency state is hard to manage. The approach is good for load or stress testing, but not really for functional testing, because if an API breaks, a large number of the captured calls fail and you have to go into each one and debug it. That takes a lot of time and causes a lot of frustration, so practically it's very time-consuming. And of course, handling writes is always tricky in these setups.

So that's where we were: record-replay against a snapshot DB, from a production environment into a non-prod environment. And then we thought: if we're capturing and replaying the API requests and responses, why not do the same for the database queries? By that I mean creating a virtual database, which is just the database query requests and responses, not the complete database. We'll deep-dive into this in a bit. This approach has downsides too: you have to add support for each dependency your application talks to. For example, if your application talks to MongoDB, you need support for mocking the MongoDB queries, or for capturing and replaying them. There are different ways to do that, at the SDK level or with an agent at the network proxy, which we'll get to. It can also become brittle if your API schema changes completely, because then you have to re-record the user traffic and replay it. But the upsides were better: the complete database no longer needs to be replicated into another environment, and it's not expensive, because we're storing only the query data. With that, I'll pass it to Shubham to take it forward.

Thank you, Neha. So essentially what we just discussed is that we are going to virtualize the dependencies, basically virtualize the infrastructure around the application. To give an example, say we have an application that returns which sports a particular user plays. The user is Thompson, and the sports are Cricket, Volleyball, Carrom, and Boxing. The application talks to MongoDB, which has all the relevant data. Typically, if I were doing record-replay into a different environment, I would capture the request and run it again in my test environment.
And this time, maybe user Thompson isn't there, the state isn't the same, or maybe Thompson likes different sports now. So the problem we're solving is: how do we ensure that the exact state is consistent with the test case we captured?

Same example, but with dependency mocking: instead of maintaining a test database, we maintain a mock. When we capture the get-games request, we also capture the queries sent to MongoDB and the responses we got back, and we package them along with the test case. Then when the same request comes in while we're testing, we simply return the same response we recorded for it. So it is mocking, but we're really replaying the exact database responses we captured, and everything stays consistent.

Once we were here, the next obvious question was: should we build an SDK for this, or an agent? By agent I mean something like a proxy: it could be an Envoy filter, or a network proxy installed on something like a Kubernetes cluster. We went through the pros and cons of both.

With an SDK, it's easy to map requests to dependencies. Let me expand on that a bit. Say we've captured ten thousand, or a million, requests and we replay them. When there's a legitimate failure, there are probably a thousand occurrences of that same API endpoint in the recording, so a lot of them fail at once; we see the same thing while debugging production applications. And when we're making a legitimate change, not a bug, it's very hard to go and update each one of them. Inside an SDK it's easier to map a request to its actual dependency calls, because we're in the application runtime and know what's going on. At the network layer it becomes very difficult. There is work being done in the OpenTelemetry SDKs to add metadata to SQL queries and similar traffic, but it's not a universal standard yet, so it's hard to map exactly which database queries or which API calls belong to which request. For API calls it's far easier if you already use OpenTelemetry, but not for, say, the Redis protocol or the MySQL protocol; there are no guarantees there. Code-level and IDE integration also become easier with an SDK: if you integrate with the testing library, everything the testing library supports is automatically supported. And you get access to the application context and runtime, which helps you debug and understand what's going on inside the application.

But the agent has significant upsides as well. You don't have to make code changes, because it runs at the network layer, whereas with the SDK there are at least some steps involved in integrating it into the code base. It's faster to deploy and adopt, because there are fewer things to change and fewer things to break. And it's lower development overhead for whoever builds the agent or SDK: with an SDK we'd have to write it for different languages and different frameworks, whereas an agent sits at the network layer, so it's independent of the language, and we would instead end up implementing the different protocols: MySQL, HTTP, and so on.
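Before getting into what we picked, here is a rough sketch of what "packaging the dependency calls along with the test case" could look like in practice. The struct and field names are our own illustration and an assumption, not Keploy's actual storage format.

```go
package capture

import "time"

// HTTPRequest and HTTPResponse hold the user-facing call that was captured.
type HTTPRequest struct {
	Method  string
	URL     string
	Headers map[string]string
	Body    []byte
}

type HTTPResponse struct {
	StatusCode int
	Headers    map[string]string
	Body       []byte
}

// DependencyCall is one outgoing call the application made while serving the
// request, for example a MongoDB query or a Stripe API call. During replay
// the recorded Response is returned instead of hitting the real dependency.
type DependencyCall struct {
	Kind     string // "mongodb", "postgres", "http", ...
	Request  []byte // serialized query or outgoing request
	Response []byte // recorded reply to play back
}

// TestCase bundles the captured request, the observed response, and every
// dependency interaction, so the whole thing can be replayed without the
// real database or external services.
type TestCase struct {
	ID          string
	CapturedAt  time.Time
	Request     HTTPRequest
	Response    HTTPResponse
	Mocks       []DependencyCall
	NoisyFields []string // e.g. "body.ts" for values that change on every run
}
```

During replay, a test runner would feed the recorded Mocks back to the application in order instead of letting it reach the real database.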
So this is what we ended up going with: the SDK approach, for now. We're also working on an agent, which I'll cover in the future scope. With the SDK, capture can happen in production or anywhere, in fact. Many of our users run the SDK locally, on their laptops while developing, to capture the requests they're making, along with all the dependency calls, and use those as test cases. The idea is simple: you integrate the SDK into your code, you perform a bunch of requests, either locally on your laptop or in your production environment, and they get recorded and bundled along with those dependency calls. Then you can replay them anytime, easily, again either locally or as part of a CI pipeline.

Now I'll quickly show you a demo of what we ended up building. Keploy is open source, so it's easy to get started. By the way, we use Keploy to test Keploy. Since we started with the Go SDK, and Keploy itself is written in Go, we added support for Go test. We use GitHub Actions: since the project is on GitHub, we essentially just run the go test command, Keploy runs automatically as part of the Go testing suite, and all of the coverage gets uploaded to the code coverage tool. That's what gives us our code coverage. The same thing works in an IDE, so we don't have to create any new integrations.

To get started, we have quick starts and examples. You can either use Docker Compose to run it locally, or there's a Kubernetes Helm chart. I'll be using Docker Compose to quickly show an example. This is the Keploy dashboard. It's very simple right now: it just has test cases and test runs. Under test cases you see the applications; for example, demo-three is an application here, and these are some requests which were captured for it.

For this particular demo we have a URL shortener application to demonstrate how this works. It's also available in the Keploy org, in the Go samples. It's a simple application that does three basic things, which I'll show, and it has only one dependency, a Postgres database. I'll run that locally using Docker Compose as well. So Postgres is up and running, and now we can start the application; we're using the Echo framework for this sample.

Now let's run a bunch of requests. To start with, a simple POST request: we can send in any URL and get back the shortened URL. In this case it returned the shortened URL along with a timestamp. The timestamp is important because anything time-sensitive will change between runs, which causes flakiness and false positives; Keploy handles that too, which I'll show in a minute. Once I've done this, I can GET the shortened URL: it's already in place, and it redirects to github.com. Then I can change the target to Bing, which I did. And finally a DELETE call, so let's just delete that shortened URL. So I've made a bunch of requests.
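For readers following along, a minimal re-creation of a handler along these lines might look like the following. The route, the hashing scheme, and the exact response shape here are assumptions for illustration, not the sample's actual code; the point is simply that the response carries the shortened URL together with a ts field that changes on every call.

```go
package main

import (
	"crypto/sha1"
	"encoding/base64"
	"net/http"
	"time"

	"github.com/labstack/echo/v4"
)

type shortenRequest struct {
	URL string `json:"url"`
}

func main() {
	e := echo.New()

	// POST /url takes {"url": "..."} and returns the shortened link plus a
	// timestamp. The ts value changes on every call, which is exactly the
	// kind of "noisy" data a replay tool has to ignore when diffing.
	e.POST("/url", func(c echo.Context) error {
		var req shortenRequest
		if err := c.Bind(&req); err != nil {
			return c.JSON(http.StatusBadRequest, map[string]string{"error": "invalid body"})
		}
		sum := sha1.Sum([]byte(req.URL))
		id := base64.RawURLEncoding.EncodeToString(sum[:6])
		// A real version would persist id -> req.URL in Postgres here, and a
		// GET handler would look the id up and redirect.
		return c.JSON(http.StatusOK, map[string]interface{}{
			"url": "http://localhost:8080/" + id,
			"ts":  time.Now().UnixNano(),
		})
	})

	e.Logger.Fatal(e.Start(":8080"))
}
```

A request such as `curl -X POST localhost:8080/url -H 'Content-Type: application/json' -d '{"url":"https://github.com"}'` would then return the shortened link plus the timestamp, and that timestamp is the kind of field that differs on every run.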
Now if I go to the Keploy dashboard, I can see that many of them were captured. It looks like the GET did not get captured, though, because it was cached by the browser itself. One way to fix that is to make the call again, this time with curl. That should do it. Perfect, we have the GET as well; it's basically the redirect.

Now if I actually go inside a test case: this is the request, this is the response, the standard things. What becomes interesting are the dependencies. Here we're seeing the metadata of all the SQL operations that happened; it would be the same for other databases or other dependencies. What it's not showing right now is the binary data. That's stored in the Keploy database, and it's what gets used for virtualizing the infrastructure.

So let's get to that. Here we have the Go test integration. I'll stop the application, and I can in fact stop the database too, because I don't need it during testing. Then I run the tests with coverage. What this does is download all the test cases we just saw in the UI, run them one by one, and report what happened. As we can see, it shows 74% coverage because of the integration with Go test; all of those calculations are done by Go test itself. If I go to test runs, I can see all five tests here, and for each one the response that came back.

Like I mentioned, the timestamps: as you can see, the timestamps are different on replay. The way this works is that it makes a second call and compares the two responses, so fields that differ get flagged. In fact, if you go to the raw events, you can see it was added as a noisy field: body.ts, the timestamp field.

Now, what happens if there's a bug, a known regression, or a real change? To show that, let's change one of the keys: I'll change url to urls and run the tests again. As we can see, a bunch of test cases failed, and in the UI I can see that both POST requests failed, because I changed the key of one of the parameters. This could be a bug, or it could be something we actually want. Assuming it's expected behavior, we can normalize it right from the UI here, and the test cases get updated. If I run the tests again, they fetch the updated version of the test cases we just normalized. I only normalized one of them, so I have to normalize the other as well; there's also a bulk normalize for exactly this purpose. Now all of these should pass, and as you can see, they all do. So you can bulk-normalize when you know the failures come from the same cause. That's a quick demo of Keploy.

Coming back to the presentation: where are we right now? We added support for Go first, since we're more familiar with it and that's what our tech stack was, so we made the Go SDK first. We're currently adding experimental support for Java, JavaScript, and TypeScript. We have a UI to edit and visualize tests, which you just saw, and we keep making changes based on feedback. We try to integrate as much as possible with native test tooling and open standards so that you don't have to change your pipelines or anything like that. And we can easily detect and ignore noisy, time-sensitive data, for example the timestamps we just saw; those get detected automatically.
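As a mental model for that comparison step, here's a tiny sketch of a diff that skips fields marked as noisy. The flattened path-to-value representation and the function name are our own assumptions, not Keploy's internals.

```go
package compare

// MatchIgnoringNoise compares a recorded response with a replayed one,
// skipping any fields that were flagged as noisy (for example "body.ts").
// Both responses are assumed to be flattened into path -> value maps,
// such as {"status": "200", "body.url": "...", "body.ts": "..."}.
func MatchIgnoringNoise(recorded, replayed map[string]string, noisy []string) (bool, []string) {
	skip := make(map[string]bool, len(noisy))
	for _, field := range noisy {
		skip[field] = true
	}

	var diffs []string
	for path, want := range recorded {
		if skip[path] {
			continue // noisy field: timestamps and the like are expected to differ
		}
		if got, ok := replayed[path]; !ok || got != want {
			diffs = append(diffs, path)
		}
	}
	return len(diffs) == 0, diffs
}
```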
Now, the limitations. Everything has limitations; there's no tool that works for all use cases. First, this does not test the impact of network or infrastructure failures. Since we're virtualizing the infrastructure, it tests the functionality of the system under test, and the system under test could be one application or all of your microservices, but anything outside it is not exercised. So things like the impact of network failures, or a dependency going down, are out of scope.

To run this at scale, since it's an SDK running inside your code, we're implementing deduplication and sampling, but that hasn't been tested in production yet. That's something we still have to try, and we'll iterate on it. Also, we currently have language- and framework-specific SDKs, like I mentioned; we're exploring how far an agent can take us, maybe not entirely for functional testing but for other aspects of testing. And as of now, the current version of Keploy supports one application as the system under test, so all applications are tested individually. Support for testing two or three applications together is coming soon; if you'd like to try that and it sounds interesting, we'd love to see you on the community channels. Data streaming is not implemented yet either: if you're using gRPC streaming or WebSockets, we don't support that yet. That's something we keep getting as user feedback.

Future work. Contract testing: if you think about it, we're already doing a form of contract testing, since we're verifying that things still work, but there are aspects of contract testing we don't cover, for example integration with the clients, knowing whether the client did something wrong rather than only catching breakage at the server level. What we have today is server-driven testing, and we've been getting feedback that there's a lot we can do to ensure client-side contracts are also honored, so we'll be extending that to get on par with existing contract testing tooling. Recording from live environments, like I mentioned: recording from production environments at scale is something we still need to work on and are benchmarking; it works well for beta environments, or locally, which is where we've been trying it so far. We'll be releasing the Java and TypeScript SDKs soon; the experimental SDKs exist, and the stable releases are coming. We'll have an agent implementation. And we're looking at things like fuzz testing, or using the context from the test cases we already have to generate additional test cases that add more coverage on top of the existing tests.
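On the recording-at-scale point above, one common way to keep capture volume manageable is to record the first occurrence of each request signature and only a small sample of repeats. Here is a toy sketch of that idea, with an assumed signature function and sampling rate; it is not Keploy's implementation.

```go
package record

import (
	"crypto/sha256"
	"encoding/hex"
	"math/rand"
	"sync"
)

// Sampler decides whether an incoming request is worth recording. It keeps
// the first occurrence of every signature (method + path + body) and records
// later duplicates only at the configured sampling rate.
type Sampler struct {
	mu   sync.Mutex
	seen map[string]bool
	rate float64 // e.g. 0.01 records roughly 1% of repeated signatures
}

func NewSampler(rate float64) *Sampler {
	return &Sampler{seen: make(map[string]bool), rate: rate}
}

func signature(method, path string, body []byte) string {
	sum := sha256.Sum256(append([]byte(method+" "+path+"\n"), body...))
	return hex.EncodeToString(sum[:])
}

// ShouldRecord returns true for the first occurrence of a signature and for
// a small random sample of subsequent occurrences.
func (s *Sampler) ShouldRecord(method, path string, body []byte) bool {
	sig := signature(method, path, body)

	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.seen[sig] {
		s.seen[sig] = true
		return true
	}
	return rand.Float64() < s.rate
}
```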
That's basically what we wanted to cover, so thank you. We're available on GitHub, please check it out, and we also have a Slack channel. We're a young, growing community, so we'd love to have all of you there to discuss your use cases and solve new problems. I think now we're open for questions.

Let me repeat the question first: how do we protect data? We're capturing requests, we're capturing responses, we're also capturing database queries, and we're replicating all of that, so everybody who has access to the platform has access to all of the data. That's a great question. As of now it's very rudimentary: we just capture and replay, like you said. We're planning to add, and this is a feature request we've been getting, the ability to redact data so that users on the platform can't access sensitive values. And also, since it's open source and you can host it yourself, the ability to encrypt the data in case somebody gets access to the machines where it's stored. So right now it's not supported; it's very simple. That's a feature that will be added, to ensure sensitive data isn't directly visible to users, or to add access-control layers based on how organizations want to handle it.

Next question, and I'll repeat it again: what about creating test suites? Right now these look like individual test cases; can we bundle them together, maybe per PR or per use case, and keep them as separate test suites? Yes. Looking at how most QA workflows are done today, I think that's a very critical requirement, and we're working on it. You'll be able to create named test suites, and all of that context will live within that particular named suite. Those test suites will be independent of each other, because each has its own virtual infrastructure, so they can run in parallel and won't depend on one another. But within a test suite, you'll have the option to keep the cases dependent.

Any other questions? Do we have any virtual questions, from the live stream? Not yet? Perfect, I think that's it then. It was really great presenting Keploy and talking about our journey. Please feel free to reach out to us anytime to discuss any problem or use case you have, and if you try it and face any problems, please join our community channels to discuss them. Currently we host a monthly community meeting, and the rest of our queries and discussions happen on Slack. If somebody has a specific use case, we just share a Zoom or Meet link, get them on a call, discuss the use case, and see whether we can add it to the roadmap. Yes, it's monthly, on the 25th. And feel free to chime in on the Slack channel and ask any questions; we're all there. Awesome. Perfect. Thank you.