So my name is Puneet Khanduri. I've worked at Twitter for the past three years, in different parts of the company, from platform engineering to core ads, where we make all of our money, and most recently on an India-specific effort out of Bangalore. These days I'm working out of San Francisco again, so I was on a plane with a 30-hour journey ahead of me and managed to get sick, so please try to keep up with me.

This talk is about catching bugs without actually writing tests, and why a company like Twitter decided to invest in something like this requires some justification and motivation. How many folks here remember November 2015? New user registration was basically broken for a week. New users couldn't sign up on Twitter for almost a week, and the stock was down 5%. How many people remember that? Okay, nobody. Later that year, December 2015, we accidentally deleted all of Amitabh Bachchan's tweets, and his followers were very, very pissed off. How many people remember that? All right, for the folks who put their hands up, please put your hands down, because that never happened. And the first one also never happened. And it didn't happen because we had this tool: it prevented the bug from being shipped into production. Yeah, sorry about that. I told you I had been on a plane for 30 hours.

Anyway, as developers we constantly have this fear. Every good developer should have this fear, and if you're a manager and you have a developer who doesn't have this fear, then fire them, because you have mission-critical software that could cause an incident in production. You should feel scared when you're shipping code, and you should do everything you can to mitigate that risk whenever you ship big changes into production. Often, as code accumulates, tech debt builds up and legacy systems evolve. A lot of courage is needed on the engineer's part, and in some cases at the organizational level, to do these huge refactors and say: no, we're going to do things completely from scratch. And that creates a huge risk for the people who have to make those decisions.

The other problem is that if you're in DevOps, if you're a site reliability engineer at Twitter, for example, then you don't even know what's in the code that's being shipped. Your responsibility is to determine whether it's good or not and then ship it to production. How do you know whether it contains something potentially disastrous for the company?

So naturally we go back to Software Engineering 101 and start thinking in terms of unit tests. We write classes, we write methods, and then we look at a method. If I look at all the possible code paths in my method, there are five of them: in this example, the graph has an entry point and a return point, and there are five different paths that can be exercised through the code. So if you write five tests, you have 100% coverage; one test equals 20% coverage. How many people have written these small-sized tests, these unit tests? Unit tests are good because you get pretty good bang for the buck, and if you come from a test-driven development background, this is something that's highly encouraged.
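To make "code path" concrete, here's a hypothetical Scala method (not the one on the slide): two sequential branches already yield four end-to-end paths, and the slide's five-path method is the same idea with a different branch structure.

```scala
// Hypothetical method, not from the slides: each independent branch
// multiplies the number of end-to-end paths a test suite must exercise.
def describe(x: Int, y: Int): String = {
  val sign   = if (x > 0) "positive" else "non-positive" // 2 ways through here
  val parity = if (y % 2 == 0) "even" else "odd"         // × 2 ways through here
  s"$sign and $parity"                                   // 4 end-to-end paths
}
```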
Then you start getting a bit more ambitious and you say: you know what, testing one method or one class in isolation is not enough, because I have a service. In this example, that service has six methods on the request path: when a request hits your server, six methods are exercised before the response comes back out, and each of those methods has five independent code paths. That's about 15,000 code paths. So, is anyone here motivated to write 15,000 tests for any service? Good. We begin to see that the incremental value we get out of writing each individual test is getting smaller.

Then you get to integration testing, where you say: I want to look at the entire system, at how my change affects not just this particular class or this particular service but, because I'm part of a larger service-oriented architecture, the system as a whole. In this example, you have four services, with six methods within each service and five paths within each method. That's about 60 quadrillion code paths, to give a sense of the scale we're talking about in terms of path coverage. If Twitter decided today that we wanted 100% coverage for all of our code, we would basically need to shut down all the data centers we have, dedicate them completely to running tests, and those tests would run for a thousand years. Meanwhile, the life cycle of the code is a week or two, because you're constantly deploying changes. So 100% coverage at this scale is impossible to achieve, and it's not very meaningful, because what you're really interested in is answering a very basic question: will something bad happen in production during the life cycle of this change, for the duration of time during which it is deployed?

So we see there's a super-exponential cost of coverage. We talked about the small, medium, and large scales of testing, and the relative cost of the same percentage of coverage grows faster than exponentially. The bigger and more complex the system becomes, the more impossible it is to get 100% coverage.

So we came up with the Diffy approach for thinking about meaningful coverage, and for leveraging that insight to catch bugs before they get shipped into production. The idea is very simple. If you look at the anatomy of a test, it has an input, an external stimulus: you provide your code an input, you get some output back, and then you need to make assertions on that output to decide whether it is correct. The most meaningful input you can give your code is the input it would see in production. So what we started doing was sampling live production traffic. Running against live production traffic basically simulates the production environment for the test code and answers the question: if this code were in production right now, how would it behave? And then you compare that code's behavior to the behavior of the code that's already deployed in production.
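For reference, the combinatorics quoted above multiply out as follows (a quick sanity check, in Scala):

```scala
// The path-count arithmetic from the talk's example, exactly.
val pathsPerMethod  = BigInt(5)
val pathsPerService = pathsPerMethod.pow(6)     // 15,625 — "about 15,000"
val pathsSystemWide = pathsPerMethod.pow(6 * 4) // 59,604,644,775,390,625 ≈ 60 quadrillion
```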
Assuming you're not in firefighting mode right now, you assume that what's in production is correct. That's your reference point. Then, by comparing the behavior of the code in production with the code that is not yet deployed, you can say with reasonable confidence: yes, this is okay, this is not going to blow stuff up.

But there are a lot of things that create noise and can render this approach useless. For example, if you send a request and get a response back, the response might have server-generated timestamps in it. Two servers will never have the same clock; it doesn't matter how hard you try. Even if you use a protocol that synchronizes clocks across all the servers, the requests sent to the production code and to the undeployed code are a few milliseconds apart, and that skew can result in different timestamps coming back. Those timestamps are not equal, and it's not your code's fault. So this is a source of noise; how do you ignore it? You might have random number generators in your code: one server generates one random number, the other server generates a different one, and they don't match. How do you know it's okay to ignore that, and not okay to ignore other things? What I'm really getting at is: how do you separate noise from signal? How do you focus on the differences that are meaningful and filter out the differences that have no meaning and can be rejected as noise?

It takes half a second to have the realization in this diagram. Are you able to see the slides? Is the color a bit off? Let me just describe it. At the top here we have the candidate: this is new code. The primary and secondary are old code: two instances running the same old code. Old code is good code; new code is potentially bad. We send the same set of requests to all three instances. Then we compare the responses we get back from the new code and the old code; the differences between those two are your raw differences, and they contain both noise and the meaningful differences. But by comparing the old code to itself, any differences that show up between the primary and the secondary cannot be attributed to the code, and that by definition is your nondeterministic noise. So now you can subtract this nondeterministic noise from the raw differences, and the filtered set of differences you get at the end are the meaningful differences you should look at, because they can only be explained by differences in the code. They have nothing to do with the machines being different or anything else. Everything else is the same. Make sense?

Okay, let's do a quick demo of what this looks like to users. I have a dummy service here. This is my old code. It has some stuff in it; don't try to read it, it's meaningless. This is the new code, and there's something different between the new code and the old code. Now let's see if Diffy can catch it. What I'm going to do now is deploy two instances of the old code.
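The noise subtraction just described, as a minimal sketch, assuming flat key-value responses (Diffy's real differencer walks arbitrary object trees, as discussed later):

```scala
// Minimal sketch of the candidate/primary/secondary idea; not Diffy's internals.
case class Response(fields: Map[String, String])

// Field names whose values differ between two responses.
def diff(a: Response, b: Response): Set[String] =
  (a.fields.keySet ++ b.fields.keySet)
    .filter(k => a.fields.get(k) != b.fields.get(k))

def meaningfulDiffs(candidate: Response, primary: Response, secondary: Response): Set[String] = {
  val raw   = diff(candidate, primary) // real differences + nondeterministic noise
  val noise = diff(primary, secondary) // same code on both sides, so pure noise
  raw -- noise                         // what's left can only come from the new code
}
```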
So this is my first instance deployed. Now I'm deploying my second instance. That's my primary and secondary old code, two instances deployed. And now the new code. Yeah, I'll get to that. I'm going to deploy these instances in the topology I showed; I'll deploy them first and then show you the interface. Let's see. Don't bother too much about this; let me get through it quickly because it's not very meaningful. Okay, my Diffy instance has started. Let me throw some traffic at it. Okay, let's go here. This font size is not going to work. Aha, how about now?

If we look at this, we have timestamps coming out as differences, and obviously this is the example we just talked about: you'll have noise in the timestamp. Then there are these differences that are more interesting. Here we see that some parts of the response were lowercase in the old code, but the new code is capitalizing the first letter. Now let's turn our noise exclusion logic on. There we go. You see that the timestamp got grayed out. Basically Diffy is telling us it's okay to ignore the timestamp, and okay to ignore the date. It's smart enough to figure out, from the relative frequencies at which these differences occur, that they're about the same on both sides, so it's okay to ignore them; but this one cannot be explained. So you focus on that one, and you see the actual differences that are showing up. If you want to see the request that triggered this behavior, you can also do that: I just clicked on that square in the corner, and I can see the GET request with its path parameter and the full responses that came back. This helps with debugging, because now you can replay the same request, fix your code, replay the request again, and iterate until the bug is fixed.

No, the timestamp and date are part of the response; they're something the application is returning. Diffy is schema agnostic: it can parse any JSON, Thrift, XML. You can change what? Right, right. But ideally you don't want to do any exclusions manually. The UI does allow you to do exclusions: for example, I could turn the noise exclusion off and manually gray out the fields I want to ignore. But we discourage people from doing that, because it can create blind spots. For one particular change you're shipping, you might want to turn some parts off and say, don't look at these things because I intend to change them; but in the next release, you might forget to turn them back on. The maintenance cost of those flags is pretty high for the developer, so we ask people to rely on Diffy's ability to differentiate, and work with that. But yes, it's possible to do that.

So that's a quick demo of the Diffy UI itself. Sorry, just one second. Yeah, go ahead. Yes, this is open source. At the end of the presentation there's a slide with the open-source link, so you're more than welcome to download it and play with it. There's also a pre-built jar; some people struggle with building the code with SBT, so if you just want to use it, you can download the Maven jar and run it. There's an example with the documentation as well. Absolutely, yes.
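Back to the noise exclusion shown in the demo: the talk doesn't spell out the exact heuristic beyond "relative frequencies", so the rule below, its names, and the 5% tolerance are all invented for illustration; it behaves the way the demo's UI does.

```scala
// Hypothetical frequency-based noise detection: if a field differs
// candidate-vs-primary at about the same rate as primary-vs-secondary,
// treat it as noise. The 5% tolerance is made up for this sketch.
case class FieldStats(rawDiffRate: Double, noiseDiffRate: Double)

def isNoise(s: FieldStats, tolerance: Double = 0.05): Boolean =
  math.abs(s.rawDiffRate - s.noiseDiffRate) <= tolerance

// "timestamp" differs ~100% of the time in both comparisons -> gray it out
isNoise(FieldStats(rawDiffRate = 1.00, noiseDiffRate = 0.98)) // true
// the capitalized field differs only candidate-vs-primary -> real signal
isNoise(FieldStats(rawDiffRate = 0.40, noiseDiffRate = 0.00)) // false
```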
For brand new services, this is not very helpful, because there is no reference point: if you don't have old code, you can't compare new code against anything. So you're absolutely right, that is a limitation. Most of the time we find this useful for tier-zero services that have existed for a while and need to do migrations; it allows them to take on more risk. For example, one of the things that prompted us to use this was when Twitter migrated away from Ruby onto Scala. The entire stack was rewritten from scratch, and there was tremendous opportunity to break lots and lots of things. It was tools like this that let us know the Scala implementation was equivalent to the Ruby implementation: functionally equivalent for all practical purposes we care about.

We have some ideas on how this could be leveraged inside the code using annotations, but getting inside the code requires providing SDKs for every different platform. One of the nice things about building a tool at the service level is that you have a networking-layer abstraction: you don't care whether the code itself is written in Python, Ruby, Scala, Java, C++; it could be written in anything. But if you want to build something that tests the code inside the service, it ends up being language and platform specific, and right now we haven't made any investments in that direction. Honestly, tools like this are more valuable at larger scale, where the relative helplessness of developers is high, so there's been less motivation to go that way. What about the command? Right, it is; I'll talk about that in a little bit when I talk about automation. But yes, there is scope for that as well.

Okay, so now let's go back here. All right, let's talk about automation. As soon as we built this tool and teams started using it, we initially saw individual developers use it to test their unmerged code: they would compare their code against master, and when they felt comfortable that they'd done enough testing, they would send out a code review and the code would get merged in. Pretty soon teams started requesting: hey, it'd be great if you could automatically compare master against production, because even after merging into master, some additional vetting and validation needs to happen before you can deploy to production. So we built an automated CI job for Diffy that takes the latest in master and the latest tag from production, does the comparison, and sends out an automated report. Let me show you what one of those reports looks like. This is an example of an email that might be waiting for you when you come into work at nine o'clock, saying: we compared master against production for four hours last night, and these are some critical differences we think are important for you to look at. This is going to get deployed in two days; you have the opportunity to prevent it from being shipped. If it's a feature and you want it to be shipped, you do a manual override. Otherwise, a human is required to investigate when Diffy finds differences.
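A CI gate around such a report might look like this purely hypothetical sketch; the host, endpoint path, and JSON field are invented for illustration, and Diffy's actual published HTTP API (mentioned next) is what a real hook should use.

```scala
// Purely hypothetical CI gate: poll a deployed Diffy instance after a replay
// window and fail the build if unexplained differences remain. The route
// "/api/report" and the field "differenceCount" are invented for this sketch.
import scala.io.Source

object DiffyGate {
  def main(args: Array[String]): Unit = {
    val diffyHost = "http://diffy-staging.example.com:8888" // assumed deployment
    val report    = Source.fromURL(s"$diffyHost/api/report").mkString
    // Naive string check, enough to show the shape of the gate.
    if (report.contains("\"differenceCount\":0"))
      println("No meaningful diffs; safe to deploy.")
    else
      sys.error("Diffy found unexplained differences between master and production.")
  }
}
```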
One of the things we realized very early on at Twitter is that our engineering culture grew organically, so different teams had different practices for how they did CI. The way we built this was to expose HTTP APIs, so any CI job can interact with a deployed Diffy instance through the HTTP API. It's a documented, published API, and you can use it to hook up your CI.

All right. The other kind of bug, which is a bit more subtle, is the performance regression. It could be that your code is functionally correct and produces the exact same response as the old code currently deployed in production. But what if your code is 100 milliseconds slower than the previous code? And what if you have upstream SLAs, and the client starts timing out because you're not meeting the performance service-level agreement? One of the things we realized is that, because of the topology we have for Diffy — let's go back here — the same identical traffic is being replayed onto all three instances in lockstep. Every request is multicast to all three instances at the same time, so the traffic load seen by all three instances is exactly the same.

Our service architecture uses Finagle as the underlying library, and one of the benefits is great visibility into metrics: you can get metrics for everything from heap allocations to garbage collection to CPU utilization and memory utilization. You name it, you have a metric for it, and all of those metrics are available to you. The problem is that every service exports more than 6,000 metrics on average. How do you find the two or three offending metrics out of those 6,000? It is humanly impossible for someone to sit down and eyeball all the metrics, and we've seen examples where humans are terrible at this because of how we visualize these graphs.

So we decided to use a different part of Diffy to do this performance regression analysis. Here we also realized that for performance regression analysis you can't base your measurements on a single instance of anything. Remember, in the earlier topology we had one instance of the new code and two instances of the old code. In this topology, we have three instances of the new code and three instances of the old code. For performance regression analysis you need this size, because what if the one node you took for the candidate was connected to a bad switch? What if it has a slow disk? To eliminate the noise that can come from hardware variance, we take measurements across clusters rather than individual instances when we're trying to measure performance. And then we wrote a classifier that takes all the metrics coming out of all of these boxes and tries to classify each one as passed, which means there's nothing wrong with this metric; ignored, which means I don't know what's going on, so you're on your own; or failed, which says there's definitely something wrong here, so please take a look at it.
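A compressed sketch of the cluster-level measurement idea: the verdict names mirror the talk, while the latency numbers are invented for illustration.

```scala
// Sketch: aggregate each metric across a whole cluster with a median, so one
// box on a bad switch or with a slow disk can't masquerade as a regression.
sealed trait Verdict
case object Passed  extends Verdict
case object Ignored extends Verdict
case object Failed  extends Verdict

def median(xs: Seq[Double]): Double = {
  require(xs.nonEmpty, "need at least one sample")
  val s = xs.sorted
  if (s.size % 2 == 1) s(s.size / 2)
  else (s(s.size / 2 - 1) + s(s.size / 2)) / 2.0
}

val candidateLatencyMs = Seq(102.0, 99.0, 310.0) // one slow box: bad switch? slow disk?
val primaryLatencyMs   = Seq(100.0, 101.0, 98.0)
median(candidateLatencyMs) - median(primaryLatencyMs) // 2.0 ms: the outlier box drops away
```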
Talking about classifiers: we tried various approaches to building them. The simple sample-count classifier just says you need a minimum number of samples for any measurement to be statistically significant. Then there are the relative threshold and absolute threshold classifiers, and the median absolute deviation, which was the most robust one. I won't get into too many details about what these classifiers are all about, but one high-level thing worth taking away from this presentation is that you can't take averages. When you're doing performance regression analysis, you can't look at averages of your data set; you have to take medians, because averages are very susceptible to outliers. Imagine a line that is oscillating flat, and an outlier that shoots up and then comes back down. When you take the average, the entire curve gets lifted up because of that one outlier. Whereas if you take the median, when you sort everything, the outlier falls away, and the value left in the middle is a better representation of what normally happens in the system. That's why a lot of these classifiers are based around percentiles, specifically p50 in this case, the median, as opposed to averages, or standard deviations for that matter.

The other thing we realized was that each classifier was good for a certain category of metrics, and had a blind spot for another category: it would not work very well for some metrics but would work extremely well for others. The approach that ultimately worked for us was combining all of these classifiers in a meaningful way. Here, what's happening is that we're saying: the sample-count classifier requires at least 40 samples, and one of these other classifiers should say there is nothing wrong with this metric in order for the metric to be considered passing. We basically wrote the code in a way that allowed us to combine and experiment with combinations of classifiers.

Just before I finish the talk and take questions, I want to show you what this looks like. Here's an example of an email you would get from Diffy with all the metrics. I don't know if you can read that; let me increase the font size a little. It's basically saying that, of these hundreds of metrics, it's downsized them to this many, and you need to look at these. If I asked you to look at 600-odd metrics, you'd probably not do it; but with 12, I think it's a reasonable request. And if we open that up — I think I already have this open right here — we can see that the primary and secondary clusters are here at the bottom, and the candidate is going off on its own somewhere else. Diffy doesn't know anything about the meaning of this metric; it's just doing this analysis to say: hey, there's something weird happening here. This metric for the candidate cluster is going off in its own direction, while the reference cluster is very tightly constrained, with almost no variation; it's basically a straight line. So it's important to highlight this to the user.
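A hedged sketch of that composition: the 40-sample gate comes from the talk, while the 3×MAD and 5% relative thresholds are invented stand-ins for Diffy's tuned values.

```scala
// Compose classifiers as "enough samples AND any one detector vouches".
object MetricClassifier {
  sealed trait Verdict
  case object Passed  extends Verdict
  case object Ignored extends Verdict
  case object Failed  extends Verdict

  private def median(xs: Seq[Double]): Double = {
    val s = xs.sorted
    if (s.size % 2 == 1) s(s.size / 2)
    else (s(s.size / 2 - 1) + s(s.size / 2)) / 2.0
  }

  // Median absolute deviation: a spread estimate that outliers can't inflate.
  private def mad(xs: Seq[Double]): Double = {
    val m = median(xs)
    median(xs.map(x => math.abs(x - m)))
  }

  def classify(primary: Seq[Double], candidate: Seq[Double]): Verdict =
    if (primary.size < 40 || candidate.size < 40) Ignored // sample-count gate
    else {
      val delta          = math.abs(median(candidate) - median(primary))
      val withinMad      = delta <= 3.0 * mad(primary)                // MAD classifier
      val withinRelative = delta <= 0.05 * math.abs(median(primary))  // relative threshold
      if (withinMad || withinRelative) Passed else Failed             // any detector may vouch
    }
}
```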
For example, one bug we caught using this was when a service accidentally started making twice the number of downstream calls it needed to. It needed to read a value from a downstream service, and it only needed to read it once, but it was reading the same value twice. Now, functionally, this is correct: you can read the same value twice. But from a performance perspective, it's horrible, because you're fetching the same value twice and creating twice the traffic for the downstream service. And when you do this at a scale involving thousands of machines, the extra requests-per-second load is insanely high. This sort of bug is caught using frameworks like these.

So, as I promised the gentleman at the back, Diffy is open source and you can grab it from this link. And that's my Twitter handle. By the way, if you take nothing else away from this talk, one thing that is very important to remember is: use Twitter. If you're not active Twitter users, please use Twitter and follow me on Twitter, and I'd be happy to chat with you about Diffy or anything else that has to do with service-oriented architectures. Questions, please. The gentleman at the back.

Excellent question: non-prod setup. And the reason is, I think you know what the reason is. Unless your traffic is idempotent, meaning that replaying it in production is safe, you cannot risk playing the same traffic against production, because you would end up playing it twice: once when you sample it, and again when you replay it. So we do not replay traffic in production. These instances are isolated, with staging profiles in a test environment, not the real production machines.

Sorry. Right. So that ends up being service specific. Service owners are responsible for maintaining staging profiles so this data is replayable; it's something the service owner provides. The scope of Diffy is limited to taking sampled traffic and replaying it to whatever targets you provide, and it interacts with those targets at the network level. You could very easily pass it production instances as targets and Diffy wouldn't know the difference. So it's not a Diffy concern; it's a service-owner concern. If you want to use Diffy in a safe way, you're responsible for coming up with a staging profile where you can replay the traffic that's been sampled from production.

Yes. No, we add and remove fields all the time, and it looks for changes in the schema as well. If certain fields are expected and go missing, it produces a missing-field error; if there are fields that are not expected, it produces an unexpected-field error. So any differences in the schema structure itself are also identified by Diffy and reported back. As I said earlier, it's schema agnostic: it parses the schema at runtime and then compares the object trees. It's diffing the object trees recursively through the nodes, finding the lowest possible node where it can identify and justify the difference.

Yeah. So it's possible to shoehorn some of these use cases into Diffy, but to be completely honest, it would be hard, and it's not really meant for that use case. It's meant for testing services in service-oriented architectures, and mostly services that are stateless. What you're talking about is basically comparing state.
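To make the recursive object-tree diffing concrete, here's an illustrative sketch (not Diffy's actual code) of the missing-field and unexpected-field reporting just described:

```scala
// Responses are lifted into a normalized tree, then compared node by node,
// reporting missing fields, unexpected fields, and leaf-value differences.
sealed trait Node
case class Leaf(value: String)            extends Node
case class Obj(fields: Map[String, Node]) extends Node

sealed trait FieldDiff
case class MissingField(path: String)                        extends FieldDiff
case class UnexpectedField(path: String)                     extends FieldDiff
case class ValueDiff(path: String, was: String, now: String) extends FieldDiff

def diffTrees(old: Node, neu: Node, path: String = ""): List[FieldDiff] =
  (old, neu) match {
    case (Leaf(a), Leaf(b)) if a == b => Nil
    case (Leaf(a), Leaf(b))           => List(ValueDiff(path, a, b))
    case (Obj(a), Obj(b)) =>
      val missing    = (a.keySet -- b.keySet).toList.sorted.map(k => MissingField(s"$path/$k"))
      val unexpected = (b.keySet -- a.keySet).toList.sorted.map(k => UnexpectedField(s"$path/$k"))
      val children   = (a.keySet & b.keySet).toList.sorted.flatMap(k => diffTrees(a(k), b(k), s"$path/$k"))
      missing ++ unexpected ++ children
    case _ => List(ValueDiff(path, old.toString, neu.toString)) // node shape changed entirely
  }
```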
One way I could imagine doing that is writing a very dumb service on top of the state, then running traffic that accesses every row in that state and doing a comparison like that. But that ends up being a little hacky and a lot of work, and the maintenance of that service and all of those things doesn't really justify the cost. Maybe there is a variant of Diffy for this. Diffy has multiple packages, and there is a differencing package, which is the core of Diffy: it just takes two object trees and compares them. Maybe that core can be extracted to write an offline tool that iterates through all your records and does these comparisons, so you don't have to write a service. But that would need to be a different tool that hasn't been written yet. I encourage you to take a look at the code and see if you feel like writing one.

Please go ahead. What sorts of architectures would Diffy not work for? It wouldn't work for stateful services. For example, some services hold state in memory, and I know of services that do this for performance reasons. Then your state can diverge over time, and that divergence in state between the candidate and the two reference instances will create a lot of noise. So if you have stateful services, Diffy will at some point not work for you, because there'd be too much noise: everything would start getting canceled out, and it would basically not find any bugs for you. That's one example where it wouldn't work.

Sorry, this is the last question, I'm being told, but I'm happy to stick around afterwards and answer questions in person. Please go ahead. Sorry, sorry. Yeah. Yes. Diffy was originally written for Thrift, which is the protocol we use heavily for internal services at Twitter, and it was then extended to support HTTP. But the core has been written to be protocol agnostic: there is a lifter package where you can write a lifter from protocol X to a normalized schema that is internal to Diffy, and once that conversion happens, all the differencing and all the reporting you get for free. So it is extensible, and it's been built to do that. Yeah. Thank you so much.