Carnegie Mellon vaccination database talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and PostgreSQL configurations at ottertune.com. And by the Steven Moy Foundation for Keeping It Real; find out how best to keep it real at stevenmoyfoundation.org.

Hi guys, welcome to the second session of the vaccination database talks. We're excited today to have David Daly, who's a staff engineer on the performance team at MongoDB. David got his undergraduate degree from Syracuse University and his PhD in Electrical Engineering from UIUC in Illinois. So he's been at MongoDB since, I guess, 2015? 2014, yeah. 2014, okay. So we're super happy he's here today. Again, we want to thank OtterTune and the Steven Moy Foundation for Keeping It Real for sponsoring us. And as always, if you have a question for David, please interrupt him as we go along. Unmute your mic, say who you are and where you're coming from, and ask your question. We want this to be as interactive as possible, right? David, the floor is yours. Thank you so much for being here.

All right, well, thank you for having me, and I'll definitely reiterate that: please ask questions along the way. It's definitely more interesting that way. So let's talk a little bit about how we test performance at MongoDB. Hopefully we don't waste a lot of time and money anymore on testing performance, but we definitely have wasted time and money in the past. It's easy to do, and I suspect a lot of people do it. In this talk, I'll go into how we test it. Hopefully the big picture generalizes to testing other databases and other software as well, but I'll leave that to you to decide.

So I'd like to start with a little bit of a story. I joined MongoDB in 2014, and my first release cycle that I went through was the 3.0 release. It started out as the 2.8 release, and very honestly, marketing decided 2.8 wasn't a significant enough number for that release, so it got bumped up to 3.0. It was my first release, and it was a learning experience. At the time we didn't have infrastructure, but we had a performance team that we put together to test all the things. So we get to the release candidate. Here's the first release candidate, and we run a lot of tests. And here, somebody's run a test and found a 22% regression. They go and they bisect it to find the commit where the change happened. We invest a lot of time in it. What happened? Well, the ticket was closed as "cannot reproduce." This was not a thing. It turned out that after this person tested it and isolated it to a commit, they handed it off to a server engineer to debug, and that engineer spent time trying to figure it out, but it didn't reproduce for them. Then it went back to the first person, and when they reran it, now a month or two later, it didn't reproduce for them either. It was just a giant waste of time. So that's not good. We don't want things like that. We want to be able to give people useful information so that we can make the server faster.

So a lot of this talk is a response to that: how do we avoid doing what we just did in this ticket? And what are we ultimately trying to do on my team? We want to understand the performance of MongoDB, of our software, and when it changes. And when I say when it changes, what I mean is in terms of development time. So if it gets to November and someone puts in a commit and that makes query optimization slower, we want to know. If someone puts in a commit and it makes journal synchronization faster, we want to know.
We want to know those things so that we can put in effort where it's needed to improve things, and we can hold onto those changes that are improvements.

So there are ways to do this and ways not to do this. Things you want to avoid, how not to do it: you don't want to have machines with personality. You shouldn't have to rerun a test on the same machine to show a regression. You shouldn't run tests by hand. If you run tests by hand, one, you don't know exactly what you did anymore. Humans make mistakes. You run by hand, and did you find a regression, or was the regression in the user running the test? You don't want to wait until release time. Obviously the ticket we were just looking at has done all of these things. You wait until release time, and what happens is the code's been baked in. You have to spend a lot of time trying to figure out, doing that bisect, what commits actually put this in, what parts of my software are responsible for this performance change. It just makes it harder. And also, if you're at release time, you're under a time crunch to fix all these things now. You want to ship, and so is this regression important enough to deal with right now? Finally, as I mentioned, I was hired into a performance team to test all the things. You don't want a dedicated team that is the only one responsible for performance and that's separate from development. You end up siloed, without a good understanding of what is really going on.

The flip side of this, how to test everything? Well, first off, instead of testing things by hand, you automate everything. If you automate everything, you have a record and can actually do the same thing twice. You want to minimize the noise. Automating everything doesn't guarantee that you get the same answer each time. You've got to invest effort to get a low-noise performance environment. Computers are giant, non-deterministic machines that, if you want to personify them, take glee in giving you different results when you run a test a second time. You've got to put in real effort to control that. You want to involve everyone. Especially if you're in a growing organization, the only thing that's going to scale with your organization is if everybody is helping to test performance. And better yet, if you involve the server engineer who added a new feature, they know that feature the best of anybody. They're gonna know where things could go wrong, what the interesting aspects of it are. You want them involved in testing it. And finally, you want to always be testing. Probably not a shocking concept, but as with any testing, the earlier you find an issue, the easier it is to fix. If you catch it early, it's easier to isolate. It's a more recent thing. The state of the change is in the developer's head. It's easier to work on. Other features haven't been built on top of it yet. It's just a lot easier to fix things if you can catch them early.

Presumably you're gonna talk about this, but you had to roll everything yourself. There isn't anything that does any of this shit, right? That's correct. Yeah, okay. And my joke when I joined the company was explaining to people that the great part about this job is that there's no infrastructure, so we get to define everything and we're not confined by existing stuff, and the awful part about this job is that there's no infrastructure. Yeah, so we spent some time building up infrastructure. At MongoDB, we have sort of six main performance-related use cases that we support.
We want to be able to detect performance-impacting commits when they go into the server; we call that the waterfall. We want to be able to test the impact of proposed code changes. So even better than detecting a performance regression soon after it goes in is catching that regression before it ever goes into the code base. We need to be able to diagnose these performance regressions, so the diagnostics and profiling and such. Release support: before we put out our next release, our next release is gonna be 5.0, how does 5.0 compare to 4.4? Where are we faster? Where are we slower? Are there any blockers to releasing? It's critically important information for us. Adding test coverage: we're always trying to increase our test coverage. That's both workloads and configurations, but in general, you add a new feature, we want you to add performance tests for that new feature. So that's a case we have to support.

Can you roughly say what the coverage percentages are now, for what you guys currently do? I don't have good static-analysis code coverage data right now. And there's a little bit of, there's two ways to do it: what fraction of our lines of code are covered is at one end, and what fraction of our customers are covered is sort of the business end of it. I don't have good answers for either of those today. I would like to have good answers for both of those.

Then finally, performance exploration. We're in the business of understanding how our software performs. And so we'd like to be able to run lots of what-if experiments on it and understand: what happens if I scale up these instances? What happens if I give it a ton of memory? What happens if, just any kind of what-if question. Our customers do everything. So we'd like to be able to answer some of those questions ahead of time. For this talk, I'm gonna focus in particular on detecting the performance-impacting commits, that's sort of our bread and butter, and on the release support. All the other parts really flow off of the same infrastructure, and these are probably the highest-impact and most distinct things that I can talk about out of it. I can talk to any of these things, but those are the ones I have queued up.

All right, so detecting performance-impacting commits. We have our own continuous integration system called Evergreen. And it has a view that shows commits going into the server, and that's called the waterfall view. I guess it's always flowing, I don't know; it's a term that I've always known since I've been at MongoDB, the waterfall. And as we hinted before, when I joined MongoDB, we didn't have any automated performance testing. We had a lot of automated correctness testing. So we wanted to build it into our existing system. In order to have performance testing in our continuous integration system, there are a number of steps that you need to do. You need to set up a system under test. You need to be able to run a workload and measure its results. You need to be able to report those results back to someplace. You definitely wanna be able to visualize the results to see how it's performing over time. You also wanna decide and alert if the performance changed; once you get more than a handful of graphs, you want that automated. And you wanna automate everything and work to keep the noise down. And as I said before, automating everything doesn't guarantee that you keep the noise down. That requires its own special focus.
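As a rough mental model of those steps, here is a minimal sketch of what one per-commit performance task could look like end to end. Everything in it, the function names, the variant name, and the numbers, is an illustrative stand-in, not MongoDB's actual DSI or Evergreen interface.

```python
"""Minimal sketch of a per-commit performance task: provision, run, report, tear down."""

def provision_cluster(variant: str) -> dict:
    # In reality: ask the cloud for instances; here, just a placeholder description.
    return {"variant": variant, "hosts": ["10.0.0.1", "10.0.0.2", "10.0.0.3"]}

def run_workload(cluster: dict, workload: str) -> dict:
    # In reality: drive YCSB/TPC-C/etc. against the cluster and parse its output.
    return {"test": workload, "ops_per_sec": 125_000.0}

def report(commit: str, variant: str, result: dict) -> None:
    # In reality: push the result to the CI system so it shows up on trend graphs
    # and can be fed to the automated change detection discussed later.
    print(f"{commit} {variant} {result['test']}: {result['ops_per_sec']:.0f} ops/s")

def teardown(cluster: dict) -> None:
    # Always release the cloud resources, even if the workload failed.
    pass

def run_perf_task(commit: str, variant: str, workload: str) -> dict:
    cluster = provision_cluster(variant)
    try:
        result = run_workload(cluster, workload)
        report(commit, variant, result)
        return result
    finally:
        teardown(cluster)

run_perf_task("abc1234", "linux-3-node-replSet", "ycsb_100read")
```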
So starting at the top, setting up a system under test and running a workload. We run tests at three levels at MongoDB. At the top, we have system-level tests. We call this Sysperf. These are real multi-node clusters that are deployed in the cloud running end-to-end tests. They cost real money, they take real time, but they're the most representative of actual use. At the far other end of the spectrum, we have unit-level performance tests. We use the Google Benchmark framework to test our C++ code. These run snippets of code, so things like, you wanna test your lock manager or the sharding catalog, you can have a very focused test for those sections of code. These are small, they're inexpensive, they're very, very focused. And in the middle, we have something that...

Google Benchmark, in your experience, have you found, like for our tests, because we use Google Benchmark here, that sometimes randomly things just get slower because of it? You run the same test and everything looks okay, because something's happening with the machines. Like, do you guys, for one PR or one commit, run the suite of Google Benchmarks on one machine and that's enough, or do you run redundant copies of it and then aggregate across?

We run a lot of tests. And the unit-level performance tests are the ones I probably spend the least time thinking about, because they're focused enough that we've pushed ownership of them onto the teams that added them, right? If you're running a test on the sharding catalog, it's only touching the sharding catalog, and so the sharding team can own that. We've done a lot of work. We run those on a range of hardware, and some of that is dedicated hardware that we've tuned by hand to be low noise. We've done a lot of work to do things like turn off hyperthreading, pin to cores, disable frequency scaling. And still there's gonna be noise in those, and some things are noisier than others. It's an ongoing, it's an ongoing challenge. Okay. That all sounds familiar. Yeah. All right, sorry, keep going. All right.

The middle one is what we call micro-benchmarks; they're single-node, CPU-bound tests. So we actually spin up a mongod, but we're running all of the stuff on one two-socket box, the client and the one mongod. We've done a lot of things to control noise. It's dedicated hardware; we have a set of boxes that are identically configured and that we own, and we run the tests on those. For all of them we've done work to control the noise, and I can maybe talk about that a little bit more; we have resources on what we've done there. For this talk, I was gonna focus just on the system-level tests, because it's the most interesting and the hardest part, with multiple nodes in the cloud. When we started, I severely doubted whether we could get usable results running in the cloud. It was a pleasant surprise when we discovered that we could. There's so much stuff that you can't control in the cloud.

Our software to support this testing, to set up the clusters and run the tests in the cloud, is called DSI. It's open source, it's available. It's specific to MongoDB, but the concepts aren't. It's what we use to set up systems under test and to run the workloads: real clusters, automating everything, and trying to get consistent, repeatable results. When we built it, there were several design goals. Obviously automation; I've probably said it five times already in the talk.
It needed to support both continuous integration and manual testing. That's a really important point. It needs to run in continuous integration, but if you want engineers to be able to debug something, they need to be able to run it themselves, so you need the manual support. It runs in the cloud, so if we have a surge of demand, we can scale up and handle that. We want everything to be configurable. You can use MongoDB a lot of ways; there are a lot of options. We wanted to be future-proof and be able to test any of that. And we'll talk a little bit about that configuration; specifically, we picked YAML for it. We've built a little bit on top of YAML also, but all of it is controllable by YAML configuration files. Then it all needs to be diagnosable and repeatable. It's not good enough to know that the performance changed; you have to be able to rerun the tests to understand why the performance changed, so that we can do something about it.

And as part of the configuration, we've built it into a set of orthogonal pieces. Bootstrap is just getting started and doing all the configuration. Infrastructure provisioning is just getting the hardware; you get the cloud resources. System setup, which is really included in infrastructure provisioning today, is setting up that system once you have it: turning off hyperthreading, any of the knobs that you change in the kernel would be there. Workload setup sets up the workload. MongoDB setup sets up the actual cluster on the hardware that you got. Test control runs the test, analysis looks at what happened, and then we need to be able to tear down the hardware afterwards.

I said all of this is controlled by YAML files. Here are two configuration files that we have. These are slightly simplified, but are basically correct. On the left, we have a configuration file for a MongoDB setup. It's a three-node replica set. At the top, MongoDB configuration files are themselves YAML files, and we have a YAML blob for the default configuration to use for this cluster. We can actually override it so that one node has a different configuration than the others, but by default, they're gonna use the same thing. And because this is YAML, we can put any MongoDB configuration option in here; it's just a matter of changing this file. The topology shows the actual cluster. So we have a replica set here. It has three nodes. And here we actually have a reference to the infrastructure provisioning layer. We need three nodes, and all we need to know from the infrastructure provisioning level is: give me the IP addresses of three machines. So this funny notation with the dollar sign and the braces is going to substitute in that information, which was an output from the infrastructure provisioning level. It doesn't care what nodes were provisioned, it just cares that there are nodes. So you could change the instance type, you could change the operating system, you could run Graviton instead of Intel; it doesn't matter. It just needs to know which IP addresses it's talking to. And at the bottom, it has some outputs that are passed on to other things. Most notable here is the MongoDB URL. So, ah, I didn't mean to click. If you have a MongoDB cluster, you have a connection string, and this URL is the connection string for this cluster that's going to come up.

On the right, we have a test control file. It's showing YCSB. And again, it's all YAML. We have two phases, the load phase and a 100% read test. Embedded in it is this workload config. This is just going to be a text file, a text block, that's going to be written out to this file, workloadEvergreen, and we put all of our YCSB configuration there. It's written out to the file, and that's the file that's actually used when we call YCSB. You'll notice that in order to run YCSB, you need the connection string, and so here we have the link back to that MongoDB URL from the MongoDB setup. That's the only point of connection I need between them. So I can run the same test control against this three-node replica set that I have here, or a three-shard cluster, or a standalone cluster. They'll all have this output URL, and that's all that it cares about. If I want to change anything, I can just change it here. I want to change the write concern, I can change the URL. I want to run more documents, I can increase the record count. It's all controlled through the YAML.
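To make the linkage between those files concrete, here is a sketch of how the `${...}` substitution could tie the modules together. The file contents and key names below are simplified, hypothetical stand-ins; DSI's real files are richer. The mechanism, one module's outputs being referenced from another by a dotted path, is the idea described above.

```python
import re
import yaml  # PyYAML

# Illustrative stand-ins for the modules described above (keys simplified/hypothetical).
modules = {
    "infrastructure_provisioning": yaml.safe_load("""
out:
  mongod:
    - private_ip: 10.2.0.10
    - private_ip: 10.2.0.11
    - private_ip: 10.2.0.12
"""),
    "mongodb_setup": yaml.safe_load("""
out:
  mongodb_url: mongodb://${infrastructure_provisioning.out.mongod.0.private_ip}:27017/?replicaSet=rs0
"""),
    "test_control": yaml.safe_load("""
run:
  - id: ycsb_100read
    cmd: ./bin/ycsb run mongodb -P workloadEvergreen -p mongodb.url=${mongodb_setup.out.mongodb_url}
"""),
}

PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")

def lookup(path: str):
    """Follow a dotted path like 'mongodb_setup.out.mongodb_url' through the modules."""
    node = modules
    for part in path.split("."):
        node = node[int(part)] if isinstance(node, list) else node[part]
    return resolve(node) if isinstance(node, str) else node

def resolve(value: str) -> str:
    """Replace every ${module.path} placeholder with the value it points to."""
    return PLACEHOLDER.sub(lambda m: str(lookup(m.group(1))), value)

print(resolve(modules["test_control"]["run"][0]["cmd"]))
# -> ./bin/ycsb run mongodb -P workloadEvergreen -p mongodb.url=mongodb://10.2.0.10:27017/?replicaSet=rs0
```

Because the test_control side only ever asks for `mongodb_setup.out.mongodb_url`, swapping in a sharded cluster or a different instance type changes nothing on the workload side, which is exactly the decoupling described in the talk.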
So that's the workload, setting up the infrastructure and running it. Then I need to report the results and visualize the results. And that's going to be the same across all the three layers that I talked about before. So the results get reported back to Evergreen, our continuous integration system. And here is actually that waterfall I was talking about. Let me, yeah, there's a lot going on here. This is a giant grid. Each large column represents one commit into our source repo on GitHub. The most recent one here when I took this snapshot was an import of WiredTiger. So you see this: you've got one commit, got another commit, have a group of commits that didn't run, we don't run on every commit, and another commit. The rows are different variants. On the correctness side and the build side, this would be different OSes and different configuration options. For us here, it's more like the different configurations that we would run MongoDB in. So down here, we have a three-shard cluster that we run against. We have a three-node replica set. We have a one-node replica set. Don't use one-node replica sets. And a variety of things. And then each box represents a task, and a task is just a collection of tests that ran on that commit, on that configuration. So I can zoom in: green means that it passed, red means that it failed. We have a few other colors. We can come back to the question of what it means to pass a performance test in a moment.

But then we can zoom in on one commit. So this is everything that ran for that last commit we looked at. You can see all the variants again, and all the tasks. Of note over here, we have a manifest: we've captured the git hashes of every project that we've used to run this test. So not just the revision of MongoDB, but also the revision of DSI, of our test infrastructure stuff, and a few other things. YCSB is in there, TPC-C; we pin all of that. And then we can drill down one layer further into a task. So I said a task is a collection of tests. Here we have a collection called CRUD workloads. It runs things like inserts and queries on a three-node replica set. You can see at the top here, we have the commit number. If you're paying real close attention, you might notice I've switched to a different commit. You can see how long it took to run the tests, when it ran, what host it ran on, all sorts of stuff. On the right, we have pass/fail stuff, and these are mostly correctness tests in here. So if our test runner threw an error, we would report it here: the test failed for some reason, the test couldn't run. If we get a core file, we check for core files, that's a fail.
We check for a handful of interesting things. If you get an error message in the log files, we'll fail it. But for the most part, we handle the actual performance results separately, and I'll get to that in a second. Of note here are the files. We have a DSI artifact: we save all the log files and output from the run, save it to S3, and link it here. So we get all the diagnostic data if we wanna go look at it. And then at the bottom, we have actual trend graphs. We see performance over time. And we can zoom into this a little bit more. I actually picked these on purpose because they're particularly interesting.

So on these top two, the first thing that you should notice is there are no flat lines here. Ideally, if we could control all the noise, we'd have flat lines punctuated by changes to other flat lines. But we don't have that. So even with all that we've done to control noise, we have ups and downs. But then on these, we have a very clear drop here, it stays down for a while, and then it goes back up. And I should add, the dotted line is just a comparison that I've added on. It's a feature of the UI: I can pick a point and say, put in a comparison all the way across for that particular point. I've added it here so I can see that the performance before this drop and after the drop is about the same; we've basically fully recovered this. And the diamond, actually, we track performance changes in Jira tickets. It's a private Jira project, but we track them in Jira. And the diamond is a link, a notification that there is a BF ticket associated with this drop. And if I hovered on it, I could actually, and this wasn't a screenshot, I could actually get the ID of the BF ticket and a link to it and see exactly what happened.

Okay, I have a question. Sorry. Yeah. I'm a grad student here. It's difficult for me to infer the numbers from the graph, but I'm curious, do you have roughly an idea of what would be, let's say, the average noise you are getting here when you run those things? And would you have a kind of acceptable range of how much that noise would be?

Yeah, that's a great question. And I don't know the number right here on this one. I would guess that it's maybe on the order of, hopefully under 5% on this one, but there's a wide range. We normally would like to be able to easily detect 10% changes on most of our tests, but there's a range. We've had discussions, we have some tests where they're just inherently noisy, but being able to detect a 50% change is still useful, and so we've kept them. But dealing with that range itself is hard, and a problem. And actually the next section is gonna talk about work that we've done to handle that. And I should add, for these graphs here, higher is better. That's an interesting thing that got baked into our system at the beginning that we're trying to relax. Originally everything was always throughput, and higher was better. We're reporting more and more latencies now, and for latencies, lower is better. So we're getting there.

This lower graph is fascinating for a different reason. We see we had steady performance, and then performance went up, and then it did this thing, and then kind of went up over time. We normally don't like to see graphs like that; that's really strange. I actually know exactly what this is, though. This first bump up, we turned on a new feature. This feature was replicate-before-journaling. It was a big performance improvement, and it did improve performance.
Its first version on the development branch had a bug, and that change got reverted here. So that's why it went back down. And it didn't just get turned back on in one go later on; it got turned on in a couple of goes. So here was when we first did it, a revert, and then finally here we had all of the parts of that project turned on. And that affected all three of those there.

So from an engineering standpoint, not a human engineering standpoint: you add this feature, it makes things faster, and then, oh, it got reverted. So then you turn it off, and now all your alerts freak out because things got slower. Do you disable them temporarily? Or do you just tell people to ignore them?

So I'm gonna get into how we actually do alerts, but ideally what we do there is we document it. We have a BF ticket. We say this is what happened, we know why it is, and we close it. But if someone comes back in later, and that's exactly what these diamonds are here for, if someone comes in to try to understand what's going on, we want them to be able to find out very quickly.

So that's reporting the results and visualizing them. As your great questions are getting at, the hard part then becomes deciding if the performance changed. And I would add, I did a search not too long ago: for a full build, we have 300 such tasks to look at, representing about 3,000 distinct tests and about 6,000 distinct results that you could get for one commit. So you don't wanna do that by hand. When we first turned it on, we did, but after you get through a handful of things to look at, it just breaks down, and humans are bad at staring at graphs; our eyes glaze over really fast. So we had a paper last year at ICPE on what we did, and I'll talk a little bit about it. I need to click there, cool.

So we sort of had three iterations of how we test performance, I mean, of how we decide if the performance has changed. The first version is: have humans look at graphs. As I said, we have a lot of graphs; that is not a good system. Then, on our line of automate everything, we went to the simplest thing that you could think of, and it's probably used in a lot of places, which is: we used a threshold. We set up a system that compared the results to the previous run and a baseline, and if it dropped more than 10%, we flagged it and had someone go look at it and figure out what was going on. And compared to looking at graphs, this system was great, but overall this system was soul-suckingly bad. There were a lot of false positives, because some tests are noisier than other tests. Finding the right threshold so that you catch the important stuff but don't have too many false positives is really hard. We actually built a system on top of it to tune it per test, and it was fiddly and bad. Worse, when you pick that level, you miss things that are smaller than that level. Worse still, it would identify things at the wrong time. Say I had a 10% threshold: if I have an 8% change, that's not going to trigger it, but if I have an 8% change plus 4% noise three weeks later, I'll get a signal three weeks later. And now I have to go back through and figure out, well, when did it actually drop 10%? It was not good.
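Just to make that failure mode concrete, here is a toy illustration, with made-up numbers, of how a fixed 10% threshold lets an 8% regression slide until noise finally pushes a later run over the line, weeks after the offending commit.

```python
# Toy illustration of the fixed-threshold approach and its delayed-detection problem.
baseline = 100_000.0  # ops/sec from the reference run

def check(run_value: float, threshold: float = 0.10) -> bool:
    """Flag a regression if this run dropped more than `threshold` versus the baseline."""
    return (baseline - run_value) / baseline > threshold

runs = [
    ("week 0 (before change)", 100_500.0),
    ("week 1 (8% regression lands)", 92_300.0),    # real change, not flagged
    ("week 2", 92_800.0),                          # still not flagged
    ("week 3 (8% change + ~4% noise)", 88_700.0),  # finally flagged, weeks late
]

for label, value in runs:
    print(f"{label:35s} {value:9.0f} ops/s  flagged={check(value)}")
```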
So we moved to the third system, which was: let's use modern math to solve this. And we had fun reading some papers. Actually, an intern of mine saved me, because I had a stack of papers to read, and there was one paper where I thought, I think this is the one, but I never quite had the time to read it and give it the attention it needed. And so my summer intern, who's now in grad school at Columbia, read the paper, skipped past the scary math, saw that right after the scary math was simpler math that could be turned into for loops, and at the end of the summer, he did a prototype of it that got us on our way.

So the math is a thing called change point detection. When you think about what we're doing with testing performance in our CI system, what we wanna do is detect which commits changed the performance in the presence of noise. And that is exactly what change point detection is: the process of detecting changes within time-ordered observations. So we invested time, did a prototype, and now have in production a system using an algorithm called E-divisive means. There are lots of change point detection algorithms; this is the one that we used, and it has been working well for us. Ah, a scary slide, I'll stay here a second. E-divisive means has a nice feature, which is that it really works best if you have a constant plus noise, which is what we normally have. We don't have a system with drift or periodicity. We don't have people logging in and out; the performance doesn't change during the day as more people come on or go off. And so this algorithm is really good for our use case.

So it identifies commits, and then we need to go through those commits to figure out what's going on. And so we built a dashboard. I'll talk through this, don't try to read it all at once. But basically we're gonna group all of the things that changed on a particular revision, put them together, and put in summary information. The top half is showing the things that haven't been processed yet. So we have this unprocessed list. We have humans that look at this board and process it. Is it real? Is it not real? Open a ticket if it's real and move on. Because those humans were doing a good job, we're left with the really uninteresting ones on the top. But I can also look at what's been processed at the bottom. And so you can see, here's a commit. We have a lot of things that got slower and faster that are all dealing with Linkbench here. It's a three-node replica set, one-node replica sets, across a few varieties. All of these blue things are links, and I can click through to them to get to those graphs that I was showing before. So the Build Baron, the human that's looking at this, after they acknowledge it, well, actually before they acknowledge it, they're gonna click through and look at the graph to try to figure out what's going on. And if they did, it would take them to a graph like this. Here you see the trend graphs over time. The yellow dot is the commit that I clicked from. And that green highlighted line, that's the change point. So we've annotated the graph with the change point, so we can see where the algorithm thinks changes have gone in. So the Build Baron would come in and say, that's a pretty clear change. They would go rerun commits to isolate it down to a single commit where the change happened, open up a build failure ticket, and assign it out to whoever put that change in.
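For the curious, here is a deliberately naive sketch of the idea behind the split statistic that E-divisive builds on: score every candidate split of the series by the energy distance between the two sides and pick the best one. The production approach described in the ICPE paper and the open-source signal-processing-algorithms code computes the "means" variant far more efficiently, finds multiple change points hierarchically, and uses permutation testing to decide significance; the series and numbers below are made up.

```python
from itertools import combinations

def _mean_abs_diff_between(xs, ys):
    # Average |x - y| over all cross pairs.
    return sum(abs(x - y) for x in xs for y in ys) / (len(xs) * len(ys))

def _mean_abs_diff_within(xs):
    # Average |a - b| over all pairs inside one segment.
    pairs = list(combinations(xs, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

def split_score(series, tau):
    """Energy-distance-style score for splitting the series just before index tau."""
    left, right = series[:tau], series[tau:]
    m, n = len(left), len(right)
    e = (2 * _mean_abs_diff_between(left, right)
         - _mean_abs_diff_within(left)
         - _mean_abs_diff_within(right))
    return (m * n) / (m + n) * e

def best_change_point(series):
    """Return (index, score) of the single most likely change point."""
    candidates = range(2, len(series) - 1)  # keep at least two points on each side
    return max(((t, split_score(series, t)) for t in candidates), key=lambda p: p[1])

# Throughput (thousands of ops/sec) per commit: steady near 100, dropping to ~90 at index 10.
series = [100.2, 99.5, 101.1, 100.8, 99.9, 100.4, 101.0, 99.7, 100.3, 100.6,
          90.4, 89.8, 91.2, 90.1, 90.7, 89.5, 90.9, 90.3]
idx, score = best_change_point(series)
print(f"change detected at index {idx} (score {score:.1f})")
```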
And so this is the build failure ticket that was created. It is assigned to me, but that's because we had warning before this commit went in and I knew exactly what it was. But you can see we get a lot of metadata. There were a lot of tests actually impacted by this. It's a perf improvement; we like improvements. This was the same thing. And you can see very quickly: the change was on February 27th, we created this ticket on March 2nd, and we resolved it on March 3rd, because we knew what it was and we went on. We were near release at this point. This is nice and easy. And this was actually the same thing that I showed you in the graphs earlier; this was the replicate-before-journaling change. So we can very quickly identify things as they're going on and address them. Yeah, so that's the change point detection and decision. We have a full paper and recorded presentation on that as well. And it's an area that we're still working on, and I think I have at least one person on the call who, if he hasn't worked on it yet, probably will be working on it in the future.

All right. So that was all how we catch regressions in the waterfall. The much shorter last part is the release support: how we decide whether we can release this software or not, at least for the performance part. We wanna be able to answer several questions. How is the performance? How does it compare to the last release? How many open issues are there? Are they getting fixed? Are they stuck? And do we have coverage for the new features? And we wanna review this regularly throughout the release cycle, and especially as we get close to release.

So we have, yeah, as I said, a lot of tasks and a lot of results. We built a dashboard to let us look holistically at all the results between two commits. And that's what we have here. So here is a snapshot at some point. This was a recent build in Sysperf, and we're comparing it to what was the most recent stable version, 4.2, at the time. It has all the results that show up between those. So you can see what the build was, the build variant, what the task was, what the test was, its configuration. And then this ratio is a percentage: it's just, straight up, what was the performance of one compared to the other? I've sorted this from best to worst, so you're seeing the really good things right now. We have a bunch of things that got a lot faster, and you can see a commonality: there's a lot of bestbuy_agg stuff. So I can actually say, okay, take this view but show me only the things that are in the bestbuy_agg tasks. So I've simplified it down, and I can get a sense of what's going on with those. And I can also do the negative filter, which is what I would do next, actually: remove that and see what still remains. And so in 4.4, we had a bunch of improvements to aggregation. These were all simplifications that allowed us to use a faster path, and some set of simple tests get a lot faster when you do that. And so we knew this, we've documented it, and we're able to talk about it when we release the software. I do look at the other direction also, looking at what is currently slower, and I can click through and look at that also. So that's the board.

And I said we automate everything. We try to automate everything. We also have humans in the loop, so we periodically review everything. Every day we have Build Barons looking in and triaging things.
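To make that comparison board concrete, here is a rough sketch of the ratio view it presents: the candidate build's result divided by the baseline's result for each test, sorted so the big wins and the regressions stand out, and filterable down to one task family. The test names and numbers below are made up.

```python
# Made-up baseline (stable release) and candidate (recent build) results, in ops/sec.
baseline  = {"bestbuy_agg_merge": 41_000, "crud_insert_one": 152_000, "ycsb_100read": 310_000}
candidate = {"bestbuy_agg_merge": 66_000, "crud_insert_one": 149_000, "ycsb_100read": 268_000}

# Ratio > 100% means the candidate is faster on that test (higher is better here).
ratios = sorted(
    ((test, candidate[test] / baseline[test]) for test in baseline),
    key=lambda item: item[1],
    reverse=True,  # best improvements first, regressions at the bottom
)
for test, ratio in ratios:
    print(f"{test:20s} {ratio:7.1%}")

# The dashboard's positive/negative filters amount to filtering on the task name.
agg_only = [r for r in ratios if "bestbuy_agg" in r[0]]
not_agg  = [r for r in ratios if "bestbuy_agg" not in r[0]]
```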
Weekly, someone goes through to make sure that things are ticketed and that tickets are being moved along and worked. And monthly, we have a review to see where we stand relative to the previous release, whether things are being worked, and we can escalate things that need to be escalated to get the right people into the discussions. A lot of times things just get faster or just get slower; those are easy. Sometimes there are trade-offs. You add in logical clock support, which is desperately needed in order to support transactions; there's overhead to that. So you talk about what the acceptable level of overhead for that is. That went in several releases ago, but we had those discussions. Sometimes things make some things faster and other things slower, right? There are trade-offs. And, you know, a computer's not gonna answer those. You need people to discuss and come to an answer about what the right solution is on those parts. But in general, we have humans in the loop, and we probably will always have humans in the loop. We're always looking for that next piece that we can pull out and automate and simplify the process.

All right. So that's release support. A talk like this is always interesting because it's always a moving target. We're always trying to automate that next step. We're always trying to make it as efficient and high-leverage as possible. So this is sort of the pitch part of the deck, just one slide. I give talks like this in part because I want people working on our problems. We have real problems. We'd love to work with people on them. We're happy to talk about what our problems are and try to share things. I have a list of related work here. The noise reduction work was a series of blog posts that we did about everything we did to control noise in the cloud when we ran benchmarks. We've had some more recent papers talking in detail about exactly what we're doing at MongoDB with the infrastructure. Our code for the most part is open source. The signal processing algorithms and infrastructure and DSI are both open source and available; they're linked there. The regression environment, Evergreen, is open, and the platform is open source. The performance results in that regression environment are not open, I'm sorry about that, but we're working to share data with folks. If you've got an interesting problem and you think you could do something interesting with that data, I can get you access to the data for that. So we'd love to have people do that. And even if you just align loosely, if our data will help you advance your research, reach out. We'd love to have MongoDB more often as proof points in people's papers. And that's what I have. Thanks for listening. Thanks for the questions along the way. Hopefully maybe there's one or two more interesting questions.

I have a lot of questions. So anyway, on behalf of everyone else, thank you so much. This was very interesting. So we'll open it up to the floor. If you have a question for David, please unmute yourself and ask it. There you are, of course.

Hi, I'm Kevin from California, formerly an architect at Crunchyroll. I had a question about production cluster configuration. Given that MongoDB has hosted service offerings, does the ops team also use these automated performance tests to tune production cluster configuration, like maybe comparing io_uring to AIO for evaluating the return on a Linux distro upgrade?

Yeah, so that's something that my team, actually, my team has evolved over time.
And so I have a team now that gets to play with all these tools, and other people get to own them. And so we have a group of people who, amongst other things, can go and ask questions like that and do those investigations. So we would do that for the cloud team in conjunction with them. We've done a number of things like that, and we want to do more. We can run experiments with this. We can actually aim this exact infrastructure at Atlas itself, our managed offering, and run tests directly there also. And there's a whole host of stuff we want to take advantage of from the fact that we have a managed offering, so we have lots of information, and we're talking about the best ways to leverage that.

Oh, so it sounds like it's bidirectional, very cool, thanks. Yeah.

But I mean, that io_uring versus AIO comparison, that requires a human to actually go write that, right? Like, that's not just clicking a button. But what about, presumably you're doing sweeps across all different versions of GCC or Clang, all different versions of the Linux kernel. How widespread is all of that?

Yeah, so we're narrower than I would like to be, and that's a focus area for me. You get a combinatorial explosion as you look across these things. And so we have depth in some dimensions and a lack of depth in other dimensions. And we've been putting, or I've been putting, a lot of thought into how to best expand our coverage in those things without boiling the ocean. I would love to have coverage across all of AWS's instance types, all of the OSes that people run us on, all of the configurations, and all the clouds. But I have a finite budget and I don't wanna boil the ocean. There's a lot of neat research going on in the performance world on stuff like this, and we're trying to leverage it and build a smarter system.

I used to work on the Condor project in Wisconsin, and we had to support a bunch of supercomputer centers that were running, like, Alpha machines or HP boxes. And we had that in our CI pipeline, which was a huge pain in the ass, because the Alpha machine always broke and someone always had to go down, and it was really the new person, that was me, go down and hit the reset button every time it crashed. I hated that thing. Okay, any other questions?

So one interesting thing about MongoDB is that you have a heavy dependency on an external or semi-third-party project, WiredTiger. You guys own it, you guys control it, but it's sort of separate; it has a separate release process. So my first question would be, does WiredTiger use all of the Evergreen infrastructure you guys built to test themselves? And then, how do you make sure that WiredTiger is vetted before it's introduced into the MongoDB ecosystem?

Yeah, it's a big thing. So WiredTiger came with a lot of its own infrastructure, and over the years they've been migrating over to Evergreen. So some of the stuff is in Evergreen, some of the stuff is still in their old Jenkins, and we're trying to pull them along and get further with all those things. We also work closely with the team, so they know the tests, they know the workloads; there's often a lot of interaction there. The WiredTiger drops are fascinating. As I said, it's vendored in, and vendoring something regularly but infrequently works sort of counter to that thing I was talking about, of test often and test early. Because you have fewer drops, it's harder to isolate things going across, because there are a lot of commits in each drop.
And we've worked a few different ways to deal with that. For a while, we ran a separate variant that would just pull the WiredTiger development branch and run it all the time. We're moving towards much more frequent drops from WiredTiger into MongoDB. For very good reasons, WiredTiger is gonna remain its own project for as long as I know, but we can make that, from MongoDB's perspective, look like one code base. We can do translation, we can make it so that every commit shows up, and we're working towards that. That's where I would like to be. Interesting, okay.

Actually, out of curiosity, you mentioned that you started at Mongo in 2014 and it didn't have any automated performance testing. Was Evergreen already started when you started? Was that something you took on?

Evergreen was well established already at that point for building the server and for internal testing. So when we talked about going to a continuous integration system, there was one choice.

Okay, so out of curiosity, and this predates your time there, but that's a major undertaking for a database startup, to say, okay, let's go build our own CI pipeline. Do you know why they didn't pick Jenkins? I guess before that it was Hudson. There weren't as many tools back then as there are now, but why roll your own? What was missing?

I don't have deep insight into what we thought was missing. We obviously thought something was missing. Sure. But I don't know the backstory so well there. Okay, that's fair.

All right, my last question would be sort of a more forward-thinking one. So you showed the bisection of the commits, where you were really identifying, okay, this commit caused the error. That's super useful as it is by itself. Have you guys looked into automated methods for identifying the lines of code within that commit that potentially caused the problems?

That is the dream. We're always walking before we run, but yes, we would love to have much smarter analysis in there. Anything that's gonna save people's time is gonna be useful. We have code coverage support on the correctness side of things. I'd like to use the code coverage more. I'd love to correlate tests to, what tests fail when different parts of the code get changed. Yeah, that'd be awesome, yeah, yeah. Yeah, oh, I have dreams, I have grand dreams. I'd also like to know things along those lines, again, which aren't close, but you could imagine correlating all sorts of things against the probability that something introduces a correctness bug or a performance change. And actually, some people have done this in places. It's pretty cool: basically putting up a score. So you put up your code review for your server change and it puts a score next to it and says, hey, you're high risk. You're high risk for these reasons. You need to test this more; consider breaking it down into multiple low-risk changes, whatever. Oh, it's the dream.

It's like an Uber score for developers: like, you're a bad Uber rider, you're a bad driver, yeah. So we're collecting data; we'll work there. There's plenty to do to keep us busy for a long time.

Okay, let's end on that, because that one is amazing; I'd never thought about this. Basically, don't shame the developer: hey, look, you've crashed out before, maybe you try harder this time.
But not necessarily shaming, more guiding: when you've made changes that looked like this in the past, stuff has broken. Do you wanna take another look at that? It's still awesome, yeah. Okay, David, thank you so much for doing this. This was awesome.