All right, today we're pleased to have Jinho Jung. He is a fourth or fifth year PhD student at Georgia Tech. He is working with Taesoo Kim, who's his main advisor, who is a CMU alum, but for this particular paper, they're going to present the Apollo framework that they developed, which was published in VLDB 2020 with my former student, Joy Arulraj, who is also a CMU alum. So this is a very CMU-flavored talk, even though the work was all done at Georgia Tech, right? And to be clear, you're not a database person, you're a security person by trade, but you're exploring the beauty and the majesty of databases, is that correct?

Right, it's correct.

Okay, all right. And again, the way we'll do this is that if you have questions, please unmute your mic, interrupt, and ask your question, but be sure to say who you are and where you're coming from, so we know who's who, okay? Okay, all right, go for it, thank you.

Okay, thank you. Thank you, Andy, for the kind introduction. I am Jinho Jung, a fourth year student at Georgia Tech, and today I'm going to introduce Apollo version one, a system to automatically detect and diagnose performance regression queries in database systems. I named it version one because we are still working on the next version of the system to handle multiple classes of bugs. This work is joint work between Georgia Tech and eBay, and it also combines work from the database and security areas. By the way, I work mainly on the security side, and I have especially focused on fuzzing techniques and anti-fuzzing techniques. Fortunately, I met Joy at Georgia Tech at the beginning of Joy's career, and Joy proposed several research items to my advisor. And my advisor thought, okay, we can apply several techniques from security to database problems. So that's how we started our research.

As the name implies, Apollo is a holistic toolchain for debugging database systems. First, we automatically try to find SQL queries which show performance regressions. Second, we also want to automatically diagnose the root cause of the problem for the developers. Our motivation begins with the observation that code bases are getting larger. For example, for Postgres and SQLite, the size has increased by seven times within 20 years, which also means that the internal relationships between components are getting more complicated. So it is now challenging to build a system with predictable performance.

Let's assume one scenario. A user of the database installed everything on one specific database version, and recently upgraded the DBMS installation. But queries suddenly take much longer to execute. Like this Postgres example, from three seconds to 12 minutes. And in this MySQL example, a public bug report says that a query that used to execute in one second now takes approximately three minutes. It's because of unexpected interactions between different components. We refer to this behavior as a performance regression. A performance regression can be a critical problem because it hurts the user's productivity: the user expects a very short, interactive query, but it can become an overnight one.

Let me show one simple example. This query looks very simple, but it shows a more than 10,000 times slowdown on the latest version of Postgres. The reason is an optimizer update: the developers made a new policy for choosing the scanning algorithm.
The latest version chooses the bitmap scan only if it estimates that there is enough memory for it. But this query was overestimated by the optimizer: it estimated a large result size. So the latest version chose the slower sequential scan in this case, and the result is that it evaluates all the predicates on all the tuples in the table. That costs the more than 10,000 times slowdown.

Okay, sorry, can I interrupt? Can you go back to the slide? So I read your paper, it's interesting, and I read this particular example. I'm just trying to understand what is actually happening here, because according to my understanding of TPC-C, this is an empty query — it reports nothing, or zero.

Yes, right, the result is empty. But the optimizer's estimation is an overestimate, especially due to the least() function here. When they tried to do constant folding, they made an error, so it resulted in overestimation. Both the old version and the latest version have the same problem with overestimation, but the result of the overestimation affected them differently. Even though the earlier version overestimated, it still used the fast bitmap scan.

Okay, okay, sorry. I forgot one thing — as Andy said, you can interrupt at any time, so I'm interrupting myself. This is the first time I'm joining this. My name's Dong. I'm graduating from the University of Utah, and I'm going to be an assistant professor at Penn State University. That's my brief introduction. Anyway, back to the question. First of all, you said that in both versions the constant folding of the least() expression doesn't work, so it doesn't really report zero; it reports something else. From that perspective, does that mean that no matter how hard you try to build min/max boundaries or histograms, it doesn't help? It's just a very bad estimation versus an even worse estimation. Is that what you mean?

Hold on, let's keep going. This is just showing you that you can have performance regressions. The implementation of the optimizer is not relevant to the point he's trying to make right here.

Okay, cool, cool. I will leave it be.

Okay, I'll keep moving. From here, we can observe several challenges. The first and most important one is how to discover the queries with regressions. A second challenge is for the developer: when the user discovers some weird or regressive behavior, how do we minimize the query down to the essence of the problem? If the user just sends a lengthy query to the developer, the developer has to spend time narrowing down the problem. So we wanted to automatically minimize the size of the query while preserving the problem. And the final question is how to diagnose the root cause of the regression in the database system automatically.

To tackle these problems, we propose the Apollo system. First, our system accepts database connections to at least two versions of the database; in this case, we show an older version and a newer version. To tackle the first problem, discovering queries with regressions, we introduce the SQL fuzzer component, which adopts feedback-driven fuzzing. To answer the second question of how to minimize the query, we introduce the SQL minimizer, which uses a bi-directional query reduction algorithm.
To diagnose the problem automatically, we introduce the SQL debugger, which combines statistical debugging and commit bisection.

Now let's begin with the Apollo toolchain. The first component is the SQL fuzzer, which tries to detect performance regressions. This diagram shows two database connections, one to the older version and the other to the newer version. But in our actual evaluation with Postgres, we used four different connections, because Postgres maintains four different versions at the same time, and we wanted to find any regression between those versions. First, the query generator creates random queries based on the SQL grammar probability table. Then the executor runs each query, measures the time, and compares. If the difference is larger than a threshold value, we keep the query as a candidate. And finally, we validate the query to remove any false positive cases. So that's the first component, the SQL fuzzer. Okay, go ahead.

Hi, this is Juan. I'm a student at CMU. Earlier you said a query qualifies if its performance difference goes beyond some threshold. How did you pick the threshold?

Actually, this came from empirical analysis. First we did a case study on performance regressions using the bug reports so far, and we noticed that the difference should be at least three times for it to get acknowledged by the developers. So the answer is that we did a case study and set the threshold empirically.

Okay, thank you.

Great, so you're saying the developers only pay attention if your query got 3x slower?

Yes, right. A 50% regression they tend to ignore — not completely ignore, but they tend to ignore it. So it is difficult to get attention for those.

Yeah, so just to add on — I'm Joy Arulraj, also from Georgia Tech. Just to add some context, the absolute time also has to be higher than some threshold for the developers to take note. We learned a lot by interacting with the developers of both MySQL and Postgres; Jinho interacted with them extensively to figure out a bunch of rules for filtering these queries.

And what is the absolute time where they'll actually pay attention?

Usually it should be some minutes: the fast version of the query should take one or two seconds, and the slow version should take at least some minutes. If it is milliseconds, it's difficult to get attention.

Are you going to talk about how you generate the query? That's the next slide, the fuzzer?

Okay, all right. So let me show how we generate the query. First, we retrieve the schema from the database, and we also read the initial SQL grammar probability table. Based on these probabilities, we generate random queries. After we generate a valid set of queries, we also check the complexity of each query — for example, whether the query uses too many joins or too many subqueries. I will explain later why: if the query is too complicated, the developers complain about its realism. If the query is not realistic, they think it is too artificial, so it would not count as a performance regression. That's another heuristic that we learned from the developers.

Hi, this is Dan. I'm a CMU student as well. I was just wondering if the probability table is just to ensure that they're realistic queries, or was there something else done with that?

Okay, I will explain using the next slide, okay?

Oh yeah, never mind. Okay, thank you.
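To make the generation step concrete, here is a minimal sketch of grammar-probability-driven query generation in Python. The table contents, the complexity limits, and all names are illustrative assumptions, not Apollo's actual code; a real generator would also validate each query against the server and discard invalid ones.

```python
import random

# Hypothetical grammar probability table: the chance of emitting each grammar
# element. Apollo's real table covers the full SQL grammar; this is a toy subset.
PROB = {"JOIN": 0.3, "SUBQUERY": 0.2, "GROUP BY": 0.3, "LEAST": 0.2}

MAX_JOINS, MAX_SUBQUERIES = 4, 2   # complexity limits learned from developer feedback

def coin(elem):
    """Emit grammar element `elem` with its current probability."""
    return random.random() < PROB[elem]

def generate_query(schema, depth=0):
    """Generate one random SELECT over `schema`, a {table: [columns]} dict."""
    table = random.choice(list(schema))
    col = random.choice(schema[table])
    query = f"SELECT {col} FROM {table}"
    if coin("JOIN"):
        other = random.choice(list(schema))
        query += (f" JOIN {other}"
                  f" ON {table}.{col} = {other}.{random.choice(schema[other])}")
    if coin("SUBQUERY") and depth < MAX_SUBQUERIES:
        # Nest a random subquery in the WHERE clause.
        query += f" WHERE {col} IN ({generate_query(schema, depth + 1)})"
    elif coin("LEAST"):
        query += f" WHERE {col} > LEAST(0, 1)"
    if coin("GROUP BY"):
        query += f" GROUP BY {col}"
    return query

def too_complex(query):
    """Discard unrealistic queries (a heuristic the developers asked for)."""
    return (query.count("JOIN") > MAX_JOINS
            or query.count("SELECT") - 1 > MAX_SUBQUERIES)

schema = {"orders": ["o_id", "o_total"], "items": ["i_id", "i_price"]}
candidate = generate_query(schema)
if not too_complex(candidate):
    print(candidate)
```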
After we create a set of queries, we run them using the executor. If we find any regression, we parse the query and count each grammar class that appears in it. Then we update the table, so that if one class is more likely to cause regressions, in the next round we use that class more than in the previous round, thereby increasing the probability of finding more regressions. That's how the SQL grammar probability table works for query generation and execution.

Hi, just to interrupt here — this is Dong again. I'm trying to understand: if there are two things with a dependency, for example CAST and LEAST at the same time causing the problem, how can you keep track of those, rather than just bumping up CAST by one? A query can be slow and cause a regression not because of one operator but because of a combination of multiple.

I see, it's an interesting and good point, but we didn't consider that problem. In this case, we calculate the frequency of every class, so we don't consider the dependencies.

So your probability table would also have LEAST in it, right? My problem is that one regression may be caused not only by CAST; CAST may be just one reason. And every time you generate a slow query, you bump everything up by one — but maybe among, say, a hundred operators, only two of them actually cause the issue.

It only happens when it's fuzzing, though. That's why you just run this forever, and eventually you would see: all right, it got slower when it was just CAST and LEAST by themselves and not the join.

Right, because we are running a massive number of queries, there could be cases with LEAST and cases without LEAST, so we can differentiate them.

Okay, so you mean that as you run long enough, it can actually see the difference over time. Is that what you mean?

Yes, right, you're right. And actually I didn't explain it on this slide, but we explain it in the paper: if we rely on the probability table too much, the probabilities overfit to specific classes. So in our actual evaluation, to prevent the overfitting, we reset the probability table periodically, like once a day or so.

Okay, thanks.

This is Dan again. I was just curious if you also track speedups as well as slowdowns, because I imagine when you're building a database you're occasionally making some sort of trade-offs.

Actually, no, we only check for slowdowns in the newer version.

Okay, thank you.

But I think that is also an interesting aspect. Okay, thanks for the question. After we discover the set of regression-inducing queries, we have to remove the false positive cases. In our evaluation we applied a set of filtering rules and then reported the cases that we really believed to be performance regressions. The developers also gave us feedback to update the filtering rules, so we could keep improving them. This is a very important process for removing false positives. For example, we have to remove non-deterministic behavior. Because the queries are generated in a random manner, a query can end up comparing a timestamp column with the current time. Since it uses the current time in the comparison, the behavior is non-deterministic, and in that case we have to discard the query.
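A minimal sketch of the feedback loop just described, reusing the toy generator above. The 3x threshold, the periodic reset, and averaging over repeated runs follow the discussion; the exact numbers and the `run_query` helper are illustrative assumptions, with `conn` standing for any DB-API connection.

```python
import time

THRESHOLD = 3.0        # empirical: developers take roughly 3x slowdowns seriously
RESET_PERIOD = 86400   # reset the table periodically (e.g. once a day)
                       # so the probabilities do not overfit one class

def run_query(conn, query):
    """Time one execution; `conn` is any DB-API connection (an assumption)."""
    start = time.perf_counter()
    cursor = conn.cursor()
    cursor.execute(query)
    cursor.fetchall()
    return time.perf_counter() - start

def regression_ratio(old_conn, new_conn, query, runs=3):
    """Average over several runs on both versions to filter out
    non-deterministic noise from the OS, caches, or the disk."""
    t_old = sum(run_query(old_conn, query) for _ in range(runs)) / runs
    t_new = sum(run_query(new_conn, query) for _ in range(runs)) / runs
    return t_new / t_old

def fuzz(schema, old_conn, new_conn, budget=100_000):
    candidates, last_reset = [], time.time()
    for _ in range(budget):
        if time.time() - last_reset > RESET_PERIOD:
            for elem in PROB:              # periodic reset against overfitting
                PROB[elem] = 0.25
            last_reset = time.time()
        query = generate_query(schema)
        if too_complex(query):
            continue
        if regression_ratio(old_conn, new_conn, query) >= THRESHOLD:
            candidates.append(query)
            for elem in PROB:              # reward grammar classes appearing
                if elem in query:          # in regression-inducing queries
                    PROB[elem] = min(0.9, PROB[elem] + 0.05)
    return candidates  # these still go through the validation/filtering pass
```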
Are you fuzzing insert, update, and delete queries — things that actually modify the table — or is it only selects?

We only consider SELECT, but in our next version of Apollo we are considering creation, delete, update — everything.

And on non-deterministic behavior: that could also mean the OS was doing something in the background, or your disk controller was doing something weird, and you got a slower run. How do you handle that? Will you fuzz the same query multiple times?

Oh, yes, right. After we discover a regression, we run the query multiple times to check the actual running time.

Got it.

And we use the average value, okay? Another interesting case is the non-executed plan. There was a query using a subquery, and the older version and newer version of the DBMS took different paths in the optimizer. Somehow the older version returned nothing from the subquery, but the newer version returned several tuples from the subquery.

What's this? The old version returned nothing from the subquery?

The older version didn't execute the rest of the execution path, but the newer version executed more of the path because the subquery returned something. In this case, the developers clarified that this is not a bug; this is an inherent problem of the DBMS. So we also discard any case with a non-executed plan. And finally, like I mentioned, we also check whether the query is too complicated; if so, we discard it.

This is Panos from the University of Pittsburgh. Actually, this second case is very interesting. Can you give us a more concrete example?

Oh, okay. I think I have to repeat it in more detail. We started from a query that contains a subquery, but we cannot guarantee which execution path the optimizer chooses. The older version of the DBMS returned nothing from the subquery, while the newer version returned several tuples from it. But the developers think this is not a misestimation or misbehavior of the optimizer; it's more like a lucky choice by the older version.

So the regression is not the query optimizer's problem but could be the operator's problem — actually a functionality problem. Because, as Andy asked, you are giving the system a static database. If you run the query and in principle it returns three tuples, it should return three tuples all the time. The only difference with respect to a regression is speed. If you have two different answers under the same circumstances, it means you have a buggy system — an inconsistent system.

Right, actually we also thought the same way as you did. We thought the misestimation was also a problem when we reported the bug, but from the feedback, the developers think the misestimation is an inherent problem of a DBMS.

I just didn't understand that; the answer is confusing me.

I think — just to add some more context. Hi, Panos.

Hi Joy, how are you?

Good, I am doing well. So in this case, the two versions of the system came up with different query plans because of an optimizer update. In the old query plan, the subquery apparently didn't return any results during execution, so the whole query ran very quickly.
In the new version, this optimizer update probably helps a lot of queries, but for this particular query, it ended up picking a plan where you do more work before you figure out that it's actually not useful work, and therefore there is a performance regression. So it is actually stemming from an optimizer update.

I think I got it. Yeah, thank you for that. Sorry, Joy. What you're saying relates to the first question that Dong asked — hi Dong, nice meeting you — basically, in the one case the optimizer realized from the catalog that the subquery was empty, so it didn't bother to invoke it, whereas the other one retrieved tuples and then rejected them later on. That could possibly be the case; I understand that.

Yeah, yeah. I think we give more details in the paper. We can definitely chat about it offline.

Yeah, thanks. All right.

So, when the SQL fuzzer discovers performance-regression-inducing queries, we send all the queries to the SQL minimizer. The minimizer tries to find the essence of the problem, to report the regression to the developers. In this component, we apply two different strategies: the bottom-up strategy tries to extract a small subquery, and the top-down approach tries to remove unnecessary expressions iteratively. Let me show you one example. Assume this is the query that the user of our system discovered. Using the bottom-up approach, we try to extract the subquery. If there is a dependency, we remove the dependencies between the original main query and the subquery, and run it. If this query still shows the regression, we extract the subquery — that's why we call it a bottom-up approach. After we extract the subquery, we apply the top-down approach and try to remove elements one by one: we can remove conditions, columns, nested subqueries, and clauses. After the removal process, you get the minimized version of the query. Then the user can report this query to the developers.

Okay, wait a second. Sorry, Dong again. The question I'm trying to ask is: this reduction seems okay, but there is a problem. You had a complex query that caused the regression, and then you're removing things. How do you make sure that when you remove something, it still causes the same regression? Does that make sense?

Right, so every time we remove something, we run the query on the older version of the database and the newer version of the database and check the execution times.

Okay, so you still see a similar gap, so you think it's the same regression?

Similar — not similar absolute times, but similar magnitude. A similar gap.

I mean, a similar relative difference.

Yes, right. Not the absolute time difference.

Okay, cool.

And finally, we apply the SQL debugger to automatically find the root cause of the problem. The SQL debugger actually works together with the SQL minimizer. When we apply the SQL minimizer, we generate many partially reduced queries, because we apply our strategies one by one. So there are several partially reduced queries, and at the same time, we apply commit bisection to find the last fast version and the first slow version of the DBMS. Then, using the partially reduced queries, we collect execution traces. We collect the control flow from those traces, and by using the statistical debugger, we finally get the bug report.
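A minimal sketch of the bi-directional reduction just described, reusing `regression_ratio` and `THRESHOLD` from the earlier sketch. `extract_subqueries` and `candidate_removals` are hypothetical helpers; note that every intermediate variant the loop tests is exactly the kind of partially reduced query the SQL debugger reuses.

```python
def still_regresses(query, old_conn, new_conn):
    """A reduction step is kept only if the relative gap between the two
    versions stays above the threshold (similar magnitude, not similar
    absolute times)."""
    return regression_ratio(old_conn, new_conn, query) >= THRESHOLD

def minimize(query, old_conn, new_conn):
    """Bi-directional reduction sketch. `extract_subqueries` and
    `candidate_removals` are hypothetical helpers: the first returns each
    subquery rewritten to stand alone (dependencies on the outer query
    removed), the second enumerates copies of the query with one condition,
    column, subquery, or clause dropped."""
    # Bottom-up: if a stand-alone subquery still shows the regression,
    # recurse into it and forget the outer query.
    for sub in extract_subqueries(query):
        if still_regresses(sub, old_conn, new_conn):
            return minimize(sub, old_conn, new_conn)
    # Top-down: greedily drop one element at a time while the regression
    # survives; every intermediate query is kept for the SQL debugger.
    reduced, progress = query, True
    while progress:
        progress = False
        for variant in candidate_removals(reduced):
            if still_regresses(variant, old_conn, new_conn):
                reduced, progress = variant, True
                break
    return reduced
```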
This is the overview of the SQL debugger. First we use commit bisection, which is a very widely used technique. Starting from the old version and the newer version, we do a binary search and find the first problematic commit. In this case, commit number two is the point where the problem begins. Then we work with the SQL minimizer again: starting from the original regression query, we apply the minimization, collect the partially reduced queries, and run all of them on the fast and the slow versions of the DBMS. But here is one minor problem. Because we are collecting execution traces from two different versions of the DBMS, the addresses in the binaries are different due to the different compilation. Even if we have the same function, the address of the function differs, so the traces have different offsets. To correct this, we enumerate all the function names and the addresses of the functions. By using the base address of each function and the offset from the beginning of the function, we make an alignment. In this example, the named function has a different address in each binary, but we can find the same predicate using the offset (here, 0x20) from the function's base address. In this way, we align the traces.

After we make the alignment, we are ready to use the statistical model. We explain the details of the statistical model in our paper, so please refer to it for more information. When we provide the predicate information and the results, we can finally generate the final report. The intuition behind our statistical model is that we calculate the relationship between each predicate and the result. So that's the Apollo pipeline.

Yeah, just an interruption. Going back to Dong's question: during reduction, as you said, Dong, sometimes the regression disappears during minimization. So some of these partially reduced queries are actually good queries, in the sense that they don't exhibit the regression, and some of them do exhibit the regression, which is exactly what we need for statistical debugging. We want to compare the control flow traces of the queries that run fast against those that run slow. So on this slide, the predicates are essentially branches in the program, in the database code. We compare the branches that were taken and the branches that were not taken in the good runs and the bad runs. And note that minimization is an iterative process, so we have many such good runs and many such bad runs. By comparing the branches that were taken and not taken, we can figure out which branches are highly correlated with the regression. So that's the final idea.
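A minimal sketch of the two ideas just described: aligning raw branch addresses across two differently-compiled binaries as (function, offset) predicates, and then ranking predicates by how strongly they correlate with the slow runs. The scoring here is only a toy ratio; the paper's statistical model is more careful.

```python
import bisect
from collections import defaultdict

def normalize(trace, funcs):
    """Align a raw branch trace across differently-compiled binaries.
    `funcs` is a sorted list of (base_address, name) pairs for one binary;
    each absolute address becomes a (function name, offset-from-base)
    predicate, which is stable across the two compilations."""
    bases = [base for base, _ in funcs]
    predicates = set()
    for addr in trace:
        i = bisect.bisect_right(bases, addr) - 1
        if i >= 0:
            base, name = funcs[i]
            predicates.add((name, addr - base))
    return predicates

def rank_predicates(good_runs, bad_runs):
    """Toy statistical-debugging score: a predicate is suspicious when it
    shows up in a larger fraction of slow runs than fast runs."""
    hits = defaultdict(lambda: [0, 0])         # predicate -> [fast, slow]
    for preds in good_runs:
        for p in preds:
            hits[p][0] += 1
    for preds in bad_runs:
        for p in preds:
            hits[p][1] += 1
    def score(p):
        fast, slow = hits[p]
        return slow / len(bad_runs) - fast / len(good_runs)
    return sorted(hits, key=score, reverse=True)   # bug report: top-ranked first

# Example: the predicate at foo+0x20 appears only in the slow run.
funcs = [(0x1000, "foo"), (0x2000, "bar")]
good = [normalize([0x1004, 0x2010], funcs)]
bad = [normalize([0x1004, 0x1020, 0x2010], funcs)]
print(rank_predicates(good, bad)[0])   # -> ('foo', 32), i.e. foo+0x20
```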
Yeah, so I get that. The interesting thing here is that a query optimizer can be pretty complicated, involving multiple transformation rules, different cost models, and all this kind of stuff. I understand how you're trying to map different predicate conditions, trying to see what kinds of conditions trigger the problem. But one thing I'm planning to ask is: because it's complicated, when you actually introduce a disruptive commit, it will destroy this whole thing, because you cannot even align the traces anymore. For example, someone introduces a new rule that somehow offsets the effect of another rule, so the predicate structure is different from before. And say you have condition one and condition two — I don't know if you're doing exact matching: does one predicate need to be exactly the same as the corresponding predicate in the other trace, or does it just show up at the same place and look similar? That can cause another issue, because the logic of those two lines may be different.

To interrupt, a better question is: for Postgres and MySQL — Postgres is 30 years old; MySQL, I don't know if you're looking at version 5 or version 8 — they've been pretty well banged on for a while, so between commits I imagine there are not major changes. The question is, how far back in terms of the number of commits do you have to look when you see a regression, and how widely modified is the code, such that, as Dong is pointing out, you can't do this alignment anymore?

Right, that's a very interesting question. In our case, we do the commit bisection first, so the amount of code modification is not that huge — say hundreds of lines. Among those modifications, we try to pinpoint the location using the statistical debugging. In the next version of Apollo, we have also tried automated debugging using the statistical model without commit bisection, and in our experience, we can still find the exact location without it. But in that case — here we use the good queries and the bad queries, but to make the statistical model work without commit bisection, we need good query mutation. For example, here we are just using the minimized versions of the query, but to make it work, we needed to mutate the query to hit the branch with the problem.

Another thing: I guess Andy's question was actually about how far back in time you have to go to find these bugs. That's actually not something we have studied, Andy; we have not done an extensive study of several bugs over time. That would be a very interesting question to study, I agree. And one minor clarification: the predicates in this case are actually at the level of assembly instructions, not at the source language level, so it is pretty granular information that you're getting from statistical debugging.

I'm not saying you should do this, but you could maybe look at three categories of system age. On one hand you have Postgres, which is 30 years old; a middle-aged one would be Mongo, which is 10 years old now; and then a newer one — I'm not saying our system, but something that just came out two years ago. I suspect your thing is probably easier to handle on old, mature systems.

I see, yeah, that would be a very interesting study. Thanks for the suggestion.

Yeah, it's interesting, but don't do it, because it's a lot of work.

Okay, I will continue. After we apply the statistical model, we generate a final report, which shows the rank of each predicate. And as a recap, so far I have introduced the Apollo toolchain and its three components.
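As an aside, the commit bisection step mentioned earlier amounts to a binary search over the commit history. Here is a minimal sketch, with `is_slow` as a hypothetical callback that builds the DBMS at one commit, runs the minimized query, and reports whether it exceeds the slowdown threshold.

```python
def bisect_commits(commits, is_slow):
    """Binary search for the first commit where the regression appears.
    `commits` is ordered oldest to newest; the first is known fast and the
    last is known slow. `is_slow(commit)` is a hypothetical callback."""
    lo, hi = 0, len(commits) - 1   # invariant: commits[lo] fast, commits[hi] slow
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_slow(commits[mid]):
            hi = mid               # regression already present at mid
        else:
            lo = mid               # still fast at mid
    return commits[hi]             # the first problematic commit

# Toy check: the regression appears at commit "c2".
history = ["c0", "c1", "c2", "c3", "c4"]
print(bisect_commits(history, lambda c: c >= "c2"))   # -> c2
```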
For the evaluation, we tested it on Postgres and SQLite, and we used DynamoRIO for collecting the control flow. In our evaluation, we tried to measure the efficacy of the SQL fuzzer and the SQL minimizer, and the accuracy of the SQL debugger. When we applied the SQL fuzzer, we discovered 10 previously unknown, unique performance regressions, and we observed on average a more than 200 times performance drop on Postgres and SQLite. In terms of false positives, when we applied all the filtering rules learned from the developers, we could remove almost all false positive cases. And our minimizer significantly reduced the query size, by 76% from the original size.

Actually, go back to the last slide. So these are the filter rules where the developers say, if it's not 3x slower, we don't care about it, right?

Right, right. Those are all from the developers' feedback.

So of the 99 queries you found, each of them had something wrong?

Right, those queries all did show some regression — several queries where something got slower.

Right. But then after you do the filtering that the developers told you to do, it looks like you got one query?

Right, so we found many false positive cases.

"False positives" — let me rephrase that. How many of them are really things that got slower and should be addressed, but the developers just don't have the labor or the manpower to do it, versus, oh, this is not realistic, or you're doing something bizarre that no one would actually do in SQL anyway? You know what I'm trying to say? The fact that the filter removes so many is kind of arbitrary, because that's the developer saying, yeah, this query got slower by three seconds, I don't care about that. But it really was a regression.

That's right — a regression, and the developer acknowledges it. So "false positive" is maybe not the right word to use.

This is not so much a comment about what you're trying to do — yes, you can do filtering, fantastic, do that. It's more like MySQL said, hey, we don't want these because we don't have time to fix them.

Actually, our tool is built more from the developer's standpoint. For the developer, it's better to remove in advance any cases they would discard when they receive the query. That's why we built the filtering based on the developers' feedback.

All right, so what would be useful here, instead of just before and after, is maybe a breakdown of the discovered queries — which ones would be filtered out by each of your six rules.

Yes, right. In our evaluation, we provide the ratio filtered out by each filtering rule.

Okay, cool.

In fact, I had a similar question — this is Panos speaking. I was wondering whether you had a classification of which classes of queries are realistic and expected, versus very esoteric. Because with this randomness, you get all these syntactically correct but meaningless queries.

Right, most of the queries that we generate are not semantically meaningful, because they're generated in a random manner. But if the query is simple enough and could happen in the real world, we can consider it a regression.
And a related question: maybe you could have used some of the standard TPC-style queries, or tried to map the ones that you generate randomly to see whether they fall into the characteristics of the expected queries that systems use in benchmarks.

In this evaluation, we used the TPC benchmark. In the next version of Apollo, we generate arbitrary databases when we do fuzzing, but here we used the fixed benchmark and used only SELECT statements to find regressions.

Okay. As a suggestion, maybe you could start with a benchmark and then do mutations — a more controlled way to do the fuzzing, rather than completely random.

I see.

So just to clarify, Panos — Jinho, isn't that what you're actually doing? You are starting from a TPC-H query and mutating it?

Yeah, as a starting point, we used TPC-H.

So that's actually what Panos was asking. We are not starting from scratch, Panos; we are actually starting from these queries and mutating them.

That's a very good idea — so that's what he's doing, actually. Okay, I missed that. Thanks.

Okay, moving on to the regressions and the SQL debugger. Among the 10 regressions, for five of them the responsible predicate was in the first-ranked branch, meaning that we correctly identified the location. Two cases were in the second-ranked branch, and the remaining three regressions were in the third-ranked branches.

This slide shows an interesting case study. This query shows about a 1,000 times slowdown on the latest version of SQLite. The problem was caused by a bug fix, specifically a fix for a correctness bug. When the developer handled a correctness bug involving a WHERE ... IS NOT NULL clause, the developer had, at that time, no way to account for the optimizer's behavior, so they disabled an existing optimization on LEFT JOIN. Meaning that when they fixed the bug, they discarded the existing optimization. The consequence is that the optimizer no longer provides the LEFT JOIN optimization, which makes execution much slower.

And this example shows about a three times slowdown. The developer implemented a new version of hash aggregation; the individual function was faster than before, but by mistake, the hash aggregation resulted in redundant building of the hash tables — the function regenerated the hash table for each comparison. That's why it shows about a three times slowdown.

Before I conclude this talk, I want to introduce several interesting fuzzing works from the security community. Fuzzing now supports various domains: we currently fuzz the OS kernel itself, and we fuzz hypervisors and applications on various devices like IoT devices or mobile phones. Fuzzing also comes in hybrid approaches, combined with external components. Driller and QSYM incorporate a concolic execution engine and try to solve complicated branch constraints: whenever the fuzzing gets stuck, they let the constraint solver calculate the complicated conditions. Another fuzzer employs dynamic taint analysis to find a one-to-one matching from input bytes to specific branches, so that it can mutate the critical locations intensively.
And fuzzing is now combined with deep learning techniques: researchers developed a training mechanism to automatically discard uninteresting inputs. There is also an ensemble approach: someone combined all the good fuzzers and used them at the same time, and whenever one fuzzer finds an interesting test case, they share it among each other. This is called the ensemble fuzzing approach. We also have specialized environments for fuzzing. One group developed a specialized operating system for fuzzing: they solve fuzzing problems by introducing a better version of the file system and a better fork mechanism. Another research group proposed hardware-based coverage collection: whenever the fuzzer discovers a new or interesting input, the hardware directly updates the internal bitmap structure for fast coverage computation.

Jinho, this is sort of a shotgun of a bunch of fuzzing papers that I'm sure are interesting. Which ones are actually related to databases that we should care about?

Okay —

Yeah, Jinho, you can — I was also, yeah, surprised, but it's good. This is good. I guess he is trying to spread the gospel of security research.

Clearly your thing works and I think it's a good idea, but here's a bunch of fuzzing stuff that's newer, and I don't know how to map them back to what you just taught me. Should I care about these things and not yours, or should I just care about yours? Or do I need to understand these to appreciate yours?

Oh, I think this is kind of a random assortment. I chose things that could potentially help DB research — DB fuzzing, or DB performance research. Because I don't think there is a specialized operating system optimized for databases yet.

So I think you can wrap up this section, move on to questions, and take some questions. You had an acknowledgements slide, right?

Yeah, yeah. Okay, so in the current version of Apollo, we provide a toolchain for detecting and diagnosing regressions, and we are working on open-sourcing it, so it will be open source soon. We are also working on the next version, to discover other types of bugs such as performance bugs, data corruption, and crashes. We are interested in integrating our system with more DBMSs — for example, we discovered five more performance regressions in CockroachDB — and we are improving our toolchain based on their feedback. Finally, our automation will reduce the labor of the developers, so that they can focus on more important problems. I thank Postgres, SQLite, and CockroachDB for providing us valuable feedback, and I also thank CMU for inviting me to present my research work.

Okay, awesome. We can't do applause, but I'm sure everyone greatly appreciated it, because only two people dropped off while you were talking, so that's awesome. Okay, so we'll open up the floor, if anybody has any questions.

Okay, so I have — you can go first. No, you're farthest away, you go first.

Okay, cool. So I have two back-to-back questions, kind of related.
Number one — I think this is very interesting, because I know a bit, not too much, about fuzzing and reducing, but one thing I want to know, related to related work: has anyone done something similar to what you did — fuzzing for performance regressions — trying to find something other than correctness bugs or security holes that crash the system, which are the common fuzzing targets? It seems new; it's the first time I've seen someone do performance regressions with fuzzing. That's number one. Number two goes deeper down this road: when any software bumps from one version to another, supposedly they don't introduce any correctness bugs. Then there's the question: is a regression — which is not a correctness bug — really a bug, or is it a choice? To better serve the common cases that users actually care about, developers may deliberately sacrifice performance in some corner cases, introducing regressions there to make the common cases much faster.

To answer the first question, there was very similar work on the security side, but that previous work tried to find performance bugs, not performance regressions. For example, with the quicksort algorithm, when a user provides specially crafted input, the algorithm can show very slow execution. Fuzzing techniques tried to find such corner cases as performance bugs, but I'm not aware of general fuzzing work on performance regressions. That's my answer to the first question. For the second question, could you repeat it again?

The second question is: when there is a new commit going into the system, it can cause some regression, but the regression may not be a bug; it can be caused by a choice. That's what I mean.

Right, so —

That requires a human to make a value judgment; that's not what his tool is trying to do. His thing is to alert you that something happened.

Yeah, so I guess you just spam everything that you find to me, and I figure out which ones are actually a problem. That's kind of the thing.

Yeah, okay. But we believe that if the regression exceeds some threshold, it could be a problem. Even if the developer made a choice, the customer may not agree with the regression.

Okay, Panos, you had a question.

Yeah, in fact it relates to the second one, and I guess it provides the answer to the question you had before about going from 99 queries down to one: maybe for all the other ones, the developers thought they were by design rather than bugs. And my question was similar, in the sense that, if we assume that systems evolve, how much attention could you focus on the parts that have been changed, to see whether you actually meet expected behaviors, rather than going after corner cases that might have been introduced in some commit 25 years ago but were never discovered so far, so they're irrelevant? This is what I was thinking: this somehow helps you, from an evolutionary point of view, to detect problems that you thought were solved but actually became headaches. Have you discussed that with the developers?

Right, actually this is a difficult problem. We don't know the answer before we actually submit the bug to the developers.
Even if we are very sure about a bug, the developers can say it was already discussed several years ago and they decided not to treat it as a regression. One example could be our experience with parallel execution. In some cases, the optimizer decided to use parallel execution, but when the DBMS engine prepares the parallel execution, it takes some initialization time. So sometimes the newer version is much slower because it uses parallel execution, whereas the older version does not. We don't know about this history; it was a decision made by the developers. Even if we observe a regression, we don't know the answer before we submit the report to the developers.

Okay, any more questions?

Hey, I have one — David Daly.

Hello.

David Daly from MongoDB. First off, thanks for opening up these talks. That was kind of cool.

Are you the performance person at Mongo?

Yes, I am.

Oh, awesome. Welcome.

Thank you. So one question I had — this is really cool stuff; I'd actually read your paper before — is about concurrency. All of your tests, I believe, are a single query run by itself. A lot of our interesting performance problems come from things happening at the same time: something locking up in the system, or tcmalloc going crazy. Is there a path towards extending what you've built to the concurrent case as well?

Actually, we are now considering correctness bugs, data corruption, crashes, and other things, but not concurrency bugs. But since we are building it as a plugin system — meaning, if we are interested in a different class of bugs, we can add a plugin — we could support a concurrency bug plugin. I think that is a good idea.

And I don't even necessarily mean that something's wrong — just that when you run 120 clients doing something in parallel, it gets slow now and it didn't before.

Right, but in our setup and the current version, we didn't consider concurrency or multiple connections, just a single connection. But I think we may need to consider those issues.

Okay.

And David, what do you guys do now to find those things?

Oh, we run a huge number of tests regularly. That's the short answer; the full answer is more than a minute long.

But I mean, is it like the threshold thing — do you care? I was surprised that Mongo, MySQL, and Postgres say if it's not 3x slower, we don't care.

Yeah, 3x seems a little aggressive to me. It depends greatly on how wide the regression is and what the test is. But largely, on things as focused as what you're doing here, if it's below a 10% change, we're probably not looking at it; above a 10% change, we're looking at it. If it's a more general end-to-end test where we've really loaded the system — say it's YCSB or TPC-C or LinkBench — we'll detect and investigate things that are smaller than a 10% regression.

Okay, very cool. Someone is asking a question in the chat, so I'm just trying to acknowledge that.

Thank you, let me see.

It was so fast, I couldn't understand. Did you get it in the chat, or should I read it again?

Yeah, this Zoom app doesn't really work well. This is the first time I'm using Zoom.

Oh, okay. So, someone, repeat the question slowly.
The question is: are performance regressions that don't involve a change in query plan more likely to be serious, such as the hash aggregate implementation example? He's basically asking: if I throw the same query and the query plan doesn't change, versus another query that has a different query plan, and they both have regressions, which ones are going to be more serious — more significant, or more worth investigating?

I see, I see. That's a great question. We thought about the problem, but we could not differentiate the internal optimizer's decisions. When we reported bugs, we didn't look at the actual internal decision-making of the optimizer.

Yeah, I guess the answer would be yes, because it is highly likely that those execution engine operators would be used more often, irrespective of a bug in the optimizer, right? So that's probably a bigger concern compared to issues in the optimizer, which rarely manifest. But yeah, that's also something we should probably study — bugs in different components of the database system. I just wanted to give a quick shout-out to John Regehr at Utah, whose work on Csmith and C-Reduce actually kind of inspired Apollo.

Oh yeah, this is interesting, because all of my background on fuzzers and all this comes from John Regehr. I actually sent this paper to him, but I don't know if he's looked at it.

Yeah, and John actually mentioned that his work on Csmith and C-Reduce was in turn inspired by a conversation he had with the database developers on the SQL Server team at Microsoft in the 90s, who were building a tool called RAGS for stochastic testing of database systems. So it has kind of come back to databases.

Yeah, back and forth, which is good — this is how our community is supposed to work.

Okay, it's been an hour, and I have a small child under the age of one. Last question: how usable is your code now? How long would it take for me to point it at a new database and start running this automatically?

Joy suggested a very systematic approach to support usability, so we are thinking about providing YAML files where the user can specify the port number and the address, and then they just let the fuzzer go. That's our goal.

Oh, you're not there yet, but you are working on it. It's research code, right? I totally get it, okay. All right, this is awesome. All right guys, let's stop here. Let's again thank Jinho for spending time with us, and all of our friends from Georgia Tech for joining in with us.