Carnegie Mellon vaccination database talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and Postgres configurations at ottertune.com. And by the Steven Moy Foundation for Keeping It Real; find out how best to keep it real at stevenmoyfoundation.org. All right, guys. Welcome to the last vaccination database seminar talk for the semester. We're super excited to finish off with professor Eugene Wu. He is an associate professor without tenure at Columbia University. He did a PhD under Sam Madden at MIT. So he's gonna talk about his research on human data interaction and building systems around that space. So as always, if you have questions for Eugene as he's giving the talk, please unmute yourself, say who you are and where you're coming from, and feel free to interrupt him at any time. We want this to be a conversation and not him talking by himself. And Eugene, hopefully you will introduce the museum piece behind you before you talk. Okay, sure, yeah. So thanks, Andy, for inviting me to close out the vaccination season. And this museum piece is a prized part of my art collection. It's a t-shirt signed by Mike Stonebraker, who's one of the pioneers and illustrious researchers in our field. So with that, let me get started. I also noticed, by the way, that Andy added to my bio that I wear pants. And I wanted to clarify that it depends on the temperature and situation, whether or not it's needed. I also just wanna clarify that, in contrast to the rest of the series, I'm not gonna talk about a particular database system, but instead share the vision that my students and collaborators have been working on in this area of systems for human data interaction. So as we all know, computing is powerful because users don't need to program, right? They can use the human-computer interfaces that we all use today.
But the premise, of course, of this entire seminar is that the center of gravity is shifting from compute to data, to the point where data rules everything around us. And so there's a corresponding shift towards human data interfaces, right? Interfaces that are centered around visualizing and interactively analyzing and manipulating data. And we're seeing this trend across nearly every discipline and every task, right? Across monitoring, finance, marketing, science, and industry, as well as data tasks such as data wrangling, data modeling, data exploration and analysis, and so on. And in fact, if you think about databases themselves, the reason we have this seminar and this field at all is because they represented a leap in human data interaction from imperative programming to declarative specification. And now the demand is for better interfaces not just for programmers, but really for everyone. You can think of human data interfaces as both the front-end interface, right? That renders data as pixels on the screen and lets users point at and manipulate them. And the underlying systems that transform and process the data in response. And yet despite the active research in both of these areas, it's still very difficult, if you think about it, to build and design data interfaces today. And I wanted to start with two of the major reasons why, which are a combination of scale and what I call design dependence. So if you think about scale, if your dataset is small and you can just process it in the browser, in the client, then the problem is arguably solved and there are many, many great solutions today. But as the data size grows, your architecture necessarily becomes more complex. For instance, you might need to shift processing out of the browser to your laptop. And if the data gets any larger, then you might have to use a remote server and database, or even the cloud, right?
And all of this introduces latency at every system boundary, as well as network bottlenecks, cloud API calls, and so on. And on top of that, you're now essentially implementing a distributed system. But in addition, right, the interface itself hasn't really changed. The user still expects the same responsiveness because they don't really see any of this. And in order to provide that, we often end up adding caches at every high-latency layer in the system. And if the caches become quite smart, right, because you might want to query the data in a cache, then they start looking like databases. And it's not to say that at every level of this architecture there aren't great libraries, right? There are really good libraries and systems to solve each of these individual problems, but optimizing and combining them together is still non-trivial. In fact, if you squint, this is kind of akin to the memory hierarchy. And we know that programming efficiently against such a hierarchy is one of the fundamental challenges in computer science. However, in this particular case, we're asking designers to manually program and optimize against this hierarchy and combine disparate software systems. And part of this signals a lack of great abstractions for specifying and describing the interfaces, right? And enabling the kind of optimization that we know and love. Now you might say, well, there are a lot of fast and responsive visualizations out there. So clearly, if you know exactly what you want to build, you can hire a crack team of developers and designers and just build and optimize it. But in most cases, interface design is actually an iterative process. And the problem is that design choices affect the system architecture as well. So let's consider a very, very simple example. We'll look at different designs of the left interface, which is just analyzing legislative votes from the chambers of Congress.
And then on the right, we'll have a trivial architecture and just look at what kinds of data structures it might pre-compute. So if you want to add two buttons to provide interactivity to choose the decade, then it's pretty simple, right? We can pre-compute both decades. And then when the user clicks on one of them, we can send the pre-computed results back to the user. So that's very simple. But what if we change the design slightly to a range slider? Well, this would then require pre-computing a quadratic number of views. And so you would need a different data structure, maybe a data cube or something else. You might even consider another optimization: push that to the browser and then cache and process it in place, right? But then maybe the designer wants to try adding a radio button or something else. And that requires yet another type of optimization, right? And this is just for a very simple example, but the bottom line is that any minor change in the interface design directly affects the architecture and the types of optimizations that are needed. And so what we see here is that the trend in most application development today, on web and on mobile, is veering towards no-code levels of simplicity when data is not really an issue. Yet the combination of scale and design dependence makes this increasingly harder for data-intensive interfaces. So what our lab has been focusing on over the past several years are three classes of problems under the umbrella of the data visualization management system. At the foundations level, we're thinking about identifying system primitives that can more easily express and build data interfaces and enable optimization. On top of that, we're also developing systems and algorithms to more easily design interfaces themselves.
And then at the very top, we've been thinking about: if we can design interfaces easily and we can implement them easily, then what other capabilities beyond displaying data and clicking on widgets could we think about, right? And so there are many projects at each of these different layers, but what I wanted to do today is focus on one slice through this architecture. Two of these projects are gonna showcase how a decades-old concept called data lineage helps us innovate at the systems and interaction levels. And then the third project shows how we might expand the range of interfaces that are designed and created today. So I'll start with Smoke, which reframes interaction as lineage operations. This is work initially led by Fotis Psallidas, and it's now continued by Haneen and Charlie. So what is fine-grained lineage? Given a workflow such as, for example, here I have two tables, I wanna join them by ID or color, and then I wanna aggregate them by ID or color, right? Fine-grained lineage corresponds to the record-level dependencies between operator inputs and outputs. Concretely, if we look at output one, the orange record, then it depends on the join results J1 and J2, and J1 correspondingly depends on A1 and B1 in the base tables. And this forms a graph that can tell you which inputs contribute to any output records, and vice versa. And it's very useful in terms of debugging, privacy, fixing distributed protocols, incremental view maintenance, and more, right? So it's been around for several decades, and it's very useful. However, actually materializing and using this lineage graph today is very, very expensive. Even in recent papers and systems, on analytical queries, the existing overheads can be between 10x to over 1000x slowdowns of the query. And it kind of makes sense, because a very, very optimized read query has now been transformed into a massive write operation, right? And in addition, lineage is often inefficiently represented.
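To make the join-then-aggregate example concrete, here's a minimal sketch of fine-grained lineage as record-level dependencies. The record IDs (a1, b1, j1, o1) follow the slide's example; the dictionary representation is just for illustration, not how Smoke stores lineage.

```python
# Backward lineage for a two-operator plan: join -> aggregate.
join_lineage = {            # join output rid -> (row in table A, row in table B)
    "j1": ("a1", "b1"),
    "j2": ("a2", "b1"),
    "j3": ("a3", "b2"),
}
agg_lineage = {             # aggregate output rid -> contributing join rows
    "o1": ["j1", "j2"],
    "o2": ["j3"],
}

def backward(output_id):
    """Trace an aggregate output back to the base-table records it depends on."""
    base = []
    for j in agg_lineage[output_id]:
        base.extend(join_lineage[j])
    return base

def forward(base_id):
    """Which outputs does a base record contribute to?"""
    js = {j for j, pair in join_lineage.items() if base_id in pair}
    return [o for o, deps in agg_lineage.items() if js & set(deps)]
```

So `backward("o1")` walks the graph down to `["a1", "b1", "a2", "b1"]`, and `forward("b1")` tells you that editing or deleting b1 would affect output o1.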
For instance, I'll describe this next. So to give you a sense, there are two common approaches today. The first is what we call the logical approach, where we stay within the relational model. And the goal is to just rewrite the base query so that the output is annotated with this lineage information. For example, here in the output, it would be the gray attributes, right? A and B. Now the nice thing is this is compatible with any database system. But as you can see, output one, O1, has now been replicated for each combination of input records that it depends on from the base tables. And the query now needs to do a lot more work to actually compute these annotations because of this denormalized representation. So there are some inherent overheads in this approach. Now alternatively, the physical approach, which is popular in big data systems like Spark, is to modify each operator implementation to write pointers out to some external lineage subsystem, like a graph or a key-value store. But it turns out that even just making these virtual function calls can be very expensive in any fast database system. We experimentally see that just making empty virtual function calls can slow things down by over 2x. In addition, the subsystem also replicates a lot of functionality that the database provides. And so in both cases, it's quite expensive. Now the reason we care about this, and I'm harping on performance, is because lineage is actually a natural way to express interactions and define interfaces. Everything that you see on the screen, that's data, and when you point to any pixel that represents data, what you're really referring to is its lineage. For instance, if I select points in the scatter plot, what I'm really selecting is the data that backs it, or its lineage.
And similarly, if this base data has been rendered in a different way, for example in a bar chart, then we typically will see that the bar chart changes when we select data in another chart, right? This corresponds directly to either forward lineage or view refresh. And so ultimately all of this is declarative, right? What's happening here is that the user interaction is data that we're joining with what's rendered in the scatter plot. We then get its lineage and refresh the other views. In other words, if you think about interactive visualizations, the initial charts are the output of SQL, and the interactions are then lineage and more SQL. Beyond just visualization, data is manipulated and rendered in many other forms. Yifan Wu's UIST 2020 work explored how you might wanna, for example, reference lineage in code. For instance, you might wanna query against the lineage that you selected and pointed to in some other visualization or chart. Or you might wanna fit a model against it, right? So this is all programming at this point. And it makes sense, because what lineage really represents is the correspondence between your input data and what you see on the screen. So ultimately lineage is great. However, existing systems are super, super slow, right? It's just unrealistic to use them to drive interactive applications. And yet we know that fast visualization implementations do exist, and so what Smoke is trying to do is bridge this gap. Smoke is a row-based in-memory query engine. During query execution, as it executes each operator, it materializes the lineage information for that operator. The idea is really to overlap lineage capture and operator execution by reusing the data structures built during execution, such as hash tables, for lineage capture. And we instrument these physical operators and use query compilation to remove that overhead.
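The "interaction is lineage plus more SQL" idea can be sketched in a few lines. This is a hypothetical toy: each rendered mark remembers the record ID it came from, a brush selection is backward lineage, and refreshing a linked bar chart is forward lineage plus re-aggregation.

```python
# Rendered scatter-plot marks; each mark keeps its source record ID (rid).
points = [
    {"x": 1, "y": 2, "rid": 0, "category": "a"},
    {"x": 5, "y": 6, "rid": 1, "category": "b"},
    {"x": 7, "y": 8, "rid": 2, "category": "a"},
]

def brush(points, x0, x1):
    """Selecting pixels really selects the lineage of the marks under them."""
    return [p["rid"] for p in points if x0 <= p["x"] <= x1]

def refresh_bar_chart(points, selected_rids):
    """Linked-view refresh: re-aggregate only the selected lineage."""
    selected = set(selected_rids)
    counts = {}
    for p in points:
        if p["rid"] in selected:
            counts[p["category"]] = counts.get(p["category"], 0) + 1
    return counts
```

Brushing the x range [4, 8] selects rids [1, 2], and the linked bar chart re-renders from just those records.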
If you know that you only care about end-to-end lineage information, we can actually propagate this information as we generate the lineage, so that you end up with just the end-to-end information and avoid storing it per operator. So this is an overview of Smoke. And what I wanted to do is give you two examples of how lineage capture works in the system, and then you can extrapolate. For something like filter, what we're gonna do is populate the lineage information, which is represented as this red integer array. So for example, let's take a look at the first row. It's gonna pass the filter because 40 is greater than 20. In addition, we're going to store the input row's ID at the corresponding offset in the integer array. The second row fails the filter, but then the third row passes, and so we're gonna store its corresponding record ID as well. So that's pretty much all we have to do. Lineage in this case is just a simple integer array. It's cheap to create and populate. Now for something more complicated like group-by, it works in two phases. During the build phase, we're gonna augment each bucket. For example, here, when I look at row one, I'm gonna create the bucket and, in addition, augment it to also store a list of record IDs, right? So then when I go to record two and I update that bucket, I'm gonna store its information, and similarly for record three, when I create the new bucket, I'm gonna also allocate this information and store its record ID. Now in the scan phase, all we do is emit the output table as normal, and for the lineage we simply create a top-level array, whose exact size we know, that just points to the RID lists we've already constructed, right? So all of this ends up being super cheap too. None of this is really additional work beyond scanning the data, and lineage capture piggybacks as much as it can off of query execution.
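The two capture strategies above can be sketched directly. This is a simplified Python rendition of the idea, not Smoke's compiled C++ operators: the filter stores one integer per output row, and the hash aggregate augments each bucket with an RID list during the build phase.

```python
def filter_with_lineage(rows, pred):
    """Filter that piggybacks lineage capture on execution:
    lineage[i] holds the input rid of output row i."""
    out, lineage = [], []
    for rid, row in enumerate(rows):
        if pred(row):
            out.append(row)
            lineage.append(rid)    # cheap: one integer append per passing row
    return out, lineage

def groupby_count_with_lineage(rows, key):
    """Two-phase hash aggregation; buckets are augmented with rid lists
    during the build phase, and the scan phase just emits them."""
    buckets = {}                   # key -> [count, rid list]
    for rid, row in enumerate(rows):           # build phase
        b = buckets.setdefault(key(row), [0, []])
        b[0] += 1
        b[1].append(rid)
    out = [(k, c) for k, (c, _) in buckets.items()]   # scan phase
    lineage = [rids for (_, rids) in buckets.values()]
    return out, lineage
```

On the slide's example column [40, 10, 30] with the predicate "> 20", the filter emits [40, 30] with the lineage array [0, 2], exactly the red integer array in the figure.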
So rather than showing you performance graphs, I just wanted to show this in action, that it can actually power interactive applications. So this is a crossfilter visualization over 14 and a half million rows of flight data, and it's gonna be rendered as four charts, each one just a query result, right? Now when you click on a state or a bar, we filter the input table for the corresponding subset and then re-aggregate and re-render each of the other charts. The benchmark will build whatever data structures it needs, then load these initial charts, and then simulate every one of the 1800 possible interactions. Now the one popular approach that's dominant today is to build visualization-specific cube structures to accelerate these interactions. However, recent papers end up taking between four minutes to one hour to build a data cube for this dataset, right? So it's really not realistic for ad hoc cases, and these data cubes themselves limit the types of queries that you can express. So for this benchmark, instead of waiting four minutes, we designed a custom cube that takes about nine seconds to build, just for this benchmark. In addition, here in the top right, this is gonna show the cumulative time taken for each interaction, including the time to load the initial charts. All right, so now the benchmark has started, and what we're doing is waiting for the data cube to be built. And once it is built, we can see that every interaction is super fast, right? And so that's why in the cumulative chart it starts fairly high along the Y axis and then is basically horizontal. Now to show you Smoke in action, what we're gonna do is, when we generate these initial charts, we just instrument them to capture lineage information, right? As if we had created all of these charts in an ad hoc manner.
Now when you click on a state, we can use this backward lineage information to help us find the input subset, and we can use the forward lineage information to incrementally update all the other charts. So you can think of lineage now as an optimization that comes for free; it's helping us do selective view refresh. So now the benchmark has started, and you can see the dashboard loads very, very quickly because lineage capture is actually very, very fast, right? And it turns out that we end up finishing this particular benchmark faster than it took to build the cube. And so what that suggests, right, is that lineage has the possibility of being fully interactive and practical for developing these interactive interfaces. In addition, we can capture and use this information on the fly. So this is work that we did several years ago, and the big question that everyone often asks is: this is kind of a toy engine, and no one in real life is going to instrument their engines to actually do this, right? And who uses a row-based query-compiled engine for analytics, blah, blah, blah. So what about real engines, and can you do it with minimal surgery? So the idea that we've taken here is that we're looking now at columnar engines that are actually used for analytics. And if you think about how something like filter works, the idea is the following, right? If you have something like a quantity column and you have a filter, then the output ends up being a selection vector, right? Where each bit is set depending on whether or not the row passed the filter. If we stare at this compared to the lineage that I showed you earlier, it turns out that they're basically equivalent. A selection vector is a dense representation of the same information that lineage was encoding as well.
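The equivalence is easy to see in code. A hypothetical vectorized filter emits a selection vector (one bit per input row), and decoding that vector yields exactly the rid array a lineage system would have stored; this sketch uses Python lists where a real engine would use packed bitmaps.

```python
def filter_selection_vector(col, pred):
    """Vectorized-style filter: the output is a selection vector
    with one bit per input row."""
    return [1 if pred(v) else 0 for v in col]

def selvec_to_rids(selvec):
    """A selection vector is a dense encoding of backward lineage:
    decoding it recovers the rid array a lineage system would store."""
    return [rid for rid, bit in enumerate(selvec) if bit]
```

For the column [40, 10, 30] and the predicate "> 20", the selection vector [1, 0, 1] decodes to the rid array [0, 2], the same lineage the row-based Smoke filter captured explicitly.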
And so what we've basically found, at least in some late-materialization engines, is that query execution is not just easy to piggyback off of; it is literally computing lineage in many cases. And so what we're doing actively right now, and this is unpublished work that we're still working on, is instrumenting systems like DuckDB, which is an embedded columnar vectorized engine, to see how far we can get using this idea. So what I wanted to show you are just very early numbers of lineage capture on TPC-H with scale factor one, for four of the queries. These blue bars are just the Smoke numbers copied from the original paper. All of these are relative, right? In terms of overhead against the compiled queries, which you can essentially think of as handwritten tight loops. Now SmokedDuck is DuckDB with lineage capture. Those numbers are relative to DuckDB, right? So it's not apples to apples, but what we end up seeing is that the overhead varies between 1.8 to 4.8%. In addition, it turns out that the lineage instrumentation took about four to 20 lines of code per operator, because a lot of the code is just pinning existing arrays in memory. And overall, in terms of memory overhead, it's about 20 to 40 megabytes on TPC-H scale one, right? So what all of this means is that lineage is a critical piece for expressing interactive applications and interfaces, but it's gotta be fast, right? And what we've shown is that lineage is interactive and practical. And so what we're doing right now is studying the extent to which these ideas can apply to other late-materialization engines and how to cleanly integrate lineage management into those engines as well. And what we seem to see is that many Arrow-based analytics engines, like DataFusion, and Arquero in JavaScript, are also amenable to some of these ideas. So we're very excited about this. And so this is how something from database theory can be applied to something in HCI.
So now I wanna shift to a usage of lineage, and this is work led by Lampros Flokas and Weiyuan Wu, with collaborators Jiannan and others. So this animation is from the CHI '17 paper "Same Stats, Different Graphs." All of these datasets have the same summary statistics on the right, but obviously they're very different, right? And visualization is powerful exactly because it shows you what a priori statistics cannot. Now our work seeks to empower users to identify unexpected patterns in the interface and be able to ask why these exist, right? And then to be presented with sensible explanations, which hopefully can help them debug data errors or simply better understand what's going on. So I wanna share two examples, via demos, of our work. The first demo is this idea of query explanations that we introduced in 2012 through the Scorpion paper. So here you can imagine a sensor deployment collecting a bunch of data per minute, right? Across an entire deployment. And we might wanna group by hour, compute the average sensor temperatures, and then plot them in a visualization. Now the user might immediately ask: well, all of this looks kind of periodic and it might look pretty good, but why does this region look so high, right? What's going on there? We'd like some explanation as to what's going on. So this is what I wanted to first demo, if you can see my screen. So this is the exact same data. The blue corresponds to the average temperature on an hourly basis. And you can see there's some daily periodic pattern, right? So what we'd like to be able to do is, literally just like in the slides, ask what is going on here, specify these points as anomalies, and then say, hey, I thought they would look closer to these other points that I just selected. And then I can run the system, called Scorpion. And what it's gonna do is look for potential explanations as to why this might be the case.
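The intervention semantics Scorpion uses can be sketched in a few lines. This is a hypothetical simplification (a flat list of readings with a made-up `sensor`/`temp` schema, a single average, no predicate search): score a candidate predicate by removing its matching tuples, re-running the aggregate, and seeing how far the anomalous value moves.

```python
def avg(xs):
    return sum(xs) / len(xs)

def intervention_effect(rows, predicate, agg=avg):
    """Score a candidate explanation predicate, Scorpion-style:
    re-run the aggregate with the predicate's tuples removed and
    report how much the anomalous value drops."""
    full = agg([r["temp"] for r in rows])
    kept = [r["temp"] for r in rows if not predicate(r)]
    return full - (agg(kept) if kept else full)

# Hypothetical readings for one anomalously hot hour.
readings = [
    {"sensor": 18, "temp": 100.0},   # the suspicious sensor
    {"sensor": 1,  "temp": 20.0},
    {"sensor": 2,  "temp": 22.0},
]
effect = intervention_effect(readings, lambda r: r["sensor"] == 18)
```

Removing sensor 18's readings drops the hourly average substantially, while removing sensor 1's readings actually raises it, so the system would rank the sensor-18 predicate as the better explanation; the real system searches over many such predicates and ranks them.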
And so what we'll see here is... you know, I think I screwed up some demo. No, I didn't screw up the demo, but I need to show you more information. So, you know what? Screw it, I don't care. So here, if we look at sensor 18, and re-visualize just the data from sensor 18, we can see that the data is just totally crazy, right? And it turns out that if you actually rerun this, which I'll do right now because this is a little bit better... I'm just gonna do this again, because I can't show you all the coolness if I'm missing some of the intermediate steps. You're good, keep going, you got it. All right, so we're gonna re-execute this, right? And we're going to look at different rules and see their effects if we remove the data. So what that means is, earlier we saw that sensor 18 is a potential explanation. And if I hover over this now, what it's gonna do is remove all of its readings and then re-render the visualization. And what we can see here is that the anomalies we specified disappeared. And so this can be a useful heuristic for an analyst to then try to understand what's going on, right? And the semantics here are that we've generated this predicate rule that says sensors with low voltage, or, you know, sensor 18 and so on. And what we mean is that if we ignore all the data that matches the predicate and re-execute the query, then it would help address the user's question, right? Their complaint. And there's been really great follow-up work by Peter Bailis, by Sudeepa Roy, Alexandra Meliou, and many other people in this particular area. But what I wanted to show you in another demo is a different example of explanation in the context of machine learning analytics, in our recent RAIN work. And the setup is very similar, right? However, what we know is that machine learning is not only used for end-user recommendations, but it's also often used as one step of a data analytics workflow.
For example, here maybe what I wanna do is first filter to just get the readings for faulty sensors, right? And I have some model that predicts whether or not a sensor is faulty. And then I wanna group them by hour and just count. So maybe this is for a dashboard or a monitoring system. Now here, the user might see that there's been a drop in hours three and four and wanna ask why. Now, existing explanation approaches will help us find issues in the readings dataset. But even if that data is fully correct, we can still have these errors due to errors in the training data, right? Because that can cause the model to mispredict in a way that ultimately causes what the user is asking about. And there are not many good solutions for this problem. Normal data cleaning looks for errors in the training data but ignores how the model or the downstream analysis is used. And influence analysis techniques for machine learning rely on labeling individual model mispredictions, but that doesn't really work if prediction is just one step in this larger analytics process. And so RAIN is the first work to perform training data debugging based on user questions about the analysis outputs. Specifically, what we're doing is estimating how much deleting different training records will affect what the user has asked about: the user's complaint. So I just wanna show you a quick demo of this in action. What I have here is just a very popular dataset, MNIST. And I'm gonna group by its predicted label, and it's just gonna count. So this is the visualization, and we can see that for the most part we expect a uniform distribution in this dataset, right? But we flipped some of the labels. And so, you can imagine the user asks, well, why is seven so high? So I'm gonna specify that I think it's higher than what I expected. And I'd like to see what training data might be the reason.
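The core loop of complaint-driven training data debugging can be sketched with a deliberately toy model. Everything here is a hypothetical stand-in: the "model" is a majority-vote lookup table, and influence is computed by brute-force leave-one-out retraining, which RAIN replaces with a differentiable relaxation to make it tractable.

```python
from collections import Counter, defaultdict

def train(data):
    """Toy stand-in model: majority label per feature bucket."""
    votes = defaultdict(Counter)
    for x, y in data:
        votes[x][y] += 1
    return {x: c.most_common(1)[0][0] for x, c in votes.items()}

def complaint_influence(train_set, test_xs, target_label, expected_count):
    """Rank each training record by how much deleting it moves the
    complained-about output (a count of predictions) toward the
    user's expectation. Brute-force leave-one-out retraining here;
    RAIN estimates this instead of re-training per candidate."""
    def count_with(data):
        model = train(data)
        return sum(1 for x in test_xs if model.get(x) == target_label)
    base_err = abs(count_with(train_set) - expected_count)
    scores = []
    for i in range(len(train_set)):
        loo = train_set[:i] + train_set[i + 1:]
        scores.append(base_err - abs(count_with(loo) - expected_count))
    return scores  # higher = deleting this record helps the complaint more

# The user complains: "the count of '7' predictions should be 0."
scores = complaint_influence(
    [(0, "7"), (1, "1")],   # training set; record 0 is the mislabeled culprit
    test_xs=[0, 1], target_label="7", expected_count=0)
```

The mislabeled record gets the highest score, which is exactly the top-k list the demo shows: the training records whose deletion most decreases the complained-about output value.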
And so here I've returned the top 20 training records where, if you remove them from the training set, it would most decrease this particular output value. And you can see that all of these records highlighted in red correspond to one digits that were mislabeled as seven. You can similarly click on the bar for one and say, why is it so low? And what we would then identify is that, oh, actually, removing all of these mislabeled records will help you with this particular question, right? And so you can see that, depending on the kinds of questions you ask, you'll get better or worse results, because they may be more or less aligned with the type of error that actually affects what the user sees. In fact, an interesting thing is that you can ask seemingly nonsensical questions. For example, you might ask: what kind of data can I delete from my training set so that four actually increases, right? The count of four predictions. And what you'll end up getting is a bunch of digits that look pretty much closer to fours than to other digits. And if you delete them, then that would help the model predict more fours, right? And so it's kind of nonsensical, and you can see how visualization and interfaces are an important component of trying to understand and interpret explanations. All right. This itself, of course, is not a full or complete solution to this problem, but hopefully it demonstrates the value of not just the algorithmic side, but also the interface side. Now, the reason I talk about this at all is because these explanations that we're talking about, right, end up being a function of lineage information. Because what we need to know, for whatever the user is asking questions about, is what data it was derived from and also how it was computed, right? And that is precisely lineage, and what is commonly called data provenance. And so that's why Smoke is useful, right?
It's not just for building and specifying interfaces, but also for enabling these kinds of capabilities efficiently. All right. So I'm done talking about the use of lineage. And I wanted to switch gears and talk about the third project, called Precision Interfaces, which is led by Yiru Chen, with earlier versions led by Thibault, Owen, and Haoci. So we've focused so far on building interfaces and expanding their capabilities, but interface design itself is ultimately dictated by the analysis task, right? Analyses typically start off very ad hoc. You might be programming or writing ad hoc queries, and then you ultimately wanna design an interface once you settle on the set of queries that would actually be useful for an analysis. And yet if we think about how interface creation tools work, they actually work in the reverse fashion, right? They limit themselves to a particular class of queries or analyses and then make creating interfaces for those easy. So here, if we think about the tools for creating interfaces, the X axis corresponds to the expressiveness of the analysis task, and the Y axis is how easy it is to create interfaces using them. Now tools like Metabase and Tableau and Excel and so on make it very easy to build interfaces that are essentially parameterized queries or data cube operations, right? So you can say that for these very, very common classes of analyses, they're fairly limited, but they're very easy to use. But if you want anything more complex, towards the right, then you basically enter the world of writing lots of code or paying someone. And although there are engineering and programmatic libraries to help with this, it's still just too difficult or expensive for the vast majority of potential interfaces to be created, right? So oftentimes they simply will not be. And so the long-term goal that we have for this project is to basically place Precision Interfaces at the top, right?
For any analysis task, how can we make it nearly trivial to actually create the corresponding interfaces? And the philosophy that we take is the following; this is just one approach. If you think about what an interface is, there are really two components. There's a visualization that renders the output of some queries or programs. And then there are interactions that let users change those queries, right? And when you change them, the interface updates. Now, if you imagine listing all the queries that result from all possible combinations of interactions with the interface, you could call this the expressiveness of the interface. In other words, the interface itself is a compact representation of the set of queries that it can produce. And what we would ideally hope is that that set includes what we need for an analysis task. All right, so what our work is trying to do is: given a sample of the analysis task that you would like, we want to derive the latent interface, right? And the core challenge here, and it actually took us a long time to figure out something reasonable, is how you map query strings from the input to fully interactive interfaces, right? We have a series of papers that look at different approaches. And so what I'll give you is a walkthrough of the four main steps in what we currently think is a reasonable approach. First, we're going to model queries as their parse trees. So here I'm going to focus on just the highlighted parts of query one and query two, right? Because that's the part that's different. And we can see that these two parse trees are rooted at equals and have two children. Now the corresponding interface is going to be simple, right? Because ultimately we have to execute these queries and render them. So the simplest interface would just render the output of these queries and visualize them. But we want interactive interfaces sometimes.
And so we introduce a new class of nodes called choice nodes to encode subtree variations. For instance, here we could add a choice node to choose between these two subtrees, right? And so what that means is that you can choose one of its K children, and parameterizing it will result in a standard parse tree that you can execute. You can think of this as generalizing the sargable queries that we know and love today. The nice thing about this is that choice nodes can be directly mapped to interactions. For instance, here, clicking on the second button would correspond to choosing the second child of this choice node, right? And of course, buttons are not the only type of interaction you could map this to; radio buttons can also choose one of many. And so we can rank the different candidates by borrowing from existing interface cost models that can account for things like usability, complexity, screen size, and so on. So that's the third step: costing. And then the fourth step is to recognize that the middle tree here is the result of merging the initial pair of trees by adding this choice node, right? So it went from two trees to one. These types of tree transformations allow us to generate structurally different interfaces as well, like we saw here. Now, we could also imagine refactoring the highlighted equalities out, so that we have a tree rooted at equality, but now you can actually make independent choices for the left and right operands. And so this ends up generalizing beyond the two input queries that we have. This also maps to a different interface, right? Because now we have two interactions instead of one. And in practice, of course, we have many different types of transformation rules and types of choice nodes that we can express beyond this trivial example. But hopefully this gives you a sense of it.
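A minimal sketch of the choice-node idea, with hypothetical names. Picking one of a choice node's K children yields an ordinary, executable parse tree; refactoring the equality out lets the right operand vary independently, generalizing beyond the two input queries:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

@dataclass
class Choice:
    options: list                    # K alternative subtrees

def instantiate(tree, picks):
    """Resolve each Choice in preorder using the next index in picks,
    yielding a standard parse tree that can be executed."""
    if isinstance(tree, Choice):
        return instantiate(tree.options[picks.pop(0)], picks)
    return Node(tree.label, [instantiate(c, picks) for c in tree.children])

def to_sql(tree):
    if not tree.children:
        return tree.label
    left, right = (to_sql(c) for c in tree.children)
    return f"{left} {tree.label} {right}"

usa    = Node("=", [Node("origin"), Node("'USA'")])
europe = Node("=", [Node("origin"), Node("'Europe'")])

# merging the two trees into one by adding a choice node;
# a button click corresponds to picking one of its children
merged = Choice([usa, europe])
print(to_sql(instantiate(merged, [1])))

# refactoring the equality out: one tree rooted at "=" whose right
# operand varies independently, which maps to its own interaction
refactored = Node("=", [Node("origin"),
                        Choice([Node("'USA'"), Node("'Europe'")])])
print(to_sql(instantiate(refactored, [0])))
```

In the real system the transformation rules and choice-node types go well beyond this, but the mechanism of merge, choose, and instantiate is the same shape.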
So to put these steps together, you can imagine initializing the system by parsing the initial input queries into a sequence of trees, and then mapping the tree execution results to visualizations, the choice nodes to interactions, and the tree structure to a particular layout. You can cost the output interface and then either return it or transform the trees further. And so this is the problem statement as well: given a query log, we want to search the space of possible trees and mappings for the lowest-cost interface that expresses all of them, and potentially more. Maybe you said this already, but what is the cost function here? The cost function we borrow from the existing UI literature. You can think of things like personal preference, usability, how easy it is to express the analysis workload, whether there are too many options in a radio button list, and so on, right? So there are different ways you can cost it. What we use is both a per-interaction cost and the expected effort the user needs to take to express the input queries in sequence. So this is stuff you can compute offline? You don't need to put humans in front of this and test it? Yeah, so you could put humans in front of different canned interfaces offline, train some model, and then use that, right? So, correct, yeah. And, implementation-wise, we use Monte Carlo tree search. It's basically randomly searching the space, and it's a little bit better than that because it uses some reinforcement learning ideas. Okay, but the cool thing is a demo, right? So let's take a look at a demo. Here, hopefully you can see the queries. So I'll zoom in here. I just have a simple cars dataset, and maybe initially I want to just look at miles per gallon for different horsepower.
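As a rough, offline-computable illustration of the kind of cost described, here is a sketch with entirely made-up weights and widget names (the actual model borrows from the UI literature and the search uses Monte Carlo tree search):

```python
# Hypothetical cost: a static per-widget usability/clutter cost, plus the
# expected interaction effort for a user to express each query in the log.
WIDGET_COST = {"button": 1.0, "radio": 1.5, "slider": 2.0, "textbox": 4.0}

def interface_cost(widgets, steps_per_query):
    clutter = sum(WIDGET_COST[w] for w in widgets)
    effort = sum(steps_per_query) / max(len(steps_per_query), 1)
    return clutter + effort

# candidate A: one free-form textbox per input query (no real interactivity)
a = interface_cost(["textbox", "textbox", "textbox"], [1, 1, 1])
# candidate B: a slider plus a radio list covering the same three queries
b = interface_cost(["slider", "radio"], [1, 2, 2])
print(a, b)   # the search keeps whichever candidate costs less
```

Because nothing here requires a live user, a search procedure can evaluate thousands of candidate interfaces this way before returning one.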
And then maybe now the analysis has changed, so we want to see a different range for the horsepower, right? So this is fairly simple, and what we're going to do is run the system for a little bit. And we can see that it immediately generates an interactive interface that you can use to filter and avoid writing queries. But then what if I actually have a more complicated query structure? For example, now I want to actually change the subquery. Here I'm going to use a very simple example for demo purposes, but let's say I just care about cars from the USA because, you know... I'll stop there. And so here, what we can see is that we've actually generated a weird interface, right? It turns out that for these particular three queries, it actually might be best to just allow you to specify each of these queries individually. And maybe that's totally nonsensical, right? But if I now say I actually care about Europe, then Precision Interfaces in the background will crunch for a little bit, and it'll actually generate a different type of interface. So you still have the slider as normal, right? This range slider. But you can now select either just the full cars table, or you can manipulate the subquery structure itself, right? So what this basically does is, when you click on Europe, it switches to the subquery that it corresponds to and then manipulates this text string here. And then maybe you want to do something else, right? Maybe you also want to look at horsepower, and look at the average miles per gallon from this dataset grouped by horsepower. And if you run something like this, then Precision Interfaces will chug for a couple of seconds, and what it'll end up generating is a multi-view visualization, where it actually visualizes horsepower versus miles per gallon here, right? And it allows you to interact with it directly.
And so you can use this visualization to then interact and see what's going on in the initial query, right? And you can also, of course, use these other controls as well. And so by just writing a few queries, or you can imagine mining some existing query log, we can identify and generate a fully interactive multi-view visualization that you can use in lieu of programming, right? If it turns out to work well. So that's a demo of the types of functionality, to show that it is possible to do this, right? And the nice thing here is that you can have arbitrarily complicated queries, as well as transformations of their substructures, because what we're doing is just encoding all of this as transformations on these trees. So just to wrap things up, I've talked about these three projects in this data visualization management system, and hopefully I've convinced you that it's pretty interesting. I want to connect it now with the rest of our projects to see how it all pieces together. The interfaces that we generate in this last part are in an internal declarative representation. It's very similar to the trees and choice nodes that we saw earlier, right? Because what we want to capture in interfaces is not really the front-end rendering, but the data flows that back it, because that's the hard part in a lot of modern interfaces. Physical visualization design is another project that I didn't talk about. It consumes this representation, and its goal is to recommend a system architecture to optimize the interactions. It's very similar to physical database design, if you're familiar with that, but the data structures are created and stored outside the database, and we view contacting the database as the slow path. The reason is that there's lots of work on new custom systems and data structures in papers out there today that are often external to the database system itself, right?
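To make the idea of an internal declarative interface representation concrete, here is a hypothetical sketch; the field names and structure are my own invention for illustration, not the system's actual format. The point is that it captures data flows, choice-to-widget mappings, and views, rather than front-end rendering code, so a physical designer can consume it:

```python
# Invented declarative interface spec: data flows (parameterized queries),
# choice nodes mapped to widgets, and query outputs mapped to views.
interface_spec = {
    "flows": {
        "q1": "SELECT hp, avg(mpg) FROM cars "
              "WHERE hp BETWEEN ?lo AND ?hi GROUP BY hp",
    },
    "choices": {
        "lo": {"widget": "range-slider-min", "domain": [0, 250]},
        "hi": {"widget": "range-slider-max", "domain": [0, 250]},
    },
    "views": {
        "q1": {"mark": "bar", "x": "hp", "y": "avg(mpg)"},
    },
}

def widgets(spec):
    """List the interactions a front end must instantiate; everything else
    (indexing, prefetching, placement) is left to the physical designer."""
    return sorted(c["widget"] for c in spec["choices"].values())

print(widgets(interface_spec))
```

Nothing in such a spec says where the data structures live, which is exactly what leaves room for a physical design step to choose an architecture.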
And it also tries to account for the client and communication. So what this allows us to do is design an interface, either manually or using Precision Interfaces, point it at Snowflake or some remote database, and just fill in the rest of the architecture. And of course this can now execute and leverage a lot of the system primitives that we've been developing, such as Smoke for lineage, and interaction and communication management in Khameleon, as well as query optimization. In addition, because we're generating these interfaces, we can imagine embedding these novel interactions and functionalities into the interfaces that we generate, optimize and deploy. In other words, we think of Precision Interfaces, and the reason I talk about it, as a way of bootstrapping the use of a lot of these other projects that we have and that we're working on. And so if there's one thing you take away from this talk, it's that data management and visualization and HCI are very interesting standalone, right? But really, at the intersection there are lots and lots of really interesting and rich research problems that can benefit both communities. We've tried to study some of them in this DVMS, this data visualization management system, at the levels of functionality, interface design and system primitives. So at this point, this is my URL and a bunch of pictures of ninjas. All right, thanks everyone. All right, awesome. I will clap on behalf of everyone else for Eugene. So we have time for questions. If you have any questions for Eugene, please unmute yourself and fire away. Okay, so my question is about the Precision Interfaces stuff. That seems super cool. Like, you're taking the SQL queries they throw at you and then trying to turn them into an interface.
Like, I understand there's a cost function, and maybe I don't fully understand how it can handle this, but what happens if users are writing stupid queries? Or, it's not necessarily stupid queries, but they're trying to figure out what the data looks like to begin with, without asking questions. Like, how do you prevent them from polluting the interface? And also, because I can imagine that what you really want to do is let a knowledgeable person write the SQL queries, and then the novice users don't get direct SQL access; they instead get the interface that's generated. So how do you prevent, and it's not an adversarial thing, as if they're trying to hurt you or hurt the interface, it's just that they don't know what they're doing. Definitely, definitely. I think of Precision Interfaces as the underlying mechanism, right? It just takes a sequence of queries and it generates stuff. Now, you can imagine many ways of using this. So one that I demonstrated here is that a knowledgeable person writes perfect queries that are well crafted for a demo. But another could be that you do this ad hoc analysis at the command line or using other complex tools, and then, once you've decided this is the analysis I want to share with other people, or that I think I'll do repeatedly, you want to basically log those queries and generate something specialized, right? So that's another way you can imagine using it. Or, for instance, maybe you're writing these queries and then you just mark the ones that you think are useful, right? Or you could explicitly use it as an interface generation tool by directly writing queries that you know will be useful, right? So that is if you're using it specifically as a design tool.
There are many limitations and opportunities too. For example, we generate one interface, but how much you want to generalize beyond the queries that the user gave you is actually a parameter that you want to be able to control, right? And maybe you personally have preferences on the layout or on the types of visualizations and so on, and it turns out that you'd like to be able to specify those and use them as hard constraints, let's say. So I think of this as an optimization problem, and then there's what you can use it for. We're currently at the point where we think this is a good formulation of the problem, and we're hoping to write it up later this year. Yeah. Okay, okay. And then sort of a follow-up question: you know, in SQL you can write the same query, or essentially the same query, in a bunch of ways. How well can it handle that? Does it always generate the same interface? Like, if I use nested queries or CTEs, but the high-level answer is the same thing. Like, when I ask my students in the first homework assignment to write SQL queries answering questions written in English, they come back with all sorts of crazy different ways to write them. Would your thing still generate the same interface? Or... I think it would probably generate something different, because as you can see, we actually just look at the syntax and some database statistics and schema information. So we know nothing about the internals, right? We're not looking at the query plan or the logical plan or anything like that. If you do look at the query plan, what could you do? So I think if we looked at the query plan, it would allow us to do more canonicalization, right?
So I think that would end up generating equivalent interfaces if the queries are equivalent; the closer you are to the semantics, the closer you are to canonicalizing. I would say, I imagine that, as with any tool, you can use it in unintended ways and intended ways, right? And so I would imagine that you probably don't want to use this by just throwing a bunch of random queries at it and generating something. You would probably end up using it to mine, for example. I could imagine having some interface complexity measure, or some kind of entropy measure, where you can say, oh, this interface is actually reasonable by some criteria, and then just scanning through query logs and looking for subsequences that are self-consistent, right? So that's one crazy thing you can imagine doing. But ultimately, people will use tools based on where they shine, and probably not for the corner cases necessarily. So it probably isn't good for the setting you described, at least for now. I wonder if you could take Tableau, load some public data that they have, set up a basic visualization, but capture the queries, and then run the queries you captured. Yeah, we've done that in the past. One of our goals is to be able to synthesize the Tableau interface, right? Because its primary degrees of freedom are the group-by clause, the where clause, and the project clause. So it's really just a few degrees of freedom. Another way you can think about it, right, another baseline, would be: how would you write interfaces today? You would literally write and construct query strings, right? And so at minimum, this generalizes sargable things. So for example, we can generate only syntactically correct statements, right?
And you can imagine doing some checks to make sure that you only execute queries that are not faulty, right? So those are things that you could do with this sort of functionality. So the baseline I imagine is the alternative to creating an interface, which is writing a bunch of query templates and then filling in query strings. Yeah, there we go. Okay. I had a question about Smoke. Yeah. So in Smoke, I guess if you're running in DuckDB it sort of answers my question, but is there some level of complexity of the query, like many layers of nesting, or converted queries, CTEs, where you would lose information? It's all at the physical level. Okay, so you don't lose anything. Yeah. Okay, cool, good, good, good. Okay, any other questions from the audience? Eugene has a one-year-old and I have a one-year-old, and I'm gonna have to take care of them. Oh, he's at the office, so he doesn't have to. Oh, I think the nanny leaves in 10 minutes, so. Okay, so you gotta keep going. Okay, any last question for Eugene?