Okay, it's a pleasure to welcome Jennifer Widom. Many of you probably know a little bit about Jennifer. Among the many things she has done, which have earned her justifiable fame, one of the more recent ones was the first MOOC in the area of databases, and in fact one of the very first MOOCs at all, which she offered about five years ago. That was an eye-opener for everybody when she got a hundred thousand people signing up for her MOOC. In addition to that, she has a long history in database systems. She's currently a professor at Stanford and also associate dean for faculty and academic affairs. She has an interesting background: her bachelor's degree is actually in music, not even in computer science, but there seem to be some nice connections between music and computer science; I know of a few others who also moved from music to computer science. She then did a PhD at Cornell in '87 and was at IBM for a while before joining Stanford. She's an ACM Fellow, a member of the National Academy of Engineering and the American Academy of Arts and Sciences, and so forth, a lot of achievements and honors. And she has also done a lot of other interesting stuff. She has actually sailed to India. She must be one of the few people in the modern era who have actually sailed on her own boat with her family from Thailand to the Andamans, and she's been camping and trekking along the Rocky Mountains and in Ladakh. She's seen more of India than I have. So it's a pleasure to have her here today, and she'll tell us about some of the interesting things she's done. Thank you very much.

How's the sound on the microphone? Is it good? I generally have a loud voice, so I don't want it to be too loud. So I'm going to talk today about my three favorite results, as you can see from the title of the talk. I was asked to give a general talk recently for the SIGMOD conference, and it's actually hard to figure out how to give a sort of retrospective talk on what one has done.
But recently I had been somewhere where people said, any time you need to do something, three is a magic number. So I said, all right, let me just think about the three favorite things that I've worked on in my research career, my three favorite results, and put the talk together around that. And so that's what I'm going to do today: tell you about my three favorite results. I generally like interactive talks, so I very much appreciate questions during the talk; we have time for questions during the talk. If anything is unclear, please feel free to raise your hand. And there may be one or two places where I ask you questions just to make sure you're engaged, so be ready for that. Okay. So let me start by defining what favorite means. I think we know what three means. Result is sort of a matter of opinion, but you'll see what I consider a result. But favorite: things can be favorites for different reasons, and so I want to tell you right up front what favorite means for me. A favorite result is not, for me, one that won a best paper or a test-of-time award. And it's not even necessarily the result with the most influence. It's really my personal philosophical favorites, and for each one I'm going to tell you exactly why it is a favorite result. Okay. And I'm not going to spring any of these on you, no surprises: I'm going to start by telling you what the three results are, and then we'll talk about them one at a time. So here are my three favorite results over my long career; you mentioned the year of my PhD, which was before many of you were born, I'm afraid to say. The first one is called data guides. Data guides is in the area of semi-structured data, and that result was from 1997. The second one is called CQL, which stands for the Continuous Query Language; it's in the area of data streams, and that was around 2002. And the third one I'm going to tell you about is ULDBs.
It stands for Uncertain Lineage Databases, and that was in the area of uncertain data, around 2006. Sorry to say, I haven't had a favorite result in the last almost ten years. I did spend part of that time sailing to India; I did take a year off, actually, for travel, including the sailing that was referenced. And I've been department chair and all kinds of things, but those aren't great excuses. In any case, these are my three favorite results, and we're going to go through them one at a time. Before we do that, I want to tell you about the Stanford Info Lab patented five-part paper introduction. That's a mouthful. The Stanford Info Lab is our database group; Info Lab sounds a little fancier. We have a few faculty in databases, and we work frequently as a group. And in that group we've developed a methodology for writing the introduction to a paper, and in fact that same methodology works very well for talks. It's basically five questions that you need to answer in the introduction when describing, say, some result that you got. So: guaranteed paper acceptance if you use this five-step strategy. Top-secret Stanford Info Lab strategy. Unfortunately we've had a few things rejected, but I do think it's a good methodology, and I'm going to be using it during the talk. So when you want to introduce a result, and this is typically, like I said, in a paper or a talk, the first thing that you need to do is identify what the problem is that you're solving. It's surprising how many people forget to say what problem they're actually solving when they present a result. So: what is the actual problem? Second, why is it important? Nobody wants to hear about a result if it's not important for any reason. Third, why is it hard? You have to motivate why this was a difficult problem to solve. Fourth, why hasn't it been solved already? And fifth, finally, what is our solution?
And when we have students write papers, we literally make them answer these questions in the introduction, even writing the headings right there, and then before the paper is finalized we'll take out the little headings. But it's a good methodology. For today, when I give you my three results, I'm going to be answering these questions for each of them, but I'm also going to be answering a sixth question, which is: why is it a favorite result? [Audience question about the fourth question.] Well, if the only reason it hadn't been solved already is that it was extremely hard, I probably wouldn't be able to solve it either. What question four really means is: what have people done to chip away at the problem? What's the landscape of work that's already been done, and how am I advancing it? Yeah, that's a good point, very good point, and thank you for the audience interaction; like I said, I like it a lot.

Okay, so I'm going to launch right into the first result, which was called data guides. I'm going to start by giving you the context for the work, and again you have to wind way back to 1997, which only a few of you in the room can actually do realistically, but I'll wind back for you. The project was a project called LORE, which stood for Lightweight Object Repository. We were actually working on a data integration project, a project for bringing together multiple sources of data and trying to coordinate them, and in the context of that we developed a data model for semi-structured data, which seemed useful for integrating data. And then the LORE project was specifically building a database system to manage that type of semi-structured data. So if you could read these words, which you can't, you'd see LORE is a database management system; here it says for XML, but we started with semi-structured data and then transitioned to XML. The student who worked on data guides was a student called Roy Goldman; I want to make sure to give everybody the proper credit here.
Okay, so as I said, we were working on a project on data integration. We were looking at semi-structured data, which seemed like the right thing to do, because when you do data integration, different sources have different data and you can't have a perfect matching of the structure. And we invented what we wanted our model, or our representation for data, to be, and we decided on a directed labeled graph, and we called it the Object Exchange Model. I'm going to be using examples that come from, literally, 1997; I took them out of the papers. The example that we frequently used was an example about restaurants. I'm not sure how easily you can see this, but I just want to point out this is a directed labeled graph: we have a root object, and then below it we have other objects that are reached by labels. So this says we have two restaurants and we have a bar. (When I turn, can you still hear the microphone okay? Not a problem, okay.) So this restaurant has a name, an entree, a phone, and an owner, and you can see down here there's the actual data at the leaves. There's a huge mistake in this data that someone pointed out just recently with this talk, and it's relevant to India: Darbar is an Indian restaurant, and I don't think they would serve beef curry at an Indian restaurant. Someone pointed that out to me recently, but somehow we had that in our running example. Anyway, what you want to notice here is that it's semi-structured data: not everything has to have the same structure, we can have some shared data down here, the bar all it has is a name, and so forth. So this is a directed labeled graph; this was the model that we were using. Now, I want to say immediately that the same data could also be represented in other models you might think of for semi-structured data. So, for example, here's exactly the same data encoded in XML, and as I mentioned, XML emerged while we were working on the project, and we transitioned from the Object Exchange Model
to XML. Nowadays you might think of JSON as a semi-structured data representation, and here's that same data. All of these share the fact that they can have irregular structure and that they're self-describing: you don't have a separate schema. By the way, just so I can get a sense, how many people here are familiar with XML? Everybody, good. JSON? Most, okay, good, just to understand. Okay, so again, semi-structured data: what characterizes it is that you don't have a fixed schema, and it's self-describing because the schema is kind of in the data there. That is also true, of course, in the Object Exchange Model. Okay, so that's what I'm going to use, the Object Exchange Model, but you could think about it as XML if you'd like. Okay, so back to the patented five-part paper introduction. Now we're going to go through these one at a time: what is the problem, why is it important, and so forth. Okay, so what is the problem that was solved by this favorite result called data guides? We solved the problem of semi-structured data lacking a fixed schema, or I'll say that's the problem that we solved. And you might say, well, that's an obvious problem; that's the whole point, that semi-structured data doesn't have a fixed schema, so that's pretty easy. Semi-structured data was even called schema-less or self-describing. How many people know what a schema is? Again, just checking. Okay, I've got a savvy audience here, good. Okay, so why is this problem important? Well, if you think about database management systems, they rely on a schema for a whole bunch of things. In most database systems, before you can put any data in the database, you declare the schema: you say, here's the structure of the data, here's what it's going to be. And the system will use it to store statistics about the data, because it knows what the structure is; it will use it to build indexes over the data; it will use it for things like checking the validity of a query, because if you look at a query, a query is going to
describe the data, and the schema is needed to check whether those descriptions are type-correct, for example, or even exist. Even for the simplest thing, taking a SQL query's select * and expanding it to the attributes that come out in the answer, the schema is used. It's just used all over the place. Actually, if you think about a browser for a database, or a query-builder interface, the schema is key to that as well. So if you start trying to build a database system without a schema, which is what we did, you quickly get into trouble. Okay, so I hope I've convinced you it's important. Well, why is it hard? First of all, if you have this semi-structured, self-describing data, you need to define what schema even means, because you can't just say this is the structure and this is going to be it; in fact, the schema is in the data. So, second, what you need to do, since the schema is sort of implicit, is have an algorithm to infer the schema from the data. Third, the schema might change any time the data changes. In a regular database system, you declare the schema and it never changes, or if you change it, it's a major upheaval on the database; here, every time you insert or delete data you might have a change to the schema, so you need some way to incrementally modify the schema as the data is updated. And then, the schema can be as large as the data: if your data is extremely unstructured, maybe each element is a different type, and so your schema is just the same size as the data. Okay, so hopefully I've convinced you that it's a hard problem. That was number three, why is it hard. Why hasn't it been solved already? Well, back in 1997 that's easy to answer: at that time we were actually the only group that was building a database system for semi-structured data. There was a lot of work on semi-structured data; like I mentioned, when we started doing it, we were using it for data exchange, or data integration. But it isn't until you build the database system that you actually realize how important the schema is.
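To make that concrete, here is a minimal Python sketch, not the LORE implementation, of the restaurant running example as self-describing data, with the implicit "schema" inferred by walking the data itself. The specific values (phone number, owner, entrees) are invented for illustration; only the restaurant/bar shape comes from the talk.

```python
# The restaurant example as self-describing, semi-structured data:
# the two restaurants have different attributes, and the bar has only a name.
restaurants_db = {
    "restaurant": [
        {"name": "Chef's Place", "entree": ["lemon chicken"],
         "phone": "555-1234", "owner": "Chu"},          # hypothetical values
        {"name": "Darbar", "entree": ["kabob", "curry"],
         "manager": "Singh"},                            # different structure
    ],
    "bar": [{"name": "Rose & Crown"}],
}

def label_paths(obj, prefix=()):
    """Infer the implicit schema: every unique path of labels in the data."""
    paths = set()
    if isinstance(obj, dict):
        for label, child in obj.items():
            paths.add(prefix + (label,))
            paths |= label_paths(child, prefix + (label,))
    elif isinstance(obj, list):
        for child in obj:
            paths |= label_paths(child, prefix)
    return paths

for p in sorted(label_paths(restaurants_db)):
    print(".".join(p))
```

Note that any insert or delete can change the inferred set of paths, which is exactly the incremental-maintenance difficulty described above.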
Okay, so finally, what is our solution to the problem? Our solution was to define something that we didn't quite call a schema; we called it a structural summary of semi-structured data, and we gave a formal definition for what that means. I'm going to go through each of these fairly lightly, because I don't have time to give a whole hour on this one topic: a formal definition; algorithms for inferring and updating the structural summary; how it's used for indexing and statistics, which are two of the things a schema is used for; how it's used for query processing; and how it's used for the user interface. But we're going to plow through this pretty quickly. So here's a reminder of our example: we have the two restaurants with various sub-objects, or attributes, and then that one bar. Okay, first I'll give the formal definition, and then I'll show you the structural summary, the data guide, for that database. We originally called them structural summaries; one day we decided, let's call it a data guide, it's kind of a cool name, and I think that's partly why it stuck so well. Okay, so here's the formal definition. First, the data guide is an object in the object model, so it's going to be represented in the same model. In other databases the schema is represented in the same model as well; in relational databases the schema is actually stored in a table. Second, and here comes the important part, every unique path of labels in the database, so every path you can go through with labels, is going to appear in this data guide exactly one time, and I'll show you that in a moment. And third, there are no extraneous paths: if you have a path in the data guide, there is a path in the database that has the same labels. Okay, so here's our example again, and here is the data guide. So again, the database and the data guide, and if you look, it satisfies all three of these; you could verify this offline, but it is
obviously an OEM object. Every path in the database is here: we had a restaurant phone, we had a restaurant manager, everything that appeared in the database appears here. And furthermore, every path that appears here corresponds to a path in the database. Okay, so that is the data guide in this case. All right, let me tell you now a little bit about how the data guide is used. And by the way, a question I often get, and I'll just deflect this one right away: we allowed cyclic data in the OEM model. In XML it's harder to get cyclic data, but we allowed cycles in the model here. Cycles in the data would result in cycles in the data guide, and the definition would then be satisfied even for paths of arbitrary length. So if you could go, you know, A-B-C, A-B-C, around and around in the database, you would be able to do exactly the same thing in the data guide. And that's partly what made it hard, to tell you the truth. Okay, so, indexing and statistics. We have this data guide sitting there for this unstructured database, and then we augmented it with other information that served as our indexes and our statistics for the database. One thing we did is, at every node in the data guide, we stored the object IDs for the corresponding nodes in the database. So right here, for restaurant entree, the three nodes in the database that were reachable by restaurant-entree were 6, 10, and 11. That means if you have a query that's asking for the entrees of restaurants, you can actually just go down here, straight to these objects, and get them; you don't have to explore the whole database. This is effectively what was already known as a path index, so we didn't invent the idea of an index that follows paths, but we could put it there in the data guide. And that was not only for the leaf nodes but also the internal nodes: so for restaurants here, these are the two nodes that were restaurants in the original database. We also added sample values, and
you'll see those in our user interface. For example, if there were a large number of values, we would just take a few, just so we could show the user. So these were example values of what you could reach down the restaurant-name path. Next, query processing. What we did for query processing is, we didn't want to give errors, actually. One of the philosophies when you query semi-structured databases is that sometimes you don't really know what's in there, or things that used to be there are not there any longer. So instead of generating errors, we had a warning system, and we used the data guide to generate warnings when queries tried to follow paths that didn't exist. For example, if someone said, I want to find the entrees of the bars in the database: right now there are no entrees for bars; maybe there were before, maybe there will be later. So instead of generating an error for that query, we would give a warning. And we would do that not by exploring the whole database, because if you have a million bars, you don't want to explore all of them to discover none have an entree; we would do that using the data guide, and it was much, much more efficient. And then we did something similar to that expansion of select * that I described for regular SQL databases. In our query language we had, sort of, regular expressions to navigate down paths of the database. So for example, this query here says, find phones and addresses that follow any path; that star will match any path of any length, so find any phones or addresses in the database is what this is saying; again, it's a regular expression. If you try to evaluate that over the database, you would have to explore the whole database to look for phones and addresses, but if you use the data guide, you can look and see which types of elements have phones and addresses, and then only explore those in the database. So again, it's a kind of indexing
in a certain way. Okay, any questions yet on this? So far so good? Okay, we're going to go into the user interface a little bit. These user interfaces are going to look very old to you; again, this was 1997, and in 1997 they looked really cool, I'll just say that right now. Many database browsers, again, start with the schema of the database: if you use a regular relational database, they'll show you the tables. What we showed, then, was the data guide. This is going to be a little hard for you to read, but you don't have to read it. One of the databases that we used as an example was a database about our group, about our database group; that was a sort of medium-sized database, and it was semi-structured. So this is a browser for that database; this is actually the data guide. It says the group has group members, projects, and publications; the members have name, email, and so on. You can think of this as the graph, and of course not everybody has to have all of those elements, so this is the data guide, satisfying the properties of the data guide. You can open and close things and so on, pretty standard. So users could browse the data guide, and then they could also look at information about specific paths. This one says, I want to learn about group members' original homes, and it gives sample values of where they came from. And then you could also formulate queries from this: you could select things for the query result, and you could put conditions, and they would go back to the data guide. So it's a visual query generator from the data guide, and this would be a query saying, I want group members whose position matches the string "student", who came from Nevada or New York, and who have been at Stanford more than two years. So this is how we used the data guide for visual query formulation. Okay, so that's a quick browse through some of the things we used the data guide for; again, we use it for things a schema is traditionally used
for, but the main goal was to figure this out for semi-structured data. Okay, so: why is it hard? I'm sorry, I swapped the order of some of these questions; I should have mentioned that I was doing that. It's hard to explain why something is hard without at least explaining a little bit about what it is. Okay, so there were some serious technical challenges. First of all, the data guide, by the definition I gave you, is not always unique. You can have multiple different data guides, multiple different objects, you know, graphs, for a database that satisfy those two properties: that every path in the database is in the data guide, and every path in the data guide is in the database. That's especially so when you have, you know, a DAG or cycles. So we defined something called the strong data guide, and the most interesting thing is that the strong data guide was not the minimal data guide. I'm not going to go into exactly what the strong data guide is; effectively it said that the sets of objects reachable by each path are unique, but it's too technical to go into here. The challenge was that it wasn't just the minimal one that we wanted to use, and the fact that there were multiple ones, and that some were better than others for the purposes we were using it for. Yeah, question? [Audience question.] You will only have a cycle in the data guide if you have a cycle in the database. In the database, if you have a cycle, then you have an infinite number of paths, because every path in the database, which could be arbitrarily long, will be a path in the data guide. Yeah, the cycle thing was tricky, right, very tricky. Yep. But even with fairly simple databases, you can have different data guides for the same database that satisfy the properties, and some are better than others. Yeah. Second, the data guide isn't always small. [Audience question.] Right, so, let me go one more bullet and come back to your
question. Okay, I said the data guide isn't always small; sometimes it could be very large compared to the database. In fact, it could grow bigger than the database to satisfy our properties of a strong data guide. So we introduced the notion of an approximate data guide, which would relax the requirement that every path in the data guide appears in the database, and then we could actually make it smaller. And third, and this is why I waited for a moment: constructing the data guide is similar to converting a nondeterministic finite automaton to a deterministic one, and there's a notion of minimality there, and that notion of minimality just wasn't right for what we wanted, so we relaxed that notion of minimality. What we wanted really was to have a unique set of objects for our path index; that was what turned out to be important. And by the way, that particular construction is easy for trees, harder for DAGs, even harder when you have cycles, and there are incremental maintenance issues as well. Now, what was your question, and did I answer it? This answers it, yeah? I thought it might. Yes, so it's relaxing this third requirement. The second requirement said every path in the database appears in the data guide; the third said every path that appears in the data guide appears in the database. By relaxing that third one, we were able to make smaller data guides, but some of those paths wouldn't exist; so basically we're merging things that really shouldn't be merged, was the basic idea. Okay, the last question I promised to answer: why is it a favorite? This one is a real favorite, and the reason is, first of all, because when we did it there were challenges of every type: we had to develop the foundations, just the basic definitions; there were real algorithmic challenges; and implementing it was a challenge as well. So I liked that. And then it had applications of every type: storage
structures, query processing, user interface, very different applications of this particular result. The name, I think the name really stuck; people are still actually talking about data guides today, believe it or not, and it was a long time ago. And the last thing I'm going to say is, among all of my results, this is the one that wins in terms of tenacity and longevity. The student, Roy Goldman, graduated way back around that time, so now 18 years ago, and I have this habit: whenever someone talks about data guides, I send him an email and say, they're still talking about data guides. And I still send him emails periodically. Now, I think if it had a really crummy name, I don't know whether it would have persisted quite as well. When we started, actually, we started by calling it representative objects, and then we switched to structural summary, and then we switched to data guide. We'll never know whether the name was a contributor, but people are still using this result today, so that certainly makes it a favorite for me. Okay, so that's the end of favorite number one. Any questions, or more questions? Yes? [Audience question about sample values.] No, that doesn't contribute to minimality; minimality was just in the structure, and the sample values were random, more or less. No, we tried to make it clear that those were just sample values. Those were really just for people, more for exploration of the database, to see the kinds of values that were in there, and they were chosen randomly. Yeah. So we made no guarantees about that. That's a good question. Okay, so let's move on to number two, which is CQL, the Continuous Query Language. I'll again start with the context, which is around 2002. We were building at this time a new kind of database system; again, this is what I like to do, actually build new kinds of database systems. And this one was to manage data streams. You may have heard of data streams; the basic idea is that instead of having a monolithic database that sits
on the disk, is relatively stable, and is queried periodically, the data is coming as rapid streams. It could be sensor data, stock tickers, whatever, a Twitter feed, which didn't exist at the time. So the data is streaming in, and the model there is, instead of asking a query and getting an answer, you actually register a query, and then as the data streams by, the answers to the query continuously stream out. That's why it's called a continuous query: the query sits there and the data goes through, unlike traditional databases, where the data sits there and the queries go through. Okay. And the students involved in that were actually both Indian students, but neither from this IIT, we think we determined; neither of them came from here, but I don't remember which ones they came from: Arvind Arasu and Shivnath Babu. Arvind is now at Microsoft Research, and Shivnath is a faculty member at Duke University. Okay, so I'm going to just go through the five things again. What is the problem? The problem is, we were building a database system for data streams, and we needed a declarative query language. Sorry, I went a little fast there. You probably all know, but a declarative query language is a query language where you describe what you want out of the database; you don't describe how to get it. So, I want all employees that earn more than their manager: you just say that in the query, not, you know, open this file and iterate through these tuples and then open that one and match. But again, it seems like a fairly sophisticated audience, so you probably know that. Why is it important? I've always thought there were two key things to a database system: a declarative query language and transaction processing. So I would say a declarative query language is a key component of any database system, and again, I guess I'd say it's pretty obvious that that's something important if you're building a new kind of database system. All right, so why is it hard? And
this I'm going to motivate quite a bit. If we want to reuse the SQL language, to some extent, in the area of data streams, it turns out the semantics, the meaning of the queries, can be very subtle, and we'll see examples that show that. In fact, I think even if you don't reuse SQL, the semantics of continuous queries over data streams are not at all obvious. And the semantics can actually have a significant effect on the implementation. I'm not going to have time to talk about that today, but if you ask a query on a stream, remember, the query sits there and the stream goes through, what your queries can express, what they mean, could make the difference between being able to store a very small amount of data and continuously answer the query, versus having to store all the data, growing infinitely large. And just small differences in query semantics can make that difference between storing almost nothing and storing arbitrarily large amounts of data. Okay, so I'm going to stick with this "why is it hard" for a minute here, and actually go into a specific example. How many people know SQL? Okay, good. How many people know the continuous variation of SQL? Okay, right, not many, but you've probably seen these types of examples. Okay, so this example is going to have a stream, and it's going to have a table. This is going to be an example that's keeping track of people who are accessing web pages, let's say. So the stream is going to be a stream of people looking at a URL; this is going to be the URL and a user ID. It's just going to keep saying, this URL was looked at by this user, this URL by this user, and so on, lots of those coming along quickly. And then for users we're going to have a table of information, so that's relatively stable; it's going to have the user ID and the age of that user. Okay, so the continuous query that I'm going to ask is the
average age of the viewers by URL over the last five minutes. At every point in time I want to know, for each URL that was visited in the last five minutes, and I'm going to presume a lot of people visited it, what is the average age of those users? Okay, so here's the query. The only thing that's different in the syntactic expression of this query is this window here, which is a standard windowing construct in data streams that says, look at the last five minutes of a stream. So this is just a very easy query: it says, take the last five minutes of the page views, then take the user table and join them, that's all I need to do; group by the URL, and give me the URL and the average age. Okay, so very straightforward, right? Everybody comfortable with that? Okay, what's the result of this query? Now I'm going to show you that the semantics is subtle. Is this query giving me a stream? Is it giving me a relation? Something else? I don't think the answer to that is obvious. The query was easy to express using SQL, but what it actually means in the environment of a rapid data stream is not obvious. Second of all, what if you've got your five-minute window of page views, and in the middle of that a user's age changes? Does that change the result of the query? Did I want that to change, or do I want the age at the time that they viewed the page? Not obvious either, in my opinion. Okay, so that should motivate why it's hard, I hope, and we're going to come back to that example. Then the question is, why, at the time, hadn't it been solved already? At that time there were actually a few groups who were building databases for data streams. I'm just going to say the others didn't seem to worry that much about query semantics. I came from a programming-languages background; I actually like to know what a query means before I implement it, whereas some of the other groups were just going more straight into the implementation. Okay, so now let's go into what our solution is. So the
way we solved this problem, the philosophical way we solved it, was that we decided to rely very heavily on the semantics of relational database systems. Relational database systems have a known semantics; people know what SQL or relational algebra means. So if we could mostly rely on that and just extend it a little bit for data streams, maybe we would get something that was understandable. So we have relations, and again, people really know what it means to run a SQL query or relational algebra on relations, so we're going to start with that knowledge. Then we're going to add streams as an alternative (you saw we had a stream and a relation), and then we're just going to have ways of converting between streams and relations. So what we're going to say is that when we have put one of those windows on the stream, like that five-minute window, that turns it into a relation. Now that five-minute window is just going to give us a table of tuples for those five minutes, and then we can operate on that like a relation. And then we're going to have operators, just a couple of them, that will take relations and turn them into streams. So all we have to do is define the semantics of these two conversions, and the relational part we can reuse. That's how we decided to go about it, and I think it worked reasonably well. So let's go back to our example now. It's the same example exactly; our query is expressed exactly the same way, and now this Views here, the five-minute range, just turns into a relation by our definition. So the result in this case is a relation, because the whole thing is having a relational interpretation; the only thing we had to do is turn that window into a table. And so the result is a relation. That relation will be updated when time passes, because that table will change when time passes; it'll be updated when new page views occur; and it will be updated when ages change. So the answer to this is: if someone views a page and their age changes
while that page view is still in the window, it will actually update the average age for that URL. Okay, so that was the interpretation. Now let's say that instead of having a relation that changes, we'd actually like a stream as the output. If we want a stream as the output, it's pretty simple: we just add this operator here called Stream, right here, and the meaning of this operator is that whenever something in the relation changes, it will just emit the change as an element on the stream. It's a little more subtle than that, which I'm not going to get into today, but that's the basic idea. Okay, the third and harder part: what if we want to use the user's age at the time of the page view in the answer? This is not beautiful, but doable. Just briefly, how do we do that? We actually take the FROM clause here and we generate a subquery that makes a stream out of the pairs of the user looking at the URL and their age at that time. So we sort of generate an intermediate stream that matches up those ages, and then we take the window on that stream instead of on the original stream; that then turns into this relation, and then we do the rest. If you don't understand that, some people do, I think, but that's how we solved that particular problem. Okay, so a summary of our solution. We defined a precise, what we called abstract, semantics, which said: we'll take any semantics for relation-to-relation, like relational algebra, and then we just add on the stream-to-relation and relation-to-stream operators that I showed you. Then we had a concrete implementation, which was based on the SQL language for our relation-to-relation part, the windows I showed you for going from streams to relations, and then stream operators to go the other way. We also added to the language a sampling construct, because in data streams it's quite common to actually want to take a sample of the stream and operate on that instead. So that was the language called CQL.
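The pipeline just described, a window turning a stream into a relation, ordinary relational processing on the result, and a Stream operator emitting changes back out, can be sketched in a few lines of toy Python. All data and function names here are invented for illustration; this is a sketch of the idea under simplifying assumptions, not the STREAM system's implementation.

```python
# Toy sketch of the CQL pipeline (illustrative only):
# stream --window--> relation --SQL-like op--> relation --Istream--> stream.

page_views = [  # stream of (timestamp_seconds, user, url)
    (10, "amy", "a.com"),
    (130, "ben", "a.com"),
    (150, "cal", "b.com"),
]
users = {"amy": 20, "ben": 40, "cal": 30}  # user -> age (an ordinary relation)

def window(stream, now, size):
    """Stream-to-relation: the elements from the last `size` seconds."""
    return {(u, url) for (ts, u, url) in stream if now - size < ts <= now}

def avg_age_by_url(rel):
    """Relation-to-relation: join with users, group by URL, average the ages."""
    groups = {}
    for user, url in rel:
        groups.setdefault(url, []).append(users[user])
    return {(url, sum(ages) / len(ages)) for url, ages in groups.items()}

def istream(old, new):
    """Relation-to-stream: emit what entered the relation since the last step."""
    return new - old

# Evaluate the continuous query at two instants, 60 seconds apart.
r1 = avg_age_by_url(window(page_views, now=120, size=300))
r2 = avg_age_by_url(window(page_views, now=180, size=300))
print(istream(r1, r2))  # the changed/new answer tuples flow downstream
```

The real language also has Dstream and Rstream companions to Istream, and an implementation advances windows incrementally rather than recomputing them from scratch; the sketch only shows the stream-relation round trip that the abstract semantics is built on.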
The other thing that we worked on, which I actually found extremely interesting, was query equivalences in this context, and some other optimizations. What's interesting, and this sort of comes back to the thing I described about whether you have to save an entire stream or you can just operate on small parts of it, is that we could take queries that described operating on the entire stream and automatically translate them to equivalent queries in the language that just looked at the immediate elements of the stream. So that was pretty interesting; again, no time to go into that here, but that was a fun part of the query language work that we did at the time. Yeah, what I meant by continuous is just that the query sits there and runs forever until you turn it off. You still do it when... yes. I'm going to mention time briefly, but we assume nothing happens unless a stream element arrives; well, no, or the clock ticks. So those are discrete things: the clock ticking or a stream element arriving are the two things that can advance time. Yeah, it's a great question. So we had a guiding principle for the project, or for the query language development, which is that easy queries should be easy to write, and we think we succeeded with that, and that simple queries should do what you expect. Interestingly, it doesn't say anything about hard or complex queries. It doesn't say that hard ones should be easy to write, and it doesn't say that complex ones should do what you expect. We definitely defined a real semantics for the language, but it was not necessarily that easy to use when you got to the more difficult and complex queries. To your question about time and ordering: time was actually a big deal. We had timestamps on the data stream; would they always come in order? If there was a big lag, could you be sure that you weren't going to get one with an earlier timestamp? All of those were difficult issues. What we
did was kind of literally brush them under the rug, by saying we had a lower layer that delivered well-behaved streams: they would come in timestamp order, and if there was a big lag you knew you weren't going to get a really old one, so you could know that you were beyond a certain point. You mean when we did an output, what timestamp would we put on it? We would put on the time at which we computed that aggregate. So that would typically be the last time. Yeah, we had a notion of a current time; in addition to the stream elements arriving, we would have the notion of now. We could talk about time forever, by the way, and I'd be happy to talk about it later. We will talk about it, yes, but it was a huge deal, the time business. Yes, absolutely; part of the reason I added this was because it was a huge deal, but I wasn't going to go into it in the talk. And some of that was just that we had to make a decision. I'm not even sure we made the right one, but you just see these alternatives. Okay, so why is it a favorite? I liked it partly because designing a new query language is underrated, I think, in the database community, and it's actually very difficult to publish new languages. In fact, as a little aside: in the previous project I talked about, we also developed a query language. It was called Lorel; it was a query language for semi-structured data and then for XML. We had a nightmare getting that published. We were trying and trying to publish it; nobody wanted to publish the query language. Finally, one of my co-authors, Serge Abiteboul, was invited to be the editor of a brand-new journal called the Journal of Digital Libraries, back in '97. That journal, by the way, only had one issue ever, volume one number one, and then it folded. But he was invited to contribute a paper to volume one number one, and we had this paper we couldn't publish on that query language, so we put it in there, and it's still there: volume one,
number one of the Journal of Digital Libraries. For years that was one of the top 100 cited papers in computer science, in the whole field of computer science. So that just tells you that just because you can't get something published doesn't mean it's not going to be important in the end. So remember that. Anyway, it's very difficult to publish query languages; this one actually got attention somehow, so I was happy about that. Second, I think database people sometimes tend to ignore the need for precise semantics and just go forward with implementation. Even in the early days of SQL, I think there were some examples of SQL queries that would do different things on different systems, because nobody had really thought about it very much, and I think people here appreciated the challenges. I would say the favorite part is not the name in this case; CQL was never a real catchy name. Okay, that's number two. Yes? Yeah, I don't remember; Oracle did take the language and do something with it, and I thought they were simplifying it, but maybe they went off in a different direction. So I still think we satisfied the requirement that simple ones do what you expect; it's the other ones, I agree, yep. Right, so there has been no standardization, and you still have different systems doing different things, and it would be nice if there was standardization, but nobody's really taken it up for some reason. It's partly because it was worked on by very disparate communities, I think. I worked on trigger languages, one of the first things I worked on, and there it was a similar situation: it was very complicated for complicated cases, and the standard ended up just simplifying to the point where you could only write the things that would do what you expect. That hasn't been done yet in data streams; maybe it's yet to happen, I don't know. Yes, right, and that's another way you can do it, and if it's imperative then you sort of know what it's going to do, sort of, if
you can read the code, right. So I don't know if there's an answer. I mean, I think the one answer is that if you simplify it enough, then everything will be understandable, but then maybe you can't do what you want. In the trigger languages, the triggers are very simple, so for simple cases that's great, but then if it's complicated you have to write it in code, so it's hard to know where to draw the line. Okay, third and last is uncertain lineage databases, and again I will set the context. The context here is a project called Trio, and Trio was a project where we were building a system that had three things on equal footing: data, uncertainty, and lineage. Those were the three things we wanted in the project, and that was motivated by applications we were thinking about. Scientific applications have uncertain data and they often need lineage (I'm going to tell you about lineage); entity resolution also. So we somehow had a number of applications that needed these things, and that was the project. The people who worked on this one were Omar Benjelloun, Anish Das Sarma, who I'm told is an IIT Bombay graduate, and Alon Halevy. Okay, so what was the problem? Well, we're building another kind of database management system once again, this time for uncertain data, and we needed a data model. Again, I think it's sort of obvious that this is a problem; you have to have a data model of some type. Why is it important? I'm going to make a sweeping statement: thinking carefully about what your data model is, is fundamental to just about anything that you do in databases, actually to any research involving data. And I want to say right off that "data model" is confusing now; "model" is probably not the right word, because people use "model" to mean something quite different, especially with machine learning becoming more popular. What I really mean is representation: how do you represent your data? We talked about the object model, the directed labeled graph; that's a representation. XML and JSON are representations,
and that's what I'm thinking of. Tables are a representation; key-value pairs are a representation. Okay, so why is it hard? Sweeping statement: there was a strong tension, when we were developing this data representation, between having something that you could understand and something that was expressive enough, and I'm going to show this exactly. So I'm going to go into an example now, which is a database for solving crimes. This crime-solving database is going to have two tables: one is witnesses seeing a car at the scene of a crime, and the other is people who drive a particular car. Very simple, obviously, but we've got a crime we're trying to solve, and we're going to generate a list of suspects, which are people who drive a car that has been seen at the crime. The little twist is that this data is uncertain. That means that when we have some tuple in this table, the witness may have seen the car, or they have seen a car and they're not quite sure what type it is; I'll be more concrete about this in a moment. A person might drive a car, but we might not be sure; criminals might not always register their cars, for example. And here I assume people know relational algebra, but in case you don't, this is just a simple join of the two tables, saying we're going to pick out the persons who drive a car that matches at least one tuple in the other table. Okay, that's simple stuff. All right, a little bit of background on uncertain databases. Everybody who works in this area agrees abstractly on a definition for an uncertain database: it's a database that captures the set of all possible certain databases. Since we're using the relational model, an uncertain database is a set of possible relational databases, and those are called the possible instances for the database. This is standard in the field. So in this case we might have Kathy, who saw either a Honda or a Mazda, so that's representing two possible instances. Amy might have seen an Acura, or maybe she didn't. A Honda is driven
by Billy or Frank; we're not sure which one. The concrete representation that I'm going to use, and I'll show it in table format in a moment, is two constructs: alternative values, and question marks, which say maybe presence, maybe absence, we're not sure. In the work we also had confidence values, or probabilities, but I'm not going to use those today; it's just going to be presence or absence, and alternative values. So this particular statement, Kathy saw either a Honda or a Mazda, now becomes an actual table: we list it as Kathy-Honda or Kathy-Mazda; she saw one or the other. And Amy might have seen an Acura, so we have this tuple, Amy-Acura, with a question mark. So this table has four possible instances: it's got the two possibilities for the first tuple, and then the presence or absence for the second, and those are independent, so you multiply them. And here's one with two possible instances: there's a Honda that's driven by either Billy or Frank. So that's the representation that we picked. Okay, now here comes why it is hard. Fundamentally, it's hard because this representation, this model I showed you, is not closed. The definition of closed says: if I have data in my representation and I run a query over that data, I want the answer to also be representable. That's what closure means. That is not true in this model: I can run a query on this data, and I cannot represent the answer to the query in this representation. That's bad news. So, same data, except I've now added a couple more tuples here: I've added Jimmy, who drives a Toyota or a Mazda, and Hank, who drives a Honda. No question marks there, so there are always going to be three tuples; it's just a matter of which values are picked. So this has four possible instances, this has four possible instances, 16 possibilities together, and that's the database. By the way, anybody notice anything about the people in the database, gender-related? The guys are the criminals and the women are the witnesses, right, which probably somewhat
reflects reality, but anyway. All right, so we want to generate our suspects, who are all going to be guys, by the way, and we do that again with this query. We're just going to join the two tables, a little more complicated now because it's uncertain data, and someone is going to be a suspect if they might drive a car that might have been seen at the crime. So these are possible suspects. We join the two tables and here's the answer we get: Billy or Frank might be suspects, maybe, maybe not; Jimmy might be a suspect; and Hank might be a suspect. That's the answer when we join the two tables, but it does not correctly capture the possible instances in the answer, and this is where I'm going to ask you to tell me why, and to see who's still with me here. Anybody know why this answer doesn't correctly capture the answer to the query? No; I could elaborate, but I'll just say no, it's not because of the non-determinism. I'm just going to tell you that I was at ACM India a couple of days ago in Trivandrum, with a big audience largely of local students, and one really bright undergraduate got this. I had like 300 people in the audience, and I just sat there and waited, waited until somebody got it, and this one kid finally went like that; he went out on a limb, but he got it right. So, all right, now you've been absolutely challenged. Yes, I think you're getting there. Yes, yep, tell me more. I think you're getting the right idea. Anybody know why this is not the answer? Think about the possible instances in this answer. The pressure's on. You mean there'd be yet another tuple? No. I'll give you a hint; anybody want a hint? How many possible instances are here? There are three for this, right, three possibilities, and two for this and two for that, so we're talking about three times two times two possible instances. You were getting it. But is every possible instance really a possible instance?
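The inconsistency can be checked mechanically with a small sketch that enumerates possible worlds. The data here mirrors the example on the slides, but the code is purely illustrative Python, not the Trio system: it evaluates the join in every possible world of the two base tables and collects the distinct answers.

```python
from itertools import product

# Base tables in the alternatives / question-mark representation.
# Saw(witness, car): Kathy saw a Honda or a Mazda; Amy maybe saw an Acura ('?').
saw_choices = [
    [("Kathy", "Honda"), ("Kathy", "Mazda")],
    [("Amy", "Acura"), None],              # None = the tuple is absent
]
# Drives(person, car): Billy or Frank drives the Honda; Jimmy drives a Toyota
# or a Mazda; Hank drives a Honda (no question marks).
drives_choices = [
    [("Billy", "Honda"), ("Frank", "Honda")],
    [("Jimmy", "Toyota"), ("Jimmy", "Mazda")],
    [("Hank", "Honda")],
]

def possible_worlds(choices):
    """Every way of independently picking one alternative per tuple."""
    for combo in product(*choices):
        yield [t for t in combo if t is not None]

# Evaluate the join (suspects = people whose car was seen) in every world.
suspect_sets = set()
for saw in possible_worlds(saw_choices):
    for drives in possible_worlds(drives_choices):
        cars_seen = {car for (_w, car) in saw}
        suspect_sets.add(frozenset(p for (p, car) in drives if car in cars_seen))

# Only 4 distinct answers are really possible, not the 3 x 2 x 2 = 12
# combinations that a joined table with independent choices would suggest,
# and Jimmy and Hank never appear together.
print(sorted(sorted(s) for s in suspect_sets))
# [[], ['Billy', 'Hank'], ['Frank', 'Hank'], ['Jimmy']]
assert not any({"Jimmy", "Hank"} <= s for s in suspect_sets)
```

In the lineage-based solution described next, the joined table keeps its compact form, and the pointers back to the base alternatives rule out exactly the inconsistent combinations, leaving these same four instances.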
Let's suppose Jimmy is in there. What does it mean if Jimmy is in there? What can we say about the world if Jimmy is in our possible result? Jimmy is in there because someone saw a Mazda or a Toyota, right? So if Jimmy is in there, somebody saw a Mazda or a Toyota. Who? Kathy. And if she saw the Mazda, then she didn't see a Honda. So Kathy didn't see a Honda; Jimmy being in there tells us that Kathy did not see a Honda. And what do we know if Kathy didn't see a Honda? We know that Billy and Frank can't be in there, and neither can Hank, right? And so, in fact, these are not the correct possible instances, because this representation would allow both Jimmy and Hank to be in there. I think some of you were getting it, just didn't quite articulate it. Jimmy and Hank can't be in there at the same time; that would not be a correct answer, because it would require two conflicting things to have happened. Make sense? And actually, we proved that you cannot capture the answer to this query in this data representation; it's actually impossible to do that. Okay, how much time do we have, by the way? Like zero, right? Okay, I'm very close to done, so why don't we just go on, and then we can come back to this. Sorry, why hadn't this been solved already? Very quickly: most other researchers in this area were working from a very theoretical point of view, so they cared a lot about expressiveness but not about understandability, so they had quite complex models. So what was our solution? Our solution actually turned out to be to add lineage. Lineage says where data came from: I have data in my answer; where did it come from? This is the representation here; you can think of it as pointers. We added to the result pointers that say, for example, that the first alternative of tuple 31 came from the first alternative of tuple 11 and the first alternative of tuple 21. I tried making this slide with arrows,
and it got too messy, but you can think of these as arrows just pointing to the data that each result came from. Then the interpretation of the result says: I only create the instances that are consistent; I don't actually pick inconsistent combinations, like a tuple's presence and absence at the same time. And we proved that that correctly captures the answer. In fact, with the lineage, a ULDB, an uncertainty-lineage database, is closed: you can always represent the answer. And in fact it's complete, which means you can represent any uncertain database, which is a strong property to have. Okay, why is it a favorite, and then we're done. It's a favorite because we conceived this project before we conceived the data model, and as I said, it was motivated by applications that needed uncertainty and lineage. We never imagined that this concept of lineage would be the key to representing uncertainty; never imagined that that would happen. So was it implicit somehow in the applications? Was it an unconscious hunch? Was it divine intervention, or pure luck? I don't know, but it was very nice how everything came together, and that's why it's a favorite. Definitely not the name; our names seem to have gotten worse and worse over time. So, is there anything in common among the favorites? Well, there's expressiveness, there's simplicity, there's efficiency, and these things work against each other. I think for data guides we had expressiveness and simplicity, but they weren't all that efficient. For CQL, I'd say expressive and efficient, but not necessarily simple. I think ULDBs actually really do get all of that. And I just want to say, I think a lot of the work I've done, in retrospect, has been trying to balance these goals: expressiveness, simplicity, and efficiency. And now I'm done, so thank you. More questions? Do we want to go back to the question? Yeah. So everything could be expressed; we could take the model to be: list the possible
worlds. The problem is that that can be exponential. My favorite example is: let's say you have just 10 tuples, each of which could be present or absent, and you want the aggregate of those, the sum of those 10 tuples. That's 2 to the 10th possible answers. So it's better not to list them, although for that particular example you actually do need to list them; but in general, listing the possible worlds, the possible instances, is not the way you literally want to represent it. And that's in fact sort of what the whole thing is about: how do you have a higher-level, more compact representation of these possible instances? That's the crux of the problem, though I didn't put it that way. Other questions? Yeah, how do we prove completeness? Well, we do have negation, yes, but what completeness means is that we can capture any uncertain database. I'm going to tell you that it's not a beautiful capturing, but you can do it. What I'm saying is: you give me any set of possible instances (now it's not described by the query; it's described by the possible instances), and I can create a representation of that set of possible instances in my model. I can do it by kind of encoding them and using lineage to point to them, so it's a complicated construction, actually. What do you mean by non-deterministic? Right, so no, we can capture all of that. Sometimes our lineage, we have a notion of negative lineage, like "this doesn't exist"; our lineage actually ends up being not just pointers but Boolean formulas. So there are a lot of things I didn't tell you; there's no way I can cover all of that in a third of a one-hour talk, but these are good questions. One of the things I didn't tell you is that to get the full result you have to have Boolean formulas for the lineage. Yep. Yes... no, it's deterministic, actually. Yeah, it's deterministic, it's fast, and it's
approximate, and I'm trying to remember (it was a long time ago), but we might have even been able to dial how approximate it was; I wouldn't bet my life on it, though. Anything else? Where is database research going to go? Well, it's going in a variety of directions. I think one of the most important right now is the combination of databases and machine learning, and I think it's important both technically and politically, because big data is trying to be grabbed by different communities, and I think it's really important for those to get together. So I think that, personally (not that I'm doing it myself), work that is trying to marry those two areas and get the people in those fields together is most important. The other is obviously scalability: these systems that are highly scalable aren't always that easy to use; they're much harder, I think, than people like to advertise. So that's very important as well. Those are examples of what I think is important, but I'm not a visionary, actually. Anything else? Okay, thank you.