I believe the first question to ask is: what do logic and databases have to do with each other? And I think the best answer I can give is to point to my colleague Victor Vianu, who, in a 1997 paper, wrote that, in a very real sense, database theory is applied logic. So what I want to do in these two lectures, today and tomorrow, is to give you a sense of why this is so; there is no better justification for a claim like this than seeing it in action. Over the past 40 years there has been a very close and very productive interaction between logic and databases. Logic provides both a conceptual framework and a collection of technical tools for the tasks we have to carry out in data management. For me, the interaction between logic and databases has two faces. On the one hand, it is a major application of logic in computer science; on the other hand, as we will see, databases gave back to logic new questions and, in the end, a much better understanding of logic itself. So here is what I would like to do. First, I want to talk about the interaction between logic and databases, going back to the development of the technology 40 years ago. Then I want to look at a paradigmatic and very important application of logic to databases, Codd's theorem, and examine its strengths and its weaknesses. After that there will be a theorem in the opposite direction, about the limitations of first-order logic as a database query language. And from there, having seen these limitations of logic, we will see how they led to more powerful, logic-based query languages.
Before we go into the details, it is useful to take a more systematic look at this synergy between logic and databases: what each of the two sides, the logic side and the database side, brings to the interaction. And how did this all come about? It came, in large part, from one man, Ted Codd, and in some sense the history of relational databases is the history of his ideas. What Codd did was to start a scientific revolution about 40 years ago. At the IBM San Jose Lab, which is now called the IBM Almaden Research Center, he did two things. One was to introduce the relational data model, and at the same time he introduced two languages for asking queries against databases: relational calculus and relational algebra. That was the scientific revolution. Very quickly, within the next decade, we had the technological revolution, with the development of System R at IBM and Ingres at Berkeley, and very soon the Oracle Corporation getting into the picture with its product, and IBM following with DB2. And the rest is history: today, relational database technology is a $17-18 billion a year industry. So let me briefly remind you of what Ted Codd did. He formalized the relational data model by saying that relations, namely subsets of Cartesian products, are a good formal object for representing data. The idea he had in mind, of course, was that we can think of a table as a way of storing records, and a table formally is really a subset of a Cartesian product of sets, that is to say, a relation, in the sense we have been seeing here today. He then introduced the notions of relational schema and relational database schema, and this is nothing else but what was called the vocabulary in the first talks this morning. In other words, we have relation symbols.
They have specified arities, but the only difference here is that we give names to the various positions of the relation symbol, and we call them attributes. So we can think of a relation schema as a relation symbol with a fixed arity together with a set of attributes. It is a template, a blueprint, that represents relations of that particular arity, but with names for the attributes. An instance of a relation schema is then simply a relation that conforms to the schema. In an actual database management system you also have to have matching data types, but I will suppress this for now. A relational database schema is a collection of such relation schemas, and a database, or a database instance, is simply a collection of relations that conform to the schemas we have. So you may ask now: what is the difference between what we have been seeing all morning and databases? I think I can summarize the difference in this slide. A relational structure, as we saw before, is an object that has a universe, which we have made explicit, and a bunch of relations. A database is basically a relational structure in which the universe has not been made explicit; we only have the relations. That is an important difference which, as we will very soon see, is going to cause us some problems. But Codd had the idea that these are dynamic objects: new elements may come into the picture and populate the relations, so the universe may change. So he only made the relations explicit, not the universe. That is, in some sense, the only difference between relational structures and databases as we saw them today. So, as I said a few slides ago, Codd introduced two languages for asking queries against databases. The first is a procedural language. Procedural here means that we specify a sequence of operations, by the execution of which we will get the answer to the question we ask against the database.
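To make the schema/instance distinction concrete, here is a minimal Python sketch, with a schema as a tuple of attribute names and an instance as a set of tuples; the function name and the banking data are illustrative, not from the talk:

```python
def conforms(instance, schema):
    """An instance conforms to a relation schema if every tuple
    has the schema's arity (data types are suppressed, as in the talk)."""
    return all(len(t) == len(schema) for t in instance)

account_schema = ("customer", "branch", "number", "balance")
account_instance = {("Alice", "Downtown", 101, 5000),
                    ("Bob", "Uptown", 102, 9000)}
print(conforms(account_instance, account_schema))  # True
```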
The other one was declarative: we use some high-level language, in this case first-order logic, to specify what we want to retrieve, as opposed to how to retrieve it. And Codd proved a theorem that, in some sense, relational algebra and relational calculus have the same expressive power. This is not exactly true, and I want to explain in what sense it is not exactly true and in what sense it is true. So that is what I would like to formalize. It is really textbook material, but I want to make it precise. Let me remind you what Codd did by way of relational algebra. Basically, relational algebra is the set of expressions that you obtain by starting with the relations in your schema, your vocabulary, whatever you want to call it, and closing them under five operations. The first three operations are perfectly general operations from discrete mathematics: union, difference (that is, set-theoretic difference, where you just insist that the arguments are relations of the same arity), and Cartesian product. Then you have two special operations, special because they are meaningful for relations. The first is projection and the second is selection. So what is projection? Intuitively, projection is the operation by which you suppress, or hide, some of the columns of your table. For instance, if we have a table with information about banking accounts, and we want to suppress the information about the account number and the balance, keeping only the customer name and the branch name, then projection gives us exactly that. Formally speaking, the syntax of projection is pi (for projection) subscripted with i1, ..., im, where i1, ..., im are distinct integers from 1 up to k, the arity of the relation. The semantics is that it gives you back the set of all m-tuples for which there is a completion to a k-tuple coming from your original relation. So projection not only suppresses some columns but also allows changing the order of the columns.
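Treating relations as Python sets of tuples, the syntax and semantics just described can be sketched as follows; the function name and the banking data are illustrative, not from the talk:

```python
def project(relation, positions):
    """Projection pi_{i1,...,im}: keep only the listed columns (1-based),
    possibly reordering them; duplicates collapse because we use a set."""
    return {tuple(t[i - 1] for i in positions) for t in relation}

accounts = {("Alice", "Downtown", 101, 5000),
            ("Bob", "Uptown", 102, 9000)}
# Hide account number and balance, keeping customer and branch:
print(project(accounts, [1, 2]))  # {('Alice', 'Downtown'), ('Bob', 'Uptown')}
```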
So this is an operation on the columns of the table. Codd also had the idea, and rightly so, that we need another operation that filters rows, throws out some rows. That is the selection operator. Selection is again a family of operations, one for every condition. The condition is a Boolean test applied to every row: if the row passes the test, we keep it in the result; otherwise we throw it out. The question, then, is what is allowed in the condition. In a language like SQL the conditions are very elaborate, but in Codd's case the conditions were very simple. He allowed equality tests (equal, not equal), and, if you have a total order on the domain of values of some attributes, arithmetic comparisons (bigger than, less than or equal, and so on), and then you take the Boolean closure of these expressions. So you can talk about the people whose balance in the checking account is more than 10,000, or who live in this locality and whose balance is less than 9,000, and so on and so forth. By the way, please feel free to interrupt me as we go along; this is a very quiet audience. So now here is the formal syntax: a relational algebra expression is a string obtained from the basic relations by applying these operations. That is the first language that Codd gave. Notice that each of these operations is very simple; the strength comes when you combine them together. Codd then went on, in his second paper, to give some non-trivial examples of new operations that you can derive from these basic operations, and perhaps the most basic and important derived operation is the natural join. Is everyone here familiar with the natural join? Yes? Who is not? All right, so let us quickly explain what the natural join is. Here is a motivating example. Let us say that a university has a registrar's database with information about the faculty who teach a course in a particular term, and information about enrollment: student, course and term.
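A selection operator in the same set-of-tuples style, with the condition as a Boolean test applied to every row; the names and sample data are illustrative:

```python
def select(relation, condition):
    """Selection sigma_theta: keep exactly the rows that pass the test."""
    return {t for t in relation if condition(t)}

accounts = {("Alice", "checking", 12000), ("Bob", "checking", 8000)}
# Rows whose account type is "checking" and whose balance exceeds 10,000:
rich = select(accounts, lambda t: t[1] == "checking" and t[2] > 10000)
print(rich)  # {('Alice', 'checking', 12000)}
```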
Because that is what happens: the department announces the teaching schedule, and the students enroll in courses. Then, of course, at the beginning of the term we want a relation that has the names of the students enrolled in each course together with the instructor. How is this done? We want to obtain taught-by, with student, course, term and faculty name, and it turns out that this is the natural join, written with this bowtie symbol, of these two relations. So, formally, the definition of the natural join is the following. Suppose you have two relation schemas, R and S, and suppose that they have some attributes in common. Remember, the positions are named, right? So suppose they share some names, as we saw before with the two relations sharing attribute names. Then the natural join is a projection of a selection of the Cartesian product. Here is how it is carried out. You start with the Cartesian product of the two relations, and you keep the tuples in the Cartesian product that have the property that, for every common attribute, the value in the first relation is the same as the value in the second relation. Now you have a subset of the Cartesian product in which there is a lot of duplication; for each duplicated column you keep one of the two, and that is the natural join. So indeed, if you do this in the previous example, you get taught-by from teaches and enrolls. There is of course a very naive algorithm that basically creates the Cartesian product and, for every tuple, checks whether or not the common attributes match. Notice that in the case where the two relations have no attributes in common, the natural join becomes the Cartesian product, right? So in principle it is as expensive to compute as the Cartesian product. Here is a second example, which also goes back to Codd, and this is a more complicated operation called the quotient, or the division. So what is the quotient, or the division?
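The definition just given, a projection of a selection of the Cartesian product with the duplicated columns merged, can be sketched like this; the attribute lists and the registrar data are illustrative, not from the talk's slides:

```python
from itertools import product as cartesian

def natural_join(r, r_attrs, s, s_attrs):
    """Natural join: Cartesian product, keep tuples agreeing on all shared
    attribute names, then keep one copy of each duplicated column."""
    common = [a for a in r_attrs if a in s_attrs]
    out_attrs = list(r_attrs) + [a for a in s_attrs if a not in r_attrs]
    result = set()
    for t, u in cartesian(r, s):
        if all(t[r_attrs.index(a)] == u[s_attrs.index(a)] for a in common):
            row = dict(zip(r_attrs, t))
            row.update(zip(s_attrs, u))
            result.add(tuple(row[a] for a in out_attrs))
    return out_attrs, result

teaches = {("Vianu", "DB", "Fall")}
enrolls = {("Ann", "DB", "Fall"), ("Bob", "ML", "Fall")}
attrs, taught_by = natural_join(enrolls, ["student", "course", "term"],
                                teaches, ["faculty", "course", "term"])
print(attrs, taught_by)  # ['student', 'course', 'term', 'faculty'] {('Ann', 'DB', 'Fall', 'Vianu')}
```

With no shared attribute names, `common` is empty and every pair passes the test, so the result is exactly the Cartesian product, as noted above.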
You have two relations r and s, but now you assume that the arity of r is bigger than the arity of s. The quotient is then a relation whose arity is the difference of the arities, arity(r) minus arity(s), and it consists of all the tuples of that length such that, no matter what tuple you take from s, when you append it you end up with a tuple of r. This sounds very strange, but it does something very useful. Let us look at an example to appreciate it. Victor Vianu happens to be a great instructor at UC San Diego, and you want to find the students who have taken every course that Victor Vianu has taught. How would you compute this if you only had relational algebra? Well, it is very simple. From teaches, with faculty name and course, we can obtain the courses taught by Victor Vianu, right? That is a projection of a selection of teaches, where the selection condition is faculty name equal to Victor Vianu. Now we have a table with only one column that gives the courses of Victor Vianu. We want the students who have taken every course that Victor Vianu has taught, so, if you follow the definition, this is nothing else but the quotient of enrolls divided by the courses taught by Victor Vianu. Now, when you look at the definition of the quotient, it is not obvious right away that this is expressible in relational algebra. Yet it is, and that is a non-trivial exercise for undergraduate students. Let me illustrate how this is done by doing it concretely for a relation R of arity 5 and a relation S of arity 2, so that the quotient has arity 3. The idea here is that we have to use the difference operator.
Remember, in the example of the natural join we used projection, selection and the Cartesian product (and you can imagine situations where we use the union); here is a way to use the difference. It goes like this. The quotient is basically a subset of the projection pi_{1,2,3} of R, because it consists of all the triples such that, no matter what pair you append from S, you end up in R. So what we have to do, intuitively speaking, is take the projection pi_{1,2,3} of R and throw away the tuples in the projection that do not make it. In other words, the projection gives us our candidates, and we want to throw away all the ones that do not make it, all the failed candidates, so to speak. Now consider this relational algebra expression: the Cartesian product of pi_{1,2,3}(R) with S, take away R. These are really the combinations that do not make it into R. So to get the quotient, what we need to do is project this on the first three coordinates and take the difference again. So we have a nested use of the difference, and this way we get the quotient as a relational algebra expression. As I said, this goes back to Codd; he illustrated that you can do interesting things with these operations. Now we have a basic language-design question. Codd came up with these five operations, and I showed you how you can express other interesting operations with them. Do you need all of these operations or not? In other words, was there any redundancy in his language? Codd was a very precise man (I never had the honor to meet him, but that is what I hear from the people who knew him), and he was very careful: you can actually prove that none of the five operators can be expressed in terms of the other four. So this is the theorem: each of the five relational algebra operations is independent of the other four. You cannot find an algebra expression that involves four of them and gives you the fifth one. And how do you prove something like this?
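The nested-difference construction can be written out concretely; here is a Python sketch with relations as sets of tuples, shown for small arities (the helper name and the enrollment data are mine, not from the talk):

```python
def quotient(r, s, k):
    """R / S for arity(R) = k + arity(S), via Codd's nested difference:
    candidates = pi_{1..k}(R); failed = pi_{1..k}((candidates x S) - R);
    result = candidates - failed."""
    candidates = {t[:k] for t in r}
    failed = {t for t in candidates if any(t + u not in r for u in s)}
    return candidates - failed

# Students who took every course Vianu taught (illustrative data):
enrolls = {("Ann", "DB"), ("Ann", "Logic"), ("Bob", "DB")}
vianu_courses = {("DB",), ("Logic",)}
print(quotient(enrolls, vianu_courses, 1))  # {('Ann',)}
```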
Well, this is like a lower bound in expressive power. The idea is to find, for each operation, a property that the operation has but that no expression built from the other four has. So let me ask: how would you do it for the Cartesian product? It is very easy. What does the Cartesian product do to the arities? It increases the arities. The other four operations do not: projection lowers the arity, and union, difference and selection keep it the same. So you find one property that the four have and the fifth does not have; I told you what to do for the Cartesian product, it increases the arities, while projection decreases them. What would you do for the difference? Well, this goes back to some of the discussions we had here in the morning. The other four operations are monotone: if you put more into the arguments, you do not lose any tuples. The difference, in general, has the property that if you put more into its second argument, in r minus s, you may decrease the outcome. So this monotonicity property is what tells them apart. It is trickier to do it for the union; that is an interesting exercise. The bottom line is that in the five operations Codd chose there is no redundancy, there is no fat; it is a very lean language. OK, that is all I want to say about the algebra itself. [Question from the audience.] What's that? Yes, yes, but it is not as easy to see; that is a non-trivial exercise. I mean, yes. No, no, it is just combinatorial; it is finding the right combinatorial property. I am sure there is a proof by logic, but this is a straight combinatorial argument. It is true because the theorem is true; it is one of the five basic operations, right? Now, you can see here the direct influence on the design and the semantics of SQL. SQL's main construct is SELECT-FROM-WHERE, and, unfortunately for the terminology, SELECT corresponds to projection, WHERE corresponds to selection, and FROM to the Cartesian product. Right?
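That correspondence can be checked directly in any SQL engine; here is a small sketch using Python's built-in sqlite3 module, with the registrar tables as illustrative data (the schema and values are mine, not from the talk):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE teaches(faculty TEXT, course TEXT, term TEXT)")
con.execute("CREATE TABLE enrolls(student TEXT, course TEXT, term TEXT)")
con.executemany("INSERT INTO teaches VALUES (?,?,?)", [("Vianu", "DB", "Fall")])
con.executemany("INSERT INTO enrolls VALUES (?,?,?)",
                [("Ann", "DB", "Fall"), ("Bob", "ML", "Fall")])
rows = con.execute("""
    SELECT e.student, e.course, e.term, t.faculty      -- projection
    FROM enrolls e, teaches t                          -- Cartesian product
    WHERE e.course = t.course AND e.term = t.term      -- selection
""").fetchall()
print(rows)  # [('Ann', 'DB', 'Fall', 'Vianu')]
```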
So, to express the projection of a selection of a Cartesian product in SQL, you would write: SELECT these attributes FROM these relations (that is the Cartesian product) WHERE this condition is satisfied. So there is a direct influence of relational algebra on the design of SQL. Now, in addition, Codd introduced a calculus, and relational calculus is a declarative language which is entirely based on first-order logic. There are two versions of the calculus. Codd actually introduced a tuple calculus, in which the first-order variables range over tuples of a fixed arity k; this is like making arrays first-class citizens, as opposed to the elements of arrays. Here in logic we use more the domain calculus, where the variables range over the elements of the tuples. We will focus on the domain calculus; there is an easy translation between the two. So I want to discuss a little bit now Codd's technical result, which was that the calculus and the algebra have the same expressive power. The calculus: that is the term that he used, but it really stands for first-order logic. The syntax is as we saw it early in the morning, when Suprati gave the presentation; this is just the standard syntax of first-order predicate logic, and I am not going to repeat it, and of course we keep the same semantics. Codd's idea was that you can write formulas of first-order logic with k free variables, evaluate them on a database, and get back the set of all k-tuples over your database that satisfy the formula. This goes back to the semantics of first-order logic that we saw in the morning. As an example, suppose you have an edge relation in a graph, say the non-stop flight connections between cities. Then the formula "there is a z such that E(x, z) and E(z, y)" gives you the pairs of nodes that are connected by a path of length 2, that is, the pairs of destinations such that you can reach the one from the other with one stop. And as an illustration of this, we saw how hard we had to
work to get the quotient: it was a nested application of the difference, together with projections, selections and Cartesian products. We did all this work, really, because the quotient is easily expressible using the universal quantifier. This is the direct translation of the definition of the quotient into first-order logic: in this case, the set of all triples (x1, x2, x3) such that, for every pair, if (x4, x5) belongs to S, then the quintuple belongs to R. So we get the translation immediately, and it is much simpler than the relational algebra expression. This makes a case for why it is superior to have a nice declarative language, as opposed to a procedural one. So Codd's theorem informally says that the algebra and the calculus have the same expressive power, meaning that whatever query you can express in the one, you can express in the other. As I said, this is not entirely accurate, and I want to explain the sense in which it is not accurate, then formulate a rigorous, correct version of this result and sketch the proof. Going from algebra to calculus is very straightforward; it is a translation, in effect an interpreter from algebra to calculus. So for every relational algebra expression there is an equivalent relational calculus expression, and there is really only one way to do this: by induction on the algebra expression. The first three parts are very straightforward. The union, of course, corresponds to disjunction (you can hardly see it on the slide, but there is a disjunction here); the difference is phi1 and not phi2; the Cartesian product becomes the conjunction, with disjoint sets of variables. What happens to the projection? The projection is an existential quantification, right? That is what you would have expected. And the selection: basically you take your condition and translate it appropriately into first-order logic, and then you take the conjunction of the formula phi with the translation of the condition theta. So this is
a very straightforward translation; it just brings out the flavor of projection as an existential quantification (that is really all there is to it) and of selection as a filter, where you add another condition to the formula. What about the converse? Well, the way we have set up our definitions so far, it is simply not true that every relational calculus expression has an equivalent relational algebra expression. Let us see why. Take this very simple formula, the negation of an atomic formula. We have a problem here, and the problem is that we have not made our universe explicit. Here we would like to take the complement, the difference with the universe, but we do not have the universe around; we have not made it explicit. This is really the problem. At the beginning I said there is a little difference between databases and relational structures, namely not making the universe explicit; now we pay a price for it, and the price is that we lose this translation. But this blatant case is not the only one; there are other ways the translation fails. Say we have departments in a university; departments have chairs, and the administration keeps track of the departments and the names of the chairs. Consider the query: the pairs (x, y) such that there is a z for which z is the chair of department x and y is different from z. But what is y here? We have a problem. And there are other examples like that: say we look for the students who enroll in every course in every term. One can also see that there is no way to express this; it requires proof, but it is not hard to show that there is no equivalent relational algebra expression for this one either. So it is not just one pathological case; there are several reasons that make this translation fail. So let us take a closer look. As I hinted, the previous relational calculus expressions are not translatable to algebra because they give different answers depending on the domain we choose to interpret our variables over; again, the price we pay for not making
the universe explicit. Let us look at the simplest of these three examples. If our variables range over a domain D, then of course the semantics of this expression, not R, is D^k minus R; but as we change D, we get different values. Intuitively, this means that the relational calculus expression not R is not domain independent: it depends on the domain over which we interpret the variables. So now we want to formalize this notion of domain independence. On the other hand, something like the difference that Codd used in relational calculus is domain independent, because, you see, when you try to assign meaning to this expression, you already know that the tuple must be a tuple in S; so even if you consider a bigger domain, it makes no difference, you still end up taking the difference between the two relations. We want to capture this distinction with a precise definition, and this brings in a very important notion in databases, called the active domain. The active domain comes in two parts: the active domain of a formula and the active domain of a database. The active domain of a formula is simply the set of all constants that appear in the formula; this is a very simple thing, you look at the formula and collect the constants it mentions. The active domain of a relational database, which is really the important one, is the set of all values that occur in the relations of the database; you look at the database, think of it as a set of tables, and collect the individual values. Now suppose we have a formula of the calculus and we want to give it rigorous semantics. Remember, we have a database, not a relational structure, so there is no universe around; in some sense the problem is that we have not made the semantics precise. So suppose I have a domain, a universe, which is big enough to make the evaluation meaningful, and this means it
contains the active domain of the formula and the active domain of the database. Then phi_D(I) is the result of evaluating the formula over this domain D and the database I, that is, all the variables and quantifiers are assumed to range over D. It is not meaningful to go below the active domain, because you have to take into account the values that are in the database; and of course the relation symbols are interpreted by the relations in I. If we take D to be as small as is meaningfully possible, namely if D happens to be the union of the active domain of phi with the active domain of the database, then we write the result of evaluating phi on D and I as phi evaluated under the active domain semantics. And now we can say that a relational calculus formula is domain independent if, no matter over what domain you evaluate the formula, as long as it is big enough to be meaningful, what you get is the same as evaluating the formula over the active domain. In other words, all that matters is the evaluation over the active domain; the formula is very stable. Let us look at some examples. Not R is not domain independent; we saw that before: as we change D, we get different answers, D^k minus R. Something like "there exists y such that R(x, y)" is domain independent; that is easy to see. On the other hand, going back to "for all y, R(x, y)", this is not domain independent. Think about it: you may have a very simple database in which all you have is R(1,1). That is my database I; it consists just of this, and the active domain is simply {1}. So if I evaluate this expression, if I look at all the x such that for all y, R(x, y), then of course I get only 1. But now suppose I change my domain; suppose I take my domain to be {1, 2}. Call the formula phi; before, I was evaluating phi on the active domain, and now I evaluate phi on this bigger domain. What do I get? I get the empty set, right? I get the
empty set. So I have changed the domain and I get a different value, because the bigger domain insists that also for y equal to 2 I must have R(1,2); but I have not changed my database. So this is not domain independent. With this notion we can state Codd's theorem precisely. Codd's theorem says that, for a query, the following are equivalent. One: there is a relational algebra expression that gives you the value of the query on every database. Two: there is a domain independent relational calculus formula, a nice, domain independent relational calculus formula, such that q(I) is phi evaluated on the active domain of I (remember, the active domain is the set of all values occurring in your database). Three: there is a relational calculus formula, about which you know nothing (it may or may not be domain independent), but you evaluate the query on the active domain only. The only difference between 2 and 3 is that in 2 you can take a bigger domain and still get the same value, because the formula is domain independent; in 3 you have an arbitrary formula, you do not know whether it is domain independent or not, but you play it safe, you play it safe by restricting your evaluation to the active domain. So let me sketch the proof of this theorem. We have to go 1 implies 2, 2 implies 3, 3 implies 1. It is obvious that 2 implies 3: you use the same formula. So we only have to worry about 1 implies 2 and 3 implies 1. 1 implies 2 we already proved, in some sense: we have to go back to the previous translation of algebra to calculus and argue, at every step, that what you get is something which is domain independent; so you have to do it by induction. 2 implies 3, as I argued, is trivial. So it remains to show 3 implies 1, and here there is a very simple but very important step. The key is to go back and realize that the active domain of I is expressible in relational algebra; that is the first bullet. That is, for every relational database schema there is a relational algebra expression such that, for every database, the active domain
is the result of evaluating the expression on the database. What was the active domain of the database? The set of all values occurring in it. OK, so let us say that we have a relation R with 3 attributes A, B, C. What would be the expression for the active domain? Well, we have the projection, right? So we can take pi_A(R) union pi_B(R) union pi_C(R); this gives us all the values that occur in the database. Very simple, but very important. And now we use the above fact, and induction on the construction of the formula, to obtain a translation of the calculus, under the active domain interpretation, into the algebra. That is now straightforward; the interesting part, really, is universal quantification, because, remember, the algebra does not have explicit universal quantification, it has existential quantification and difference. So of course you use the logical equivalence that "for all y, psi" is "not exists y, not psi". As an illustration, let us look again at the formula "for all y, R(x, y)", which is not domain independent. So what do you do in this case? Well, this is "not exists y, not R(x, y)". The active domain is pi_1(R) union pi_2(R); I am assuming that R is binary here. So here is the induction. Under the active domain semantics, not R is the difference between the Cartesian product of the active domain with itself, take away R; therefore "exists y, not R(x, y)" is the projection of this expression on the first coordinate; therefore "not exists y, not R(x, y)" is the difference between the active domain and the previous expression. So it is very straightforward, and the key to this is that we have restricted the interpretation to the active domain, together with the fact that the active domain is expressible in relational algebra. So the bottom line, let me state the result again: we have a precise statement that gives us the sense in which the algebra and the calculus have the same expressive power. In general they do not; what is true is that under the active domain semantics they have the same expressive power, or, if your
formula is domain independent, you do not have to care about the domain in the semantics. Is that clear? Any questions about this? I am going fast because this is really basic material, but it comes out very clean at the end. However, there are some interesting questions here. First of all, an observation: the equivalence is effective. We can effectively go from algebra to calculus and from calculus to algebra, and therefore, later on, whatever results we prove about the one transfer to the other; if we prove an undecidability result for the calculus, it translates to the algebra, and vice versa. So let us go back and think for a moment about domain independence. It is a very nice property to have, but on the face of it, it is a semantic property, because we say that for every domain that contains the smallest meaningful domain, we get the same semantics. And here we have some bad news if we ask the question: can we automate the process of discovering whether a calculus expression is domain independent? Because suppose we want to give a programmer the power to write queries in first-order logic, but we do not want to have to worry about whether or not the formula is domain independent, and worry about the domain. There is an old result, which actually preceded Codd, by Roberto Di Paola, a logician at the time, who proved this theorem; he has a two or three page proof, using Trakhtenbrot's theorem, which we saw in the morning, that determining domain independence is an undecidable problem: there is no algorithm that tells whether a given relational calculus expression is domain independent. Not surprising, after what we have seen in the morning: in some sense, all non-trivial semantic properties concerning first-order logic turn out to be undecidable. However, there is a next best thing you can do in a situation like this, and the next best thing is what is called an effective syntax. Effective syntax here means the following, and this is the good news: you can give a nice
Effective syntax here means, and this is the good news, that you can give a syntactic description, some context-free grammar, for a subclass of first-order logic that has the following property: every formula in this class is domain independent, and, vice versa, every domain-independent first-order formula is logically equivalent to one in your class. That's the next best thing in the face of the bad news of Roberto Di Paola: you can give an effective syntax that, in some sense, exhausts all of domain independence. Of course, what you cannot decide is whether a given formula and a formula in your class are logically equivalent; all you know is that such a formula exists. There was a lot of work done on this in the 80s, a competition to make the class bigger and bigger, because the idea was to give the programmer the biggest possible class of formulas. Something like the top of the line here is this paper by Rodney Topor and Allen Van Gelder; the original paper was in PODS 87, and this is the 1992 journal version, where they describe such a detailed syntax. Anyway, that's an aside.

Now, this was the first part: as I said, to talk about what Ted Codd did. Now I want to look at three basic problems about database query languages. We have seen what a query is: a query is basically a function that takes as input a database and gives you back a relation, and it is invariant under isomorphisms; of course, all the queries defined in logical languages are queries in this sense. For us, a Boolean query is going to be a function defined on database instances that takes values 0 and 1 and is also invariant under isomorphisms. So, examples of Boolean queries: given a graph, is its diameter at most 3? Given a graph, is it connected?

We want to look at three basic problems about queries: the query evaluation problem, the query equivalence problem, and the query containment problem. Query evaluation is the most basic problem in databases: you are given a query in some language and a database, and you want to find the value of the query on the database.
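As an illustration of a Boolean query in this sense, here is a small sketch of my own (not code from the lecture): a function from graphs, given as node and edge sets, to {0, 1}. It depends only on the structure of the edge relation, so it is invariant under isomorphisms.

```python
# Boolean query: "is the (undirected) graph connected?"
# Input: a database instance, here a finite graph; output: 0 or 1.
from collections import deque

def is_connected(nodes, edges):
    """Return 1 if the graph has a single connected component, else 0."""
    if not nodes:
        return 1
    adj = {v: set() for v in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    # Breadth-first search from an arbitrary start node.
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in adj[v] - seen:
            seen.add(w)
            queue.append(w)
    return 1 if seen == nodes else 0

print(is_connected({1, 2, 3}, {(1, 2), (2, 3)}))  # 1
print(is_connected({1, 2, 3}, {(1, 2)}))          # 0
```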
We saw this problem in the morning, where it was called the model checking problem; it is exactly the model checking problem for whatever language you have in mind. The query equivalence problem is, in essence, simply a version of logical equivalence: you are given two queries and you want to know whether they give the same answer on every database. Of course, this is very important in an actual database management system, because the user writes a query and then the optimizer takes it and transforms it into a query that is presumably easier to evaluate; in the process, you want to make sure that you are working with a sequence of queries that are logically equivalent. The query containment problem is the question, given two queries, whether on every database the relation you get by evaluating the first query is contained in the relation you get from the second; for Boolean queries, this is logical implication.

So we want to understand the algorithmic status of these problems for relational algebra and relational calculus. I already argued that query evaluation is the main problem in query processing. Equivalence and containment are closely related, in the sense that two queries are equivalent if and only if each is contained in the other; and if our language is closed under conjunction, containment is also reducible to equivalence.

We have already seen the proof in the morning, and I'm grateful to the people who gave the nice introductory lectures: the query equivalence problem for relational calculus is undecidable. That's a very easy translation from finite validity. In the morning we saw it as finite satisfiability, but of course the fact that there is no algorithm to tell whether a first-order sentence is finitely satisfiable means there is also no algorithm to tell whether a first-order sentence is true on all finite structures; that is finite validity. You can very easily reduce finite validity to query equivalence, by taking something like a sentence that is trivially valid on all finite structures and then
asking whether or not your sentence is logically equivalent to this finitely valid sentence. So finite validity is reducible to query equivalence, and therefore we have undecidability. We get for free out of this that query containment is also undecidable. Notice that here we have a chain of reductions: the halting problem reduces to finite validity, finite validity to query equivalence, and query equivalence to query containment. So, bad news: for relational calculus and algebra, we have that (1) query equivalence and (2) query containment are undecidable.

Now you can ask: what about query evaluation for the calculus and the algebra? The two problems are the same for algebra and calculus, because of the polynomial-time translation between the two, and we also saw in the morning that both are PSPACE-complete, because the calculus is first-order logic. I had a different sketch of the proof here; in the morning, in Ram's presentation, we saw a very nice proof of membership in PSPACE by showing the problem is in alternating polynomial time. Let me skip the hardness part, which is from quantified Boolean formulas. But you can also see directly that the problem is in polynomial space: you bring the calculus expression into prenex normal form, and then you have these quantifiers, say m of them, and you create m blocks in memory. What do you do in every block of memory? You keep the representation of one element of the active domain in binary, so each block needs only logarithmically many bits. Then you cycle through all possible values, and you also keep a counter in binary to make sure that you have exhausted all the tuples and you don't keep cycling. This gives you a different way to show that query evaluation for the calculus is in polynomial space.
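The cycling strategy just described can be sketched concretely. This is my own minimal illustration in Python, not code from the lecture: the sentence is in prenex form, and the recursion depth equals the number of quantifiers m, with each frame holding a single active-domain element, which mirrors the m memory blocks of the argument.

```python
# Evaluate a prenex first-order sentence over the active domain by
# cycling through all assignments, one quantifier per recursion frame.

def eval_prenex(quantifiers, matrix, domain, assignment=()):
    """quantifiers: list of 'forall'/'exists'; matrix: predicate on a tuple
    of domain elements (one per quantifier, in order)."""
    if not quantifiers:
        return matrix(assignment)
    q, rest = quantifiers[0], quantifiers[1:]
    values = (eval_prenex(rest, matrix, domain, assignment + (a,))
              for a in domain)
    return all(values) if q == 'forall' else any(values)

# Example: over the edge relation E, check  forall x exists y . E(x, y)
E = {(1, 2), (2, 3), (3, 1)}
active_domain = {1, 2, 3}
matrix = lambda t: (t[0], t[1]) in E
print(eval_prenex(['forall', 'exists'], matrix, active_domain))  # True
```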
But what happens with this same argument when you fix the formula? When you fix the formula, the number of quantifiers becomes constant, so you have a constant number m of blocks of memory, and again you keep the values in binary; therefore the whole thing runs in logarithmic space. So this is a direct way to see that, for fixed formulas, the query evaluation problem is in logarithmic space, and therefore in polynomial time. In some sense, this explains the apparent paradox that we have this high complexity and yet database systems give answers to queries: a PSPACE-completeness result sounds frightening, but typically the query is fixed and the database changes, so in that sense we are at least in logarithmic space.

In turn, this consideration led Vardi to write his influential paper, "The Complexity of Relational Query Languages", where he introduced three notions that will be very important to us as we go forward: combined complexity, data complexity, and expression complexity. Suppose you have any query language, call it L. The combined complexity is the model checking problem where both the formula and the database are part of the input. The data complexity is not one problem but a family of problems, one for every sentence of the language: given a database instance, does it satisfy the sentence? The expression complexity is where you play the game the other way around: you fix the database and ask different questions against it, so now the formula alone is part of the input, and you have one such problem for every database. So data complexity is a family of problems parametrized by the sentences, and expression complexity, also called query complexity, is a family parametrized by the databases. You can then say what it means for the data complexity of a language to be in a complexity class, meaning that for every sentence the associated decision problem is in the class; and query complexity in some complexity class means that, for every database instance, the associated decision
problem is in the class. Vardi made an empirical discovery; it is an empirical discovery, not something you can prove, because you cannot go over all possible logics: for most query languages, the data complexity is of lower complexity than both the combined complexity and the query complexity, and often exponentially smaller. This is empirical evidence; you have to check query language by query language. And the query complexity can be as hard as the combined complexity; relational calculus is a case in point.

Here is the picture we have seen today. The combined complexity, as we saw this morning, is PSPACE-complete. The data complexity drops to logarithmic space, so we see the exponential gap between PSPACE and LOGSPACE. The query complexity cannot be worse than the combined complexity, so it is in PSPACE, but it can actually be PSPACE-complete; in fact, we saw this in the morning in Ram's presentation as well, because he used a very simple database, with just 0 and 1 and a unary relation containing a single element, to encode quantified Boolean formulas. So this is the situation with the calculus and the algebra.

In some sense, this looks like very bad news for databases: the two problems, equivalence and containment, are undecidable, and query evaluation, at least in combined complexity, is PSPACE-complete. This motivates the following questions, which I will go through at a slower pace tomorrow. Are there interesting sublanguages of the calculus for which these two problems are at least decidable, and how low can we go? By the same token, are there languages for which query evaluation has lower combined complexity than PSPACE, and how low can we go? As it will turn out, and this will be the topic of our discussion tomorrow, together with some of the things that I think Ben Rossman will be talking about, there is this language of conjunctive queries, which are simply positive existential sentences built from atomic formulas using only conjunction and existential quantification, with no disjunction; and they have this lower complexity.
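As a preview, here is a sketch of my own (not from the lecture) of what such a conjunctive query looks like over a binary relation E. The query Q(x) = exists y exists z (E(x, y) and E(y, z)) uses only atoms, conjunction, and existential quantification, and asks for the vertices from which a path of length 2 starts.

```python
# Conjunctive query Q(x) = ∃y ∃z ( E(x,y) ∧ E(y,z) ),
# evaluated directly over an edge set E.

def path2(E):
    """Return {x : exists y, z with E(x,y) and E(y,z)}."""
    return {x for (x, y) in E for (y2, z) in E if y == y2}

E = {(1, 2), (2, 3), (3, 4)}
print(path2(E))  # {1, 2}
```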
The reason we will explore them from the database point of view is that they encapsulate the most frequently asked queries in databases. So what we are going to do tomorrow is explore these three problems, equivalence, containment, and evaluation, for conjunctive queries; we will get some good news and also some bad news, and then we will try to go a little bit beyond them. In some sense, while before we went outside first-order logic, now we are going inside first-order logic, to see which parts of these problems have more tame behavior than the full relational calculus and algebra. So I will stop here.