Okay, so welcome. Hello everyone. It's 6:30, time to start. Let me start with a few announcements. Several people asked me to move the take-home final to the weekend, so that you don't have to take a day off from work in order to take it. So we did this. The new final is now scheduled as follows, and I realize I wrote the dates in the wrong notation here. It's going to be posted on Friday, December 10th, at midnight, so you can work on it during the weekend, and it's going to be due on December 12th. I wrote the date all wrong on the slide, but I suppose you can figure out what it means. So set aside the weekend to work on the take-home final, and you will have to turn it in at the end of the weekend. Only two days, but this is a final, not a project, so two days is actually a lot. You don't need two days to do the final. Any questions or concerns about the final? Yes? Do we know when the results are going to be in? I ask because for the Microsoft reimbursement there is a deadline. I'm sorry, but I can't hear you. Can you speak louder? Yes? Do we know when the grades are going to be recorded? Because for the Microsoft reimbursement there is a deadline in December; I don't remember the exact date, but it's around the 21st. I was planning to post the grades on the following Monday, which means Monday the 20th. Is that okay with your deadline? It might be very hard. Okay, please send me an email. The initial schedule of the final was better for me, but I really wanted to move it so I could accommodate people who work. As a consequence, I won't be able to grade it the following day as I was planning, because of travel, so I'm going to have it graded on the following Monday, the 20th, one week after the final. If this conflicts with any Microsoft deadlines, please let me know. Okay. Homework four is due next week. This is the homework on transactions.
You get a mixture of questions; none of them is really difficult. Some questions are copied from the other textbook, the one by Garcia-Molina, Ullman, and Widom, and we have included the full text of those questions in the homework. Some questions are from our own textbook; for those we just give the numbers, so you need to look them up. Our textbook is not very, how should I put it, the questions are not always very well formulated, and somebody pointed out an error to me. They ask you to identify which of the given schedules are serializable, but plain serializability only makes sense if you know what the transactions are computing, and those schedules don't tell you what the transactions are doing. So we will drop the part about which schedules are serializable; you only have to answer the other parts, which schedules are conflict serializable, view serializable, and so on. I will send out an announcement in the following days, after I figure out exactly which questions are affected. The next homework, homework five, is posted; it's going to be due in two weeks. It requires some programming and is about database tuning, really a toy database tuning problem. Like most of the little exercises you are doing, it's a little toy exercise, so you get exposed a bit to the concepts without actually having to do too much work. I'm going to lecture a little about tuning concepts in the next lecture, so this goes slightly out of order, but that's okay; the amount of tuning you need for this homework is tiny. I'm also going to ask you to read on your own about database security: read the chapter on SQL access control from the textbook. It's a very well-written textbook. I have a few slides that I sometimes show during lectures, but they are essentially based on our textbook, so you will be much better off reading the textbook itself. We also have homework six ready. Param, I don't know what he's waiting for.
He worked a lot on this homework. It's on XML. It's ready, and it's a lot of fun. He's going to post it in a day or two, so watch for an announcement about homework six. It's posted already? Good. You will also find instructions there on how to use Zorba. Zorba is going to be our XQuery processor. You can also use Saxon, but I think Zorba is more fun to use. Good. So today we first need to finish the discussion from last lecture. If I could switch the slide deck... I see it takes a while until it switches. More than a while. Oh, I see what happens, I have to click on it. Okay. So last time we discussed the ARIES recovery management system. I read a little bit more about ARIES since then, just to give you more context. It was developed at IBM Almaden over many years, and Mohan put it together and wrote a few papers, and he gets most of the credit for ARIES, but I should also point out that it was work done by several people. Another thing I said last time is incorrect, which is funny. I said that IBM holds a patent. Did I say this? It's wrong. Apparently the lawyers at IBM got it wrong: they got a European patent, but they failed to get a U.S. patent on the ARIES recovery management system. Just a spicy story for you, if you like these stories. And since they failed to get the U.S. patent, they also dropped the patent in Europe. So ARIES is up for grabs; this is why everybody can implement it. It's a very tight recovery management system. Very clever, but everything needs to fit together, so you have to read it carefully from the book. I also got stuck last time trying to explain the undo part of the recovery, the last step. Do you remember? Okay, so the question was this. Suppose we start undoing from this crash, and the first thing we want to undo is this write, done by a transaction. Oh, they all belong to the same transaction here, so it's done by one of the transactions. This write to page 1. So what do we do?
What happens if the system crashes? Actually, no, let's follow the picture here. So we undo: we write a CLR, and the CLR points to the next log entry from the same transaction. Now we undo that one; we write its corresponding CLR, pointing back. And now suppose we were in the process of undoing the next update, and as we were undoing it, the system crashed. Can you tell me more? Did you read it in the book? Can you tell me the details of what happens? How are we going to handle these partial updates, done on behalf of this undo, during the restart? We do which one? You start first with the redo, and during the redo you have to redo the CLR records as well. Exactly. That is a key detail that I missed last time. The CLR records also need to be redone. They actually serve two roles. On one hand, they are like physical updates: in a CLR record you write all the physical elements that are being updated, such that you can redo them. On the other hand, they serve as pointers back, so that next time you can skip the logical undo of that record. So here is concretely what happens. Suppose this write to the page was inserting a record on a page. Now you need to undo this, and remember, the undo is a logical undo: instead of inserting the record, you need to delete the record from that page. Deleting the record means moving things around, maybe affecting a second page. The physical changes that you make on behalf of the undo are what you write in the CLR. The physical updates that you do during normal execution of the transaction are written in a regular log entry, but the physical updates done on behalf of the undo are written in the CLR itself, in one atomic write. And the rule is that we always write to the log before writing the updates to disk. So the CLR is written before the actual undo is performed on the page.
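To make the interplay concrete, here is a toy sketch in Python of the two roles a CLR plays: during the redo pass it is replayed like any physical update, and its undo-next pointer lets the undo pass skip work that was already compensated. All record fields and function names are invented for illustration; this is not real ARIES code.

```python
# Toy model of CLR handling during restart. Illustrative only:
# record fields (page, before, after, prev_lsn, undo_next) are invented.

UPDATE, CLR = "UPDATE", "CLR"

def redo(log, pages):
    """Redo pass: replay BOTH regular updates and CLRs, in log order."""
    for rec in log:
        # A CLR carries the physical changes made on behalf of an undo,
        # so it is redone exactly like a normal update record.
        pages[rec["page"]] = rec["after"]

def undo(log, loser_last_lsn, pages):
    """Undo pass for one loser transaction, following back-pointers."""
    lsn = loser_last_lsn
    while lsn is not None:
        rec = log[lsn]
        if rec["type"] == CLR:
            # Already compensated: skip straight past the undone update.
            lsn = rec["undo_next"]
        else:
            # Write the CLR *before* applying the undo to the page
            # (the write-ahead rule), recording the physical after-image.
            log.append({"type": CLR, "page": rec["page"],
                        "after": rec["before"],
                        "undo_next": rec["prev_lsn"]})
            pages[rec["page"]] = rec["before"]
            lsn = rec["prev_lsn"]
```

Note how `undo` appends the CLR to the log before changing the page, mirroring the rule from the lecture; if we crash mid-undo, the redo pass replays the CLRs we managed to write.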
So if the system crashes before we got a chance to actually perform these undo actions, then during recovery we will redo the undo by reading the CLR. Please read the book; the book is pretty clear. And if you want more, I have pointers to papers that describe the ARIES recovery system, but I didn't want to make them generally available so you don't get swamped with too much information. So, a few remarks about ARIES. It pieces together several techniques, and there are several extensions of ARIES, for object-oriented databases and for all sorts of other extensions of database systems. ARIES is used today in most database systems as the reference algorithm for recovery. And again, there is no patent; it's not protected by any patent. Any questions about ARIES? Good. So what we are going to do today: we are going to finish our discussion of concurrency control mechanisms. We have only discussed locks so far, remember two-phase locking. Today I'm going to discuss a class of concurrency control mechanisms that are called optimistic concurrency control mechanisms. They have the potential of being much faster than locks, but if the system is swamped with conflicts, then the optimistic ones don't do that well. And I hope to finish this pretty quickly so we can move on to the next topic, which is XML. Okay, so let's start. I'm going to discuss three such concurrency control mechanisms: a basic one based on timestamps, and a variation of it based on multiple versions of pages; a second one based on validation; and then the particular combination that is used in Oracle, which is called snapshot isolation. And I will point out a tiny little dirty secret of snapshot isolation, which I will show you. So, the first optimistic concurrency control mechanism: we are going to call it the timestamp-based concurrency control mechanism.
And it's very simple. It's amazing: no locks. How can you guarantee serializability, or some variant of serializability, without using locks? How can they do this? Well, they do it using a new concept, the concept of a timestamp. Every transaction, when it starts, receives a unique timestamp. Think of it as a counter; it could be a counter. Transactions start and receive a unique counter value. Or it can be a physical time, the time when the transaction starts. All we care about is that this timestamp be unique to the transaction. Now here is the key invariant that makes these timestamp-based optimistic concurrency control mechanisms work. The invariant is that the serial schedule our schedule will be equivalent to is the serial schedule defined by the timestamps. In whatever order the operations arrive, we are going to allow them to proceed only if we can guarantee that the resulting schedule is equivalent to the serial schedule defined by the timestamps. It's a very simple idea, a very simple concept. So, the first mechanism that we are going to discuss... okay, I haven't yet described the fields that we need to store, so let me show you the main idea first. The main idea is that whenever a transaction requests to write or to read an element, we check for conflicts, exactly the conflicts that we used in the conflict serializability definition. For example, if a transaction T wants to read an element x, then we check who was the last transaction U that wrote it. In order for us to allow this read to proceed, what should be the relationship between the timestamp of U and the timestamp of T? Less than: the timestamp of U should be less than the timestamp of T. Can it be equal? Why not?
The timestamps are defined to be unique, but if two of them happen to be equal here, what does this mean? That there's a problem in my code? No, no: it means it's the same transaction. It means that the transaction that wrote the element last now wants to read it. So we can allow equality here. But what happens if we discover the opposite direction? What do we do? Yes: we just abort, we roll back transaction T. We do not allow it to read. That's typical of optimistic concurrency control mechanisms: we assume the best, but if things go wrong, we have to abort the transaction. This is why they can become very costly when there are many conflicts. And similarly for the other operations: if there is a write request, we need to check that this write does not arrive too late, and it can be too late either because of a prior read or because of a prior write. That's the main idea. So the question is, how do we know the timestamp of the transaction U that last wrote, or last read, a certain element x? The solution is very simple. For every element x, we just remember two things: the timestamp of the latest transaction that read it, and the timestamp of the latest transaction that wrote it. That's all we need to remember. Well, we also need to remember a third thing, a commit bit, but I'm going to talk about that later. So again, the data structures that we need are: for every transaction, a timestamp; and for every element x, two pieces of information, the read timestamp rt(x) and the write timestamp wt(x). Okay, I'm going to skip the next slide and show you how this works in action. So let's look again at our scenario. Transaction T says: I'd like to read x. What do we check? We check for the scenario here. Concretely, we look at the timestamp of the last transaction that wrote x. Here it is. That's the timestamp. Right?
It's exactly the wt value of x. We don't know which transaction wrote it, and we don't care; we only care about its timestamp, which is stored right here. Okay, and if that timestamp is bigger than T's, then we need to roll back T. It's very sad. I mean, it's optimistic, but if it goes wrong, it's very dramatic: we have to abort transactions. Now, if you think back to the lock-based concurrency control mechanism, does that ever have to abort transactions? Yes. When does it have to abort transactions? During deadlocks. It happens less frequently, but it may happen too. So lock-based concurrency control mechanisms are not free of this problem; they may also have to abort transactions. So can we take as a general lesson that you should be very careful about using optimistic concurrency control in a heavy read-write application? Yes, that's it. The rule of thumb is that if the number of conflicts is high, it's better to use a lock-based concurrency control mechanism; if the number of conflicts is low, it's much better to use a timestamp-based, or optimistic, concurrency control mechanism. And database systems can actually combine them: read-only transactions are run using an optimistic concurrency control mechanism, the others using locking, and then the interaction is a nightmare. Yes? Are the terms abort and rollback the same? Yes, abort and rollback are the same. Abort is the usual term; rollback is what SQL uses. Good. Second case: now the transaction wants to write. What do we do here? Same thing. We check if somebody read it, and if a younger transaction read it, we cannot allow our transaction to write it, because it should have written it earlier. This write is too late. And how do we check this? Exactly the same idea.
If the read timestamp is strictly bigger than the timestamp of the requesting transaction, then we need to abort the requesting transaction. That is the read-write conflict. We have another one, and this is a really interesting one: the write-write conflict. Now look at this. Transaction T would like to write, but a younger transaction has already written the element, and we detect this right here. What am I saying? TS is the timestamp of the current transaction, and what we detect here is that a younger transaction has already written the element. How was the younger transaction able to write it when T had read it? Let's suppose transaction T read it right here. So this is a read; what am I writing? rt(x). And that's okay: when V wrote element x, it was allowed to write it, because it's a younger transaction. Remember that the serialization order is T followed by V, because that's the order of the timestamps. But let's focus on the write. T would like to write. What do we do here? T can't write this value, because it was already written by a younger transaction; in the serial order that we are trying to enforce, T comes entirely before V. So what do we do with this write request? Do we abort the transaction? No. It's right here: we don't abort it. Remember view equivalence? We can simply ignore this write. We tell the transaction: you're fine, the write has succeeded. But we actually don't write anything. You look unhappy. This only works if V did not read x, right? Oh, absolutely. We assume that until now there were no conflicts, because at every step we check for these conflicts. You're trying to figure out what this condition here is for? If V had read x, that case would already be handled: you would have had to abort T based on the previous slide. Yes, the stuff I'm marking now in green refers to the previous slide, and then we would have to abort it. But otherwise...
So actually, we should look at this: that case is taken care of by the previous slide. But otherwise, instead of aborting transaction T, we can simply ignore this write and pretend. I tell T that everything went fine, but in reality we don't perform the write. As a consequence, what we obtain is not a conflict serializable schedule, but a view serializable schedule: it is view equivalent to a serial schedule, precisely the serial schedule given by the order of the timestamps. It's a very, very clever concurrency control mechanism. My slides are not in the best order, so I'm going to go back a few slides, because I want to summarize what we discussed so far. So far we have assumed that there are no aborts, which is a bit silly, because we do abort transactions. So there will be aborts; but let's assume for now that somehow there are none. Then the entire concurrency control mechanism consists of these two simple rules. Whenever a transaction wants to read an element x: if whoever wrote it was younger, then we have to abort the transaction; otherwise we allow the transaction to proceed, and we just update the read timestamp of x. Any questions about this rule? And similarly for the write rule. If a transaction wants to write an element x: if somebody younger read it, then it's too late, and we have to roll back. Otherwise, if somebody younger overwrote it, then we do nothing; this is Thomas' rule. And otherwise we do the write and we update the write timestamp of x. Okay. But now we have a problem, namely aborts. Aborts change the whole picture. Remember, we classified schedules according to their behavior in the presence of aborts into two kinds. One kind is called recoverable schedules. And the other kind is called what? Schedules that avoid cascading aborts.
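Before aborts enter the picture, the two rules just summarized can be sketched as a toy scheduler. This is an illustrative sketch, ignoring the commit bit (handled below); the class and its names are invented for the example, not from any real system.

```python
# Toy timestamp-based scheduler: the read rule, the write rule, and
# Thomas' rule, with no commit bit and no cascading-abort handling.

class Abort(Exception):
    """Raised when an operation arrives too late for the timestamp order."""

class TimestampScheduler:
    def __init__(self):
        self.rt = {}   # element -> read timestamp (0 if never read)
        self.wt = {}   # element -> write timestamp (0 if never written)
        self.db = {}   # element -> current value

    def read(self, ts, x):
        if self.wt.get(x, 0) > ts:      # written by a younger transaction
            raise Abort(f"transaction {ts} reads {x} too late")
        self.rt[x] = max(self.rt.get(x, 0), ts)
        return self.db.get(x)

    def write(self, ts, x, value):
        if self.rt.get(x, 0) > ts:      # read by a younger transaction
            raise Abort(f"transaction {ts} writes {x} too late")
        if self.wt.get(x, 0) > ts:      # Thomas' rule: a younger write
            return                      # already exists; silently ignore
        self.db[x] = value
        self.wt[x] = ts
```

Thomas' rule is the silent `return`: the transaction is told the write succeeded, but nothing is written, which is exactly why the result is only view serializable.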
And the reason I'm reviewing them now is that I realized the slides actually refer to this second condition, not the first one. What are the two definitions? When is a schedule called recoverable? It's called recoverable if, whenever a transaction commits, all the transactions that have written elements read by the committing transaction have already committed. And the reason is: if one of those other transactions were to abort, then you would have to abort this transaction as well. Avoiding cascading aborts is simpler. It's not a test that you do at commit time, but a test that you do at every read: every value a transaction reads must have been written by a transaction that has already committed. So the writing transaction should not wait until you commit; it must have committed even before you read its value. And that's the definition that I'm going to use on the following slides: avoiding cascading aborts. In order to achieve this, what we need is to keep one extra bit, which remembers whether the transaction that last wrote the element x has committed or not. It's the so-called commit bit. Okay, so here is the change that we need to make. Remember the first test we did for a read: during a read, we checked that the write timestamp is less than or equal to the timestamp of T; it can be equal, because it might be the same transaction. But once we start thinking about potential aborts, this is no longer sufficient, because the earlier transaction that wrote x may not have committed yet, and that is represented in the partial schedule on the slide. Let's look at it carefully. The other transaction is called U. It's an older transaction; it has written x. But now, when our transaction requests to read x, we cannot allow it to read.
If we allowed it to read, the schedule would no longer avoid cascading aborts. So what we do is check the commit bit, and if it is false, meaning the previous transaction has not committed, then we put this transaction on a waiting list. We put it in a queue and wait for the previous transaction to decide what it's going to do: is it going to commit or to abort? So this is one change. And the second change is the same thing for writes. For writes, we were very happy with Thomas' rule, but now we have to be careful: if the earlier transaction that wrote the element has not committed yet, then we cannot apply Thomas' rule and simply skip the write. Exactly the same principle: we need to wait for the previous transaction to commit, if it hasn't committed yet. Okay, so I hope you get the principle. Here are the complete rules for the timestamp-based concurrency control manager. They are exactly as before, but now we also check the commit bit, and in the right places, if it is false, we have to wait for the previous transaction to commit; same here. I'm not going to read them out carefully, but I encourage you to study them at home. Some of the exercises on the current homework will help you practice a little and understand these rules. They're simple; once you look at them, you'll understand them. Any questions so far about this first optimistic concurrency control mechanism? I think I have a summary slide. Yes, a summary. I wrote "conflict serializable" here, but no, it's actually view serializable; this is a mistake on the slide. The resulting schedule is view serializable, and it avoids cascading aborts. Now, remember phantoms. Phantoms are a big problem in concurrency control; they are much harder to deal with. You can't handle phantoms with an optimistic concurrency control mechanism; you need predicate locks. So they are expensive to handle, and they are not dealt with by any of these optimistic concurrency control mechanisms.
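The commit-bit refinement of the read rule can be sketched as a three-way decision. The function and its names are invented for illustration; `wt_x` is the write timestamp of x and `c_x` its commit bit, following the rules just described.

```python
# Read rule extended with the commit bit: the outcome is now one of
# three actions instead of two. Illustrative sketch only.

def read_action(ts, wt_x, c_x):
    """Decide what to do when a transaction with timestamp ts reads x."""
    if wt_x > ts:
        return "ABORT"   # x was written by a younger transaction: too late
    if not c_x:
        return "WAIT"    # the writer is older but has not committed yet:
                         # queue the reader to avoid a cascading abort
    return "READ"        # safe: written by an older, committed transaction
```

The same pattern applies to the write rule: before skipping a write under Thomas' rule, the scheduler must first wait for the uncommitted younger writer to commit or abort.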
Good. The next version, a very clever extension of the timestamp-based concurrency control mechanism, is based on the idea of creating multiple versions of a page. Here is the idea. Remember, when we decided what to do when a transaction requests a read, one of the rules said that if the element was written by a younger transaction, then we have to abort the reader, because the current transaction wants to read a value that is too old. Well, instead of aborting it, why not keep multiple versions of the same page around? Then we simply serve that transaction an older version of the page. It's a very simple idea. So this is a multi-version timestamp-based concurrency control mechanism; it keeps multiple versions of every page. Okay. Instead of showing you the next slide, let's play a little and imagine how things can evolve. Let's imagine we have an element x, and we have versions for timestamp 5, timestamp 20, and timestamp 33, and that's it. Now transaction 1 has timestamp, let's say, 14, and wants to read x. What do we give it? Obviously we give it x5, right? Now transaction 2 has timestamp 25 and wants to write x. What do we do? Sorry? Do we overwrite x33? These are versions of the same page. Do we overwrite x20, or x33? It's not tricky; it's very logical. None of the above. What do we do? We create a new version, because we don't have a version for 25. So now we create this new version, x25, and we stick it right here. Now transaction 3 has timestamp 28 and wants to read x. What do we give it? Now it gets tricky. What do we return? x25, of course. And now transaction 4 has timestamp 10 and wants to write x. What do we do? This is a tricky question. What do we do now, looking at exactly what has happened on the screen? Yes, we need to abort it. Why do we need to abort it? Is this true, or did I do something wrong? No, I did something wrong; this part is okay.
Yes, we need to abort it. Why do we need to abort it? If we allow transaction 4 to create a new version called x10, something goes terribly wrong. What goes wrong? Which transaction would have seen something wrong? Transaction 1. Transaction 1 had timestamp 14, so if x10 had been here, what would change? Then transaction 1 should have read x10, not x5. It's not difficult, but it's very slippery, because we have the true history, the true sequence of events, but we are mimicking a sequence of events that is imaginary. When you say x5, is that the moment when the transaction was recorded? x5 is a version of the page that was written by a transaction that had timestamp 5, yes. For example, when transaction 2, which had timestamp 25, wrote the element, we created a new version that we called x25. Why do we need to abort this write? What happens in the future, with new transactions? Well, consider the transaction with timestamp 33. What should it do? It has already written x33; this transaction has been active for a while. Everything is fine: if it wants to read x, we just return its own version, x33, and we're happy with it. Maybe a better question is: what if transaction 6 has timestamp 30 and wants to read x? What do we return? Which one? x25. But you see, that's the difference. We have this x25; we can see it, it's physically present, and we can return it to transaction 6 because it is the most recent version that we currently have. The problem with T1 was that if x10 had been present, that is the value we should have returned to T1. But we didn't have x10 at that time; instead, we had x5. And now T1 is gone; we can't change our mind. T1 has seen the value x5.
When T4 wants to write version 10, we can't go back and give the value x10 to T1, which is what the serial order would require. So we're stuck; the only possibility is to abort T4. But when transaction 2 wrote x25, by Thomas' rule, wouldn't we just skip a conflicting write like this? Oh, you're thinking about Thomas' rule. We cannot apply Thomas' rule here, and the difference is the multiple versions. Because we allow transactions to read back in history, to read old versions, Thomas' rule no longer applies. Thomas' rule only works when only the latest version survives: then a write of an older version essentially disappears. That is exactly what Thomas' rule does; it takes advantage of the fact that only the latest version is kept. Okay, so I hope you got the idea behind this mechanism. Let's summarize and ask the basic questions. What information do we need to keep around? We need multiple versions of each page. We don't need the write timestamp, because the version number is the write timestamp. Do we need a read timestamp? Did we ever care about who read a page? Sure: for T4. In order to abort T4, we needed to know who read x5, and that was the transaction with timestamp 14. Okay, so we do need the read timestamp. Next question: how many of these copies do we keep around? New versions are created all the time; when do we start deleting old versions? Yes? If T5 commits, and let's also assume that T5 is the oldest transaction, can we erase version x5? Not necessarily. What can happen? Maybe a transaction with timestamp 6 is still active, and if that transaction wants to read, we need to give it x5. So when can we remove a version?
Only if there is no active transaction older than the next remaining version; in this case, that would be x20. There should be no transaction older than the oldest remaining version of the page. Okay, so let's summarize. This is what happens. Whenever there is a write, we create a new version, or we overwrite the existing version if there is already one with that timestamp. So we don't need a write timestamp; that's part of the version. Whenever a read occurs, we need to maintain the read timestamp. And when a write occurs, there is actually a case where we may have to abort; apparently I missed mentioning that case on the slide. When can we delete a version? We can delete it if there is no active transaction older than the next remaining version. That's when we can delete it. Okay, and this is the multi-version concurrency control mechanism. It's actually very simple; very clever, but very simple. Any questions about it? Okay, plenty of opportunities for you to practice it on the homework. The last theoretical concurrency control mechanism, before we discuss snapshot isolation, uses a different concept. So far we have seen two concepts: the timestamp concept and the multi-version concept. The next concept that we are studying is validation. This one tries to be even more aggressive. It says: let the transactions read and write without any precautions, but with the following two rules. First, all transactions must do their reads before their writes, and the writes happen at the end; so each transaction runs in two phases, first it reads, later it writes. And the second rule is that before a transaction writes, it must validate. There is a validation phase, and during that validation we decide whether we can allow the transaction to proceed or not.
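Before we dive into validation, the multi-version rules just summarized can be sketched in code, replaying the x5/x20/x33 example from above. This is a toy model with invented names: each version stores its value and the timestamp of its youngest reader.

```python
# Toy multi-version timestamp element: versions keyed by write timestamp.
# Illustrative sketch only; not from any real system.

class MVAbort(Exception):
    """Raised when a write arrives too late (the T4 case)."""

class MultiVersionElement:
    def __init__(self, versions=None):
        # write timestamp -> [value, timestamp of youngest reader]
        self.versions = versions or {}

    def _latest_at(self, ts):
        older = [t for t in self.versions if t <= ts]
        return max(older) if older else None

    def read(self, ts):
        t = self._latest_at(ts)
        if t is None:
            raise MVAbort(f"no version visible to reader {ts}")
        entry = self.versions[t]
        entry[1] = max(entry[1], ts)    # remember the youngest reader
        return entry[0]

    def write(self, ts, value):
        t = self._latest_at(ts)
        if t is not None and self.versions[t][1] > ts:
            # A younger transaction already read the version this write
            # should have replaced: too late, the writer must abort.
            raise MVAbort(f"writer {ts} is too late")
        self.versions[ts] = [value, ts]

    def vacuum(self, oldest_active_ts):
        # A version may go only when no active transaction is older than
        # the next remaining version.
        keep = self._latest_at(oldest_active_ts)
        if keep is not None:
            for t in list(self.versions):
                if t < keep:
                    del self.versions[t]
```

Running the lecture's scenario: the reader with timestamp 14 gets x5, the writer with timestamp 25 creates x25, the reader with timestamp 28 gets x25, and the writer with timestamp 10 is aborted because x5 has already been read by a younger transaction.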
And this is when we check for all possible conflicts. If there is a conflict, then we abort the transaction. If the previous one was an optimistic concurrency control mechanism, I would call this one super-optimistic; it's even more optimistic. Now, there are three times involved here: the time when the transaction starts, the time when the transaction validates, and the time when the transaction terminates, commits, or aborts. And the serialization order is given by the validation times: they dictate the order in which the transactions appear in the equivalent serial schedule. Okay. So let's see how it works. There are only two rules, very simple, but we need to study them carefully. We are here; we are this transaction, transaction T. We started at this time, and we read some elements. By the way, we know for each transaction which elements it reads: the read set RS(T) of transaction T. And now we start the validation; this is the validation time, which we call val(T). So what do we do? We check all the other transactions for conflicts. Imagine another transaction, like this transaction U, this one here. U is older than T, older than us. We detect that; actually, all we need to know is that U has already validated. This is what tells us that U is older: we are validating right now, U has already validated, and therefore it is older. Okay. So what can go wrong? We are checking here for read-write conflicts: conflicts between something that we might have read here and something that U might write later. That would be bad; U should write before we read.
If U might still write while we are validating, or maybe it wrote after we read, that's bad. And it's very easy to detect. Here is how we detect it. We check whether the read set of transaction T and the write set of transaction U have something in common. If there is a common element, that would be the conflict. And the second thing we check is this here: we check whether U finished before we started our reads. Right? So if U were like this, if it finishes before we start reading, then we are fine: U has finished writing everything that we read. But if U finishes only after we started to read, or is maybe still continuing, then we have a conflict. And in that case, we do a rollback. There is a lot of information on this slide, but that's all there is. This is all that we need to worry about. But you need to think a lot to understand why this works. Notice how conservative, in some sense, this is. I want to call it optimistic; it's a very optimistic concurrency control mechanism, but it also has lots of false positives. For example, this is what can happen: maybe U wrote something and T read it later. That would be allowed under the timestamp concurrency control mechanism that we discussed earlier, but not under this mechanism. Because here, we don't keep track of the time at which every individual read and write happened; we do only a global check at validation time. This makes validation very efficient: you are just comparing two sets, and all the other reads and writes proceed without any delay. But at the same time, it can result in more rollbacks than needed. Okay. This is for read-write conflicts. And the other rule is for write-write conflicts. It's quite similar.
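The read-write rule above, together with the analogous write-write rule the lecture turns to next, can be sketched as a single validation function. This is a hedged sketch under my own assumptions: each transaction is a record with start, validation, and finish times plus a read set and a write set; all names are illustrative.

```python
# Sketch of validation-based (optimistic) concurrency control.
# Assumption: "others" contains every transaction that has validated,
# with its start/val/finish times and its read and write sets.

def validate(t, others):
    """Decide whether transaction t may proceed, checking it against
    every transaction u that validated before t did."""
    for u in others:
        if u["val"] >= t["val"]:
            continue  # u validated after us: it will check against us
        # Read-write rule: u wrote something t read, and u did not
        # finish before t started reading.
        if (t["read_set"] & u["write_set"]) and u["finish"] > t["start"]:
            return False
        # Write-write rule: the write sets overlap, and u had not
        # finished by the time t started validating.
        if (t["write_set"] & u["write_set"]) and u["finish"] > t["val"]:
            return False
    return True

u = {"start": 1, "val": 3, "finish": 6,
     "read_set": {"A"}, "write_set": {"B"}}
t = {"start": 4, "val": 7, "finish": None,
     "read_set": {"B"}, "write_set": {"C"}}
conflict = not validate(t, [u])  # t read B, which u was still writing
```

Note how coarse the check is, which is exactly the source of the false positives mentioned above: we only intersect sets, we never ask at what time within the interval each read or write happened.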
So here, we validate and we check whether the write sets are in conflict and whether U had not finished when we started validation. In that case, if we did our write here, then we can have a conflict in this direction. And we don't like this: we can't allow U to write after T writes. So this is concurrency control based on validation. Any questions about it? Very simple. Deceptively simple, in some sense. Okay. Then I want to show you how snapshot isolation combines some of the concepts that we have seen in these more theoretical concurrency control mechanisms. And snapshot isolation, I'm sure you have heard about it, because it's implemented in many database systems. Oracle is famous for it: they always implemented the snapshot isolation concurrency control mechanism, and this is why Oracle is so efficient at handling large workloads of transactions. But the dirty little secret of snapshot isolation is that it is not serializable. And it's not because of phantoms: it's not serializable even over a static database. That is its dirty little secret. So let's see what it is. It's a multi-version concurrency control mechanism. Every transaction receives a timestamp, and then it essentially sees a snapshot of the database at that timestamp, which in essence means that whenever it reads something, it will get the version that corresponds to its timestamp. Okay, we have seen this already. There are no delays: you want to read, go ahead. The only problem is when you want to write. If somebody else has written that page, then there is a very drastic rule: first committer wins. Whoever wrote that page first wins, and the next transaction that wants to write that page will be aborted. Okay. So that's the rule, and this is snapshot isolation.
So in some more detail: we have multiple versions, as we saw before. When a transaction wants to read, we return the corresponding version. And if the transaction wants to write X, then we check whether another transaction has updated X. Because if it has updated X, then we don't do anything: we just abort the current transaction. And I made a comment here that this is not exactly "first committer wins", because at this point the transactions haven't actually committed yet. But since we only allow one winner, it will commit first anyway, because it's the only transaction that continues. Okay. So let's discuss this a little bit. There are no dirty reads. What was a dirty read? It's when a transaction reads a value with what bad property? A value written by a transaction that has not committed yet. Can we have dirty reads? No, because we only see a snapshot, and the snapshot, by definition, contains only the transactions that have committed before our timestamp. There are no inconsistent reads. What was an inconsistent read? You read once, and when you read a second time, somebody has written it in between. Can this happen? No, because you always get the same version of that page. There are no lost updates. What was a lost update? A transaction wants to write, then the next one writes, and the update of the first one is lost. This can't happen, because if there is such a conflict, then the second transaction will be aborted. However, snapshot isolation does not detect so-called read-write conflicts. This is what snapshot isolation does not do. So let me show you this in detail. This is the template of a schedule that is allowed under snapshot isolation but is not serializable. It's not conflict serializable, it's not view serializable, it's not serializable. It's wrong.
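The mechanics just described, versioned reads plus the first-updater check on writes, can be sketched as follows. This is a minimal sketch under my own assumptions (a logical clock as the timestamp, one version list per key); it is not any particular engine's implementation, and all names are made up.

```python
# Sketch of snapshot isolation: every transaction reads from the
# snapshot taken at begin(), and a write aborts if anyone else has
# updated the same element since that snapshot.

class SnapshotStore:
    def __init__(self, initial):
        self.clock = 0
        # versions[key] = list of (commit_ts, value), ascending
        self.versions = {k: [(0, v)] for k, v in initial.items()}

    def begin(self):
        return {"snap": self.clock, "writes": {}}

    def read(self, txn, key):
        # Newest version committed at or before the snapshot,
        # or the transaction's own uncommitted write.
        if key in txn["writes"]:
            return txn["writes"][key]
        return max(cv for cv in self.versions[key]
                   if cv[0] <= txn["snap"])[1]

    def write(self, txn, key, value):
        # First updater wins: abort if key changed after our snapshot.
        if self.versions[key][-1][0] > txn["snap"]:
            return False
        txn["writes"][key] = value
        return True

    def commit(self, txn):
        self.clock += 1
        for k, v in txn["writes"].items():
            self.versions[k].append((self.clock, v))

db = SnapshotStore({"x": 60})
t1 = db.begin(); t2 = db.begin()
assert db.read(t2, "x") == 60       # t2 sees its snapshot
assert db.write(t1, "x", 70); db.commit(t1)
lost_update = db.write(t2, "x", 80)  # aborted: t1 updated x first
```

Note that `lost_update` comes back `False`: that is exactly the "no lost updates" guarantee from the discussion above.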
Okay, I have a funnier slide with the same template coming next, but let's work out this abstract example first. So imagine two elements, x and y, and the rule is that at least one of them has to be positive; they can't both be negative. Look at transaction one. If x is positive, and actually if it's greater than 50, then it will set y to be negative, and it writes it. Look at transaction two. Transaction two does the same, but in reverse: if y is greater than 50, then it sets x to negative and writes it. Okay, now let's suppose that initially both are 60; these are the values at timestamp zero. What snapshot does t1 get? It's a non-question, okay? Snapshot zero, of course. It's going to see the values 60 and 60. Similarly t2: it's also going to see the values 60 and 60. Therefore, t1 is going to want to write y, so it's going to write minus 50 here, okay? Now t2 wants to write x. Is t2 allowed to write x? Sure, there is no conflict between the writes. So t2 is going to be allowed to write x, but this is not a serializable schedule, right? If you look at the corresponding schedule, read by the first, read by the second, write by the first, write by the second, we have a conflict. Because, well, what happens here? Here one wants to go before two, and here one wants to go after two. So we have a conflict, where there is a cycle in the precedence graph. So that is a problem with snapshot isolation: it does not detect read-write conflicts. These are called write skews. And that's life: this is how snapshot isolation works. So in case this example was confusing, I have a joke here. Imagine ACID-land, and ACID-land had two viceroys. It's a political joke. They are called delta and rho. And their budget had two registers: one was called x, like in taxes, and the other was called y, like in spending.
And initially, the taxes were high and the spending was low. And the viceroy delta said, well, if the taxes are high, then we can set the spending to high. And the viceroy rho said, well, if the spending is low, then we can set the taxes to low. And they both wrote and committed, and under snapshot isolation this is allowed to proceed. But since then, they have had a deficit ever after. And that's the sad story about snapshot isolation. How can you prevent this? If you're using Oracle, are you doomed to have non-serializable schedules? I'm thinking about what you should do if you were to write an application where you have to read and write elements like this, like the one on the screen. Let's go back to the abstract one, so you don't get carried away by the political content. If you have to write this particular application, how can you cope with the fact that Oracle implements snapshot isolation? Snapshot isolation suffers from write skews. Yes, what would you do? You'd build your own level of locking in there, kind of a lock column that you could update before you make changes to certain types of records, to make sure that you're the only one modifying them. That's one possibility; then you have to read the Oracle documentation and see if they allow you to take locks. I was thinking at the application level, not in the database. Oh, at the application level. That's a bit tricky, because in a real application there are many programmers, so now they all have to agree on the same locking scheme. It's possible, but it's difficult. There is a much simpler trick, a certain programming discipline that, if you follow it, will guarantee that your program is serializable. We put in a check: x bigger than 50, y bigger than 50?
Well, no. Whatever you check here, you will see your snapshot, and your snapshot looks fine. So you will not detect the inconsistency. Yes? Write back everything that you read, even if you don't change it. Ah, I didn't think about this. So what he wants to do is write back x, and similarly the other transaction should write back y. If you do this, and you do it consistently for all transactions, everything you read you write back, will you get serializability? I think... no, I think you should. Will you still get conflicts in that situation? Well, the difference is that now the snapshot isolation concurrency control mechanism will abort the second transaction. So now, yeah, you have transformed a read-write conflict into a write-write conflict, which results in an abort. That's a solution, but it's not the solution I had in mind. And I think the other one, the one that textbooks recommend, is probably slightly more efficient. You insert reads before writes: even if you don't need y, before writing it you read it. It's more efficient because reads are always cheaper than writes, and by inserting reads you will not introduce additional conflicts, while by inserting writes you might introduce some additional conflicts. Great idea. So that's exactly the strategy to ensure serializability, and this is what Oracle recommends. If you don't care about write skew, then don't worry. If you really care about serializability, then program like that and you'll be fine. Okay, so some last comments about concurrency control mechanisms. We have discussed two strategies to implement a concurrency control mechanism. The first is based on locks, and there, remember, we discussed two-phase locking and strict two-phase locking. The second is an entire class of concurrency control mechanisms called optimistic concurrency control mechanisms, which combine timestamps with multi-version and with validation.
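The write-skew anomaly and the "write back what you read" fix suggested by the student can both be shown in a tiny simulation. This is a sketch under my own assumptions, using only the first-updater-wins check from snapshot isolation; the variable names and the threshold 50 come from the x/y example above, everything else is made up.

```python
# Sketch of write skew under snapshot isolation, and the fix of
# writing back everything you read (turning a read-write conflict
# into a write-write conflict, which aborts one transaction).

def run(write_back_reads):
    committed = {"x": 60, "y": 60}       # invariant: x > 0 or y > 0
    snap1, snap2 = dict(committed), dict(committed)
    updated = set()                       # keys updated since the snapshots

    def try_write(writes):
        if updated & set(writes):
            return False                  # first updater wins: abort
        committed.update(writes)
        updated.update(writes)
        return True

    # T1: if x > 50 in its snapshot, set y negative (reads x, writes y).
    w1 = {"y": -50} if snap1["x"] > 50 else {}
    if write_back_reads:
        w1["x"] = snap1["x"]              # also write back what we read
    # T2: symmetric (reads y, writes x).
    w2 = {"x": -50} if snap2["y"] > 50 else {}
    if write_back_reads:
        w2["y"] = snap2["y"]

    ok1, ok2 = try_write(w1), try_write(w2)
    return committed, ok1, ok2

state, _, _ = run(write_back_reads=False)     # both commit: invariant broken
fixed, _, aborted2 = run(write_back_reads=True)  # second writer aborts
```

Without the write-back, both transactions commit and the invariant is violated (x and y both end up negative); with it, the write sets overlap, the second transaction aborts, and the invariant survives.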
And snapshot isolation is one in this family. As we discussed earlier, there is a compromise. If you have few conflicts, then the optimistic concurrency control mechanisms are better: they simply allow transactions to proceed, especially read-only transactions. But if you tend to have conflicts, the optimistic ones end up aborting too many transactions, and then you're better off with a pessimistic concurrency control mechanism. So what database systems do is implement both. I don't know how they handle the interaction between the two, but I imagine it's not impossible. And then the recommendation is that transactions that are read-only are implemented using timestamps, and the ones that are read-write are implemented using locks. Okay, and that's the last thing I wanted to say about concurrency control mechanisms. Any questions? Then I suggest that we take the break now, because I want to start a completely new topic, and it doesn't make sense to start it for five minutes. So let's take a break, say a six-minute break, and then we will discuss XML, as far as we get with XML. So is everyone back? I didn't start counting the six minutes that I promised, but it seems that everyone is back. Can we start? Oh, people are still coming. Yeah, let's start. So what I was hoping to cover in this lecture, and it's okay if we continue next time, is a completely new topic, which is XML, XPath, and XQuery. And here is what I want to cover. I want to first describe XML, which is really very simple. It's just a data exchange format. But I really want you to hear it from me, as opposed to reading about it; there is so much fluff about XML out there. And it's good that you understand how to think about XML from the perspective of data modeling. And then there are two languages that we need to cover for XML. One is called XPath, which is a very tiny language. And then there is something like SQL for XML, called XQuery, which is a richer query language.
And I hope to be able to cover these two today, and maybe we'll discuss XQuery next time. In terms of interpreters, we will use Zorba in homework six. Zorba is an XQuery interpreter that was developed by Dana Florescu, who was one of the... she participated in the working group for XQuery. And then she had a startup that was bought by... well, it's a long story. Right now she's at Oracle; she has a small research group there, apparently their only research unit. And she develops this XQuery interpreter, which has many, many possible applications. It's a very cute interpreter, so I'm sure you will enjoy it. In terms of installing it on your computer, you're probably much more expert than I am at downloading and installing software, but it took me half a day to install it on a Mac, because it doesn't come as a binary for the Mac. It comes as a binary for Windows, which is trivial to get, but for a Mac you need to download all these other packages, and it's a nightmare. But in half a day I managed to install it. So if you're using a Mac or Linux, you might need to set aside a longer time to install Zorba. Once you have it, it's a lot of fun to use, and you might consider using it later in your job because it has many potential applications. And it's free. Good. So, XML. Some additional readings here. XML is standardized by the W3C, which is the World Wide Web Consortium. You are welcome to read the official documents; if that's what you need to do, you can access them. This is the URL for the XML standard, and this is the URL for the XQuery standard. They're difficult to read, and you don't need to actually read them. You can rely just on the lecture notes and on these three chapters in the book. And in general, think of this as a lightweight topic: this is one of the easiest topics that we cover in this class. Okay, so what is XML? It's a data format. It's used in many places, like in web services and in configuration files.
It's now used as a replacement for the binary formats of Microsoft Office. HTML comes in a variant that is XML. But what we are going to consider it for is only this: XML as a data format for exchanging or sharing data, or for representing semi-structured data. So far we discussed only relational data. XML is an instance of something called semi-structured data, so look forward to a discussion of what that means, what semi-structured data is. So we have seen only relational databases, and if you think they are too rigid, XML is more flexible, and this is why we call it semi-structured. However, everything we learned about data anomalies and data normalization now goes down the drain: XML is not normalized. It's a data exchange format that's not normalized. Good. So what's the best first slide to expose you to XML if you've never seen it before? I have these slides from the late 90s, when I was doing active research on semi-structured data, and it was a really hot topic. And my way to present XML is in contrast to HTML. If you look at a website, under the hood there is HTML, but what you see is something that you can immediately interpret. Right, you look at this website. What are these two items? What do you think they are? What are they? Sorry? I didn't hear. Oh, no, no, in terms of semantics. What do they represent? They're books. They refer to books. That's what they are. And as humans, we can immediately recognize the fields. What is this field? That's the title of the book. What are these? Authors. What's this? Publisher. Publisher, and this is a year. Now try to imagine an application, try to imagine yourself writing an application, that goes to this particular website and tries to extract this information and identify the fields. Well, this is what you're facing. You're facing HTML. So you can parse it; I'm sure you can do this very quickly.
But the problem is that this so-called screen scraper is very brittle. If the author of this page changes his mind and represents the title not in italics but in something else, then your application needs to be rewritten. It needs to be updated, and that makes it very brittle. Think about XML as describing the content, while HTML describes the presentation. Here is how you can represent the same information in XML. Instead of italics and boldface, we use tags like title, author, another author, another author, and there is a publisher, and here is a year. And we put everything in a book, and then there is the next book, and we put everything inside the bibliography. That's XML. So one way to think about this is that it is a replacement for HTML, in the sense that it describes the content as opposed to describing the presentation. Okay, so let me go quickly over the basics. Those things that I underlined we call tags. So book is a tag, title is a tag. Every tag comes in two copies: there is a start tag and there is an end tag, and they must match. Everything between these two matching tags, including the tags themselves, is called an element. So this is an element, and this is an element. And elements can be nested, but they have to be well nested: you can't have overlapping elements. And one more little detail: if you have an empty element, where you don't have any content, then you can replace the begin tag and end tag with a single tag that looks like this, with just a slash at the end. And the final rule: in an XML document, we must have a single root element. There is a single big root element. Okay, and then it's called well-formed. This is the standard term used by the W3C: an XML document is called well-formed if it has matching tags. Okay, one more thing. In XML, we can also have attributes. For example, next to book, we can say the price is 55 and the currency is USD. So you have the freedom to represent data in attributes. So where should we put the data?
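The well-formedness rules just listed (matching tags, proper nesting, a single root) are exactly what any XML parser checks. A quick sketch with Python's standard library; the bib/book document is made up for illustration.

```python
# Well-formed vs. not well-formed: a parser accepts the first
# document and rejects the second, whose tags overlap.
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

good = "<bib><book><title>DB</title><author/></book></bib>"
bad = "<bib><book><title>DB</book></title></bib>"   # overlapping tags

doc = parseString(good)               # parses: well-formed
root = doc.documentElement.tagName    # the single root element, "bib"

try:
    parseString(bad)
    well_formed = True
except ExpatError:
    well_formed = False               # mismatched tags are rejected
```

Note the `<author/>` element: that is the single-tag shorthand for an empty element mentioned above.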
Should it be in attributes or should it be in elements? For example, we could move the price here, and we could put the currency here. And once you realize that there is really no distinction between putting it in elements and in attributes, you start wondering why they had attributes in the first place. What's the role of attributes? Any ideas? Yes? Is it tied to the element? It's semantically tied to the element. Yes, but if I make it a child element, you could argue that it has a slightly different semantics. But there is actually a different reason why we have attributes in addition to elements. This goes back to the history of XML. XML is actually inherited from SGML, which is a horribly complicated language. So XML is a strict and much simpler subset of SGML. And the idea of these markup languages was that you take an existing text and you add elements and attributes to do the markup, to do the annotation. So there is this well-defined text, which is the original document that has a semantics that people can read, and that you are annotating. So now it makes sense; it becomes clear what the role of attributes is. An attribute is a value that you want to add to this text without pretending that it's part of the text. But when we represent data, this distinction goes away: both elements and attributes are just data. Yes? But for cases like author, where you can have more than one author for a particular book, you couldn't really do that with attributes, because can you have duplicate occurrences of author in the attributes of book? That's correct: you cannot have multiple occurrences of the same attribute. Actually, let me summarize the distinction between them. There are three important distinctions between elements and attributes. The one that your colleague mentioned is this one: you can have repeated elements, you can have multiple authors, but the attributes must be unique.
That's the definition. Now, if you think about XML as a data exchange format, there is absolutely no reason for these rules; there is absolutely no reason to have attributes in the first place. But considering where XML is coming from, these rules make a lot of sense. The second distinction is that the elements are ordered. If we put title before author, that's different from putting author before title, and that's something you can check in the application. Attributes are unordered: you can't check in the application whether this attribute comes before that one. That's hidden. And the last rule is pretty obvious. Elements may be nested; you can have arbitrary levels of nesting of elements. But attributes, no, they can't be nested. They are just strings; they're atomic. Okay. So let me skip this comparison to... actually, no, let's have this discussion. So XML is pretty much like HTML: HTML also has tags and attributes. So how would you describe the differences? What are the major differences between XML and HTML? In HTML, the tags are fixed, while in XML we can invent our own tags. How many tags are there in HTML? Who knows? It depends on the version. Last time I checked, which was several years ago, there were around 200 in whatever version I checked back then. That gives you a ballpark estimate of the number of tags. But that's it: you live with these tags. In XML, you can invent whatever you want. Yes, you had a comment? The later versions of HTML are just XML with a schema. With what? So, like, the strict variant of HTML is XML with a schema? XML with a schema. Okay. But okay, the schema is given. So then the discussion is not that meaningful, because then HTML is just one particular instance of XML. Okay, let's move on then; this discussion is no longer that interesting. So, XML is very simple.
That's all the syntax that you should care about. I'm going to show you some more syntactic details in a few slides, just for completeness, but everything you ever need to know about XML was on the slides that you have already seen. What is really interesting, and what we will spend some time on in a couple of slides, is the semantics: what does it mean to store data in XML? So let me show you a few more syntactic idiosyncrasies of XML. XML has this thing called OIDs and references. Here is a good definition of an OID: an OID in XML is like a key defined by someone who never took a class on databases. Okay, so it's a poor man's key. And the way they define this key is as follows: if you have an OID attribute, and it always applies to an attribute, then it's a global key. Wherever that attribute occurs, its value must be distinct. Well, we know much better, right? We know that every relation must have its own key, which has nothing to do with a key in a different relation. So later they fixed it; they fixed it in XML Schema. But in the original version of XML, that was the definition of an OID. Then they had something called an IDREF. So there was an ID and an IDREF. An IDREF is like a foreign key defined by someone who never took a database class. That's exactly what they mean: if an attribute is an IDREF, its value must occur as an ID in that XML document. And you can't say where, or on what kind of element; it just has to occur. So in this example, what happens here? This IDREF refers to this ID. But it's just syntax. There is no one out there to do the join for you, or to enforce any kind of constraint the way we enforce them in a relational database. This all depends on the application of your XML. More syntax. Sometimes in the text you would like to write characters that might be interpreted as the beginning of a tag.
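The point that "there is no one out there to do the join for you" can be made concrete: following an IDREF is the application's job. A sketch with the standard library; the element and attribute names are made up for illustration.

```python
# ID/IDREF is just syntax: the application builds the index and
# follows the reference itself, like a join done by hand.
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<data>
  <person id="p1"><name>Mary</name></person>
  <order person="p1"><item>book</item></order>
</data>
""")

# Build the ID index ourselves, then resolve the IDREF manually.
by_id = {p.get("id"): p for p in doc.findall("person")}
order = doc.find("order")
buyer = by_id[order.get("person")].find("name").text
```

Nothing in the parser checked that `person="p1"` points anywhere; if the reference dangled, the lookup would simply fail at the application level.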
Then you embed them in this thing called a CDATA section, which has this very complicated syntax. Everything I wrote here is part of the syntax: you can write any text, and then you end it with this. That's called the CDATA section. Then there is something called entity references. These are the things that begin with an ampersand. Some are easy: if you want a less-than sign, like in HTML, you can write &lt;. The interesting story here is that XML is so simple, right? I could cover it in like two or three slides. The document that describes XML, last time I checked it, which was like seven or eight years ago, was 40 pages long. So you wonder, what do they have to say in those 40 pages? In about half a page, they describe what I described so far. The rest is about entity references; it's about this stuff. And we never use them for data representation. The last piece of syntax I want to show you is interesting. We have comments in XML, like in every language, and they look like what you see here on the slide. But the comments are part of the data. In the application or in the query, you can check for the comments, you can count them, you can filter them, you can do whatever you want: they are part of the data model, which is kind of bizarre. So you can't hide information by commenting it out; it's part of the data. Okay, let me skip namespaces. And finally we get to the semantics. So what does an XML piece of data mean? It's a tree. That's what it means. It's not a relation, it's not an entity-relationship diagram. It's a tree. Let's look at this example here. On the left, we have this XML document, where the outermost element is called data. That's the root node of the tree. It has two sub-elements of type person. Here they are: person, person.
And we go deeper and deeper. This person has a name and an address: here they are, name and address. And it also has an ID, which is right here. So the semantics of an XML document is a tree. And in this tree, we have three kinds of nodes. Actually, there are many more, but these are the three that are really important for data modeling. There are element nodes that correspond to elements, there are attribute nodes that correspond to attributes, and then there are text nodes that correspond to the leaves: these are the text values. And that is the XML data model. It's actually much more baroque and much more complicated than it needs to be, for various reasons. One reason is, actually, let me ask you: are the nodes in this tree ordered, or are they not ordered? The answer is actually harder than what you read on the slide. Are the nodes in the tree ordered or not? Well, this is funny: the element nodes are ordered, but the attribute nodes are not ordered. So, mathematically, it's a very strange beast: it's a tree in which, for some children, you care about the order, but for the attribute children you do not care about the order. And in addition, there are all sorts of other nodes, like comment nodes; the root node is a special kind of node. There are, if I remember correctly, like nine or ten different types of nodes for XQuery. But for our purposes, XML is just a tree. Okay, now let's think about what kind of data we can represent as a tree. The important point that I'd like to convey to you is that XML is, to some extent, self-describing. It's self-describing in the sense that the schema that we usually keep in the database catalog is now embedded with the data itself. Think about the schema as being like the relation name and the attribute names: persons is a relation name.
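The three node kinds above, element, attribute, and text, and the ordered-children versus unordered-attributes distinction, can be seen directly in code. A sketch with the standard library; the person document mirrors the slide's example, with made-up values.

```python
# The XML tree model: element children are ordered, attributes live
# in a dictionary, and text values are the leaves.
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<data><person id="o555"><name>Mary</name>'
    '<address>Maple Av</address></person></data>')

person = root[0]
child_tags = [child.tag for child in person]  # ordered: name, address
attrs = person.attrib                         # attribute nodes, a dict
name_text = person.find("name").text          # the text (leaf) node
```

Swapping name and address in the document would change `child_tags`, but attribute order inside the tag makes no observable difference: that is the strange, half-ordered tree described above.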
And name and phone are attribute names in that relation. In XML, these pieces of information become the element names, the tags, and they are interleaved with the data itself. As a consequence, XML is much more flexible. You can just ship it as one file, and at the other end there is an application that can understand how to parse it. You can't send a relational database in an email, right? You can't take your Postgres database, attach it to an email, and send it. But you can do this with an XML piece of data. This is why XML is called semi-structured data, and I have more to tell you about its semi-structured aspects. So I want to show you two things: how to map between relational data and XML, and then how to think about what else you can represent in XML. So there are two ways to map relational data to XML. One is what I call the canonical mapping, and it goes like this: you take every relation and transform it into an element, a big element, in which every row in that relation becomes a sub-element, and every attribute value, like John, becomes another sub-element, and the phone is another sub-element. So look at what happens to the schema. In the relational database, the schema is stored just once; it's actually not stored with the data itself, it's stored separately, in the database catalog. What happens to the schema in the XML file? It is stored along with the data. What is the consequence of this? What will happen to the size of the XML? It should be much bigger. But surprise, surprise, it's actually not bigger. In the early days of XML, people tried this experiment: they dumped a relational database into XML, and when they looked at the file, it was smaller. I mean, not much, but it was smaller, like by half. Why is that? Because there is a lot of overhead in the way relational databases store their data.
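The canonical mapping just described is mechanical enough to write down. A sketch under my own assumptions: rows are dictionaries, and I use a generic `row` tag for each tuple; the relation and attribute names are illustrative.

```python
# Canonical relational-to-XML mapping: one big element per relation,
# one sub-element per row, one sub-element per attribute value.
import xml.etree.ElementTree as ET

def canonical(relation_name, rows):
    rel = ET.Element(relation_name)
    for row in rows:
        row_el = ET.SubElement(rel, "row")
        for attr, value in row.items():
            ET.SubElement(row_el, attr).text = str(value)
    return rel

persons = [{"name": "John", "phone": 3634},
           {"name": "Sue", "phone": 6343}]
xml = ET.tostring(canonical("persons", persons), encoding="unicode")
```

Notice how the schema (the tags `persons`, `name`, `phone`) is repeated in every row: that is exactly the "schema interleaved with the data" point made above.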
Their blocks are usually only half full, because they want to support inserts very efficiently, so they keep the blocks half full. Yeah, there is a lot of redundancy: every single block needs to have a block header, and there is a lot of redundancy for organizing the data efficiently. That redundancy doesn't exist in the XML serialization. Another aspect is that you can compress XML data very well. Especially if you write a specialized compressor that takes its special structure into account, then you can compress it very, very well. Okay, so that was the canonical mapping. The second mapping is what I'm going to call the natural mapping. Look at this particular relational schema. We have orders and persons, and the person name in orders is a foreign key into name. So instead of representing each relation separately as a sub-element in XML, it's a much better idea to group the orders under each person. So we would have the first person, called John, and here are his orders, and then we have a second person, called Sue, and she has one order. Okay. And this is much more natural than representing the two relations separately, where you would then have to join them in the application. So this works well if this relationship is of what kind? What kind of relationship is this one here? It's one-to-many, or actually many-to-one, N-to-one. But if the relationship is many-to-many, then there is no natural mapping. Then you have to make compromises: you make one of them the parent and the other the child, or vice versa, and no matter how you look at it, you will have redundancy. So that's a disadvantage of XML data. Now let's look into this semi-structured business and see what is semi-structured about XML. Here is one way to think about semi-structuredness: you can have missing values. Here are two persons. Apparently they must have a name and a phone, but the second only has a name. Okay, no problem.
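The natural mapping, grouping the many side under the one side, amounts to doing the join once at export time. A sketch under my own assumptions; the persons/orders schema mirrors the slide, but the table and column names are made up.

```python
# Natural mapping for a many-to-one relationship: nest each person's
# orders under the person, instead of shipping two flat relations.
import xml.etree.ElementTree as ET

persons = [{"name": "John"}, {"name": "Sue"}]
orders = [{"person_name": "John", "product": "gizmo"},
          {"person_name": "John", "product": "gadget"},
          {"person_name": "Sue", "product": "gizmo"}]

root = ET.Element("data")
for p in persons:
    p_el = ET.SubElement(root, "person")
    ET.SubElement(p_el, "name").text = p["name"]
    for o in orders:                      # the join, done once at export
        if o["person_name"] == p["name"]:
            ET.SubElement(p_el, "order").text = o["product"]

john_orders = [o.text for o in root[0].findall("order")]
```

For a many-to-many relationship this trick breaks down: whichever side you nest under the other, shared children get duplicated, which is the redundancy mentioned above.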
Can you do the same thing in a relational database? Sure you can, right? You use nulls. We have the same flexibility in a relational database, but in XML it looks more elegant: there is no null, we simply omit the phone sub-element if there is no phone. Second characteristic of semi-structured data: now Mary has two phones. Easy to do in XML, you just have two phone sub-elements. How do you do this in a relational database? You are in big trouble, right? You need to restructure your data in a massive way, and now your applications need to be rewritten, because you need to split your tables and move the data around. It's easy to reorganize the data in the database; it's much harder to rewrite the applications. And then a last example: you can have different types in different objects. Name was a string, but now for this particular person the name is structured. That's okay in XML, but you can't do it in a relational database. Keep in mind that XML is not even in first normal form, let alone the other normal forms; XML has nothing to do with normalization. Okay. So we discussed the syntax, and I told you a little bit about the XML data model. The next thing I'm going to do is describe schemas for XML. I'm going to stick to the old type definition, which is called the document type definition, or DTD. As you will see, it's pretty arcane and inexpressive. But what replaced it, called XML Schema, is so complex that very few people use it; people still use DTDs just because they're simpler. So the DTD is like a post-facto schema. You already have your XML document, but now you'd like to describe what is in that document, completely separately from the document itself.
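All three kinds of irregularity just discussed, a missing phone, repeated phones, and a structured name, fit into one well-formed document; here is a sketch with made-up names, parsed with Python's stdlib:

```python
import xml.etree.ElementTree as ET

# One document showing all three irregularities:
# a missing phone, two phones, and a structured name.
doc = """
<people>
  <person><name>John</name><phone>1234</phone></person>
  <person><name>Joan</name></person>
  <person><name>Mary</name><phone>1111</phone><phone>2222</phone></person>
  <person><name><first>Sue</first><last>Lee</last></name></person>
</people>
"""
root = ET.fromstring(doc)
for person in root.findall("person"):
    print([child.tag for child in person])
```

Each person element simply has whatever children it has; no nulls and no table restructuring are needed.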
The document still exists without this document type definition, without the DTD. Okay. So what goes into the DTD? We need to describe the tags: what tags are permitted in the XML document? We need to say how they are nested, which tags contain which tags, maybe what attributes they carry. And we also need to say how they are ordered, if we care about that. Again, DTDs have been superseded by XML Schema, but we are not going to discuss XML Schema. Okay. So here is a very simple DTD. It defines an XML document whose root must be company, and it can have only the elements described right here: company elements, persons, ssn, name, and so on. Under each element you can have what is described on the right, and I'm going to show you in detail how to read that. So what do you do with this DTD? It is separate from the XML document. Here is the XML document, and the question to ask is whether this XML document conforms to that DTD. If it conforms, then we call it a valid XML document. Remember, we had another term, which was well-formed. What is a well-formed XML document? What was the definition? Yeah, the tags must match, and they must be well nested. Valid means something in addition: it also has to conform to a given DTD. Okay. So let me now describe the DTD in more detail. The only definition we are going to discuss is the element definition. Look at the syntax, arcane: an exclamation mark after a less-than sign, then the keyword ELEMENT, then the element name, and after that a description of what is allowed under that element. That description is not called a type, as you would expect; it's called the content model, for some arcane reason. So what can this content model be?
The most interesting content model is the so-called complex one, where we describe by a regular expression what we expect under this tag. A very simple one is text only; it's called PCDATA, which means a string. You use it when you insist that under this tag there can only be a text value. We can have EMPTY. We can have ANY, which is deceiving: it means any elements, as long as they are allowed by this DTD. And there is something weird called mixed content that I'm not going to discuss. What I'm going to show you next is the complex content model. It's actually very simple, and I'm going to show it to you using four examples. Look at the first one. It says that the content of name consists of a first name and a last name. This is a regular expression. What does it mean? What can we have under name? We must have a first name, and it must be followed by a last name. So here is a legal XML document that conforms to this DTD. Remember, the content model is just a regular expression. Let's look at the second one. Now we have first name with a question mark. What do you think that means? You don't need to have a first name, but you must have a last name. Can you have two first names? No. Can you have the last name first and then the first name? No, the order is given by the comma: comma means sequence, in this particular order. Next one: name comma phone star. What does it mean? You can have multiple phones, like here: a name followed by an arbitrary number of phones. Can it be zero phones? Yes. What would you write if you wanted at least one phone? You write a plus instead of a star, as you'd expect. And what does the last one do? You must have a name first, and after the name you have either a phone or an email.
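For reference, content models along the lines of those four examples could look like this in DTD syntax; the element names are illustrative, not necessarily the exact ones on the slides:

```
<!ELEMENT name    (first-name?, last-name)>   <!-- optional first name, then last name -->
<!ELEMENT person  (name, phone*)>             <!-- a name, then zero or more phones -->
<!ELEMENT contact (name, (phone | email)+)>   <!-- a name, then at least one phone or email -->
<!ELEMENT phone   (#PCDATA)>                  <!-- text-only content -->
```

The operators are exactly the regular-expression ones: comma for sequence, ? for optional, * for zero or more, + for one or more, | for choice.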
So that's how you read a regular expression in the content model of a DTD. Good. Let me go back right here. The DTDs are old and arcane, and you can criticize them to death, rightly so. The standards committee at the W3C tried to address that: they created a new committee that was supposed to design something better, called XML Schema. And what they came up with was something monstrous. Horrible. Something huge, and now it's standardized; it's out there, and industrial-strength products must implement XML Schema. But in practice very few people actually use it. We do not require XML Schema in this class, but if you need to read about it, I left in these lecture notes my presentation of XML Schema, which I tried to make as condensed as possible: about 20 slides, a pretty good overview. So I leave it here. What I'm going to do now is skip from slide 36 to slide 56, and here is where we continue our discussion. Before we move on to query languages for XML, I'd like to step back and discuss a little bit where XML is used as a data exchange format or as a data model. There are three applications I'd like to describe: data exchange, property lists, and evolving schemas. Data exchange goes like this. Two databases talk to each other, or more likely an application that gets data from one database talks to an application that gets data from another database, and they need to send a lot of data to each other. Maybe everything about a customer: all its orders, all its complaints, all its invoices, all its bills. The first application needs to package this and exchange it with the second application. This is a great application of XML.
You package that information in an XML document, send it over the wire, it's just a text file, and at the other end you read it, you parse it, you process it, you store it in your database. That's data exchange. The second application is actually quite interesting, and I was told about it by a friend of mine who used to work on extending SQL Server with XML, so this is a real usage scenario of XML inside SQL Server. It's used for property lists. Sometimes applications need to manage objects that have lots of potential attributes. Think about products. All products have, say, a name, a manufacturer, and a price. But depending on the product they might have very unusual attributes: voltage, or all sorts of chemical information if they are chemical products, who knows what attributes they might have. You might end up with a list of hundreds of potential attribute names, but only a few of them are used by any single product. It's much more convenient to store such data as XML than to massage it into a relation, right? In a relation you would have to enumerate all those few hundred attributes, and the vast majority would be null. With an XML representation you simply store it as a document. It's flat, but very flexible: you only include the elements that you need. So I find it a great application of XML. And the last is evolving schemas. Sometimes you need to develop an application where the schema evolves rapidly. The best example I have in mind is scientific databases. When scientists store their data, this is the main reason they often don't use a relational database system. It's not that they don't know how to use it; they are smart people. They often don't use it because next week they have another idea, and now they want the data completely restructured.
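A sketch of the property-list idea in Python's ElementTree; the products and their odd attributes are invented for the example:

```python
import xml.etree.ElementTree as ET

# Property-list idea: each product carries only the attributes it has,
# instead of a wide relational row that would be mostly NULL.
products = [
    {"name": "battery", "price": "9.99", "voltage": "1.5V"},
    {"name": "solvent", "price": "4.50", "flashPoint": "12C"},
]

catalog = ET.Element("catalog")
for record in products:
    product = ET.SubElement(catalog, "product")
    for key, value in record.items():
        # flat structure, but only the elements this product needs
        ET.SubElement(product, key).text = value

print(ET.tostring(catalog, encoding="unicode"))
```

A relational table for the same data would need a column for every attribute that any product might have; here each product element just lists its own.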
They want a completely different organization, and the relational database is too slow for that, too cumbersome to accommodate their quick evolution of the schema. XML is more flexible here, although XML has not fully lived up to its promise of being a very flexible data format. But there is a promise: it can accommodate quickly evolving schemas. So these are the main applications of XML. Now, how do we actually process XML data? There are two ways. One is directly in your application via some API. There is a standard API called DOM, which stands for Document Object Model, and it gives you an API that allows you to navigate through an XML document. At every moment you are looking at one node, and you can navigate to its children, to its parent, to its siblings, to its previous sibling; you can navigate in all possible directions. You can read its content; you can do all sorts of things. This is the Document Object Model. The problem with this way of processing XML data is that you need to have it in main memory: in order to access it, you need to read it into main memory. You can imagine smarter implementations of this API, but the standard implementation is to just read the XML document into main memory first, and then you have access to it. The other way is via query languages, and this is what we are going to discuss next. With query languages, we have at least the potential of doing all the nice optimizations that we know and love in databases; we could potentially apply them to these query languages as well. There are two query languages, XPath and XQuery, and that's what I'm going to start discussing next. I think I want to cover only XPath today; I feel that if we go too quickly, then it becomes blurry.
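A tiny DOM navigation sketch using Python's stdlib xml.dom.minidom; the bibliography document is made up:

```python
from xml.dom.minidom import parseString

# DOM reads the whole document into main memory, then lets you
# navigate in every direction: down, sideways, and back up.
dom = parseString("<bib><book><title>Foo</title><year>2004</year></book></bib>")

book = dom.documentElement.firstChild   # navigate down: the <book> element
title = book.firstChild                 # down again: <title>
print(title.tagName)                    # title
print(title.nextSibling.tagName)        # sibling navigation: year
print(title.parentNode.tagName)         # back up to the parent: book
```

Notice that parseString materializes the entire tree in memory before any navigation happens, which is exactly the limitation described above.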
I want to show you this very simple language whose sole purpose is to allow us to navigate through the XML tree. No joins, no group-bys. There is some limited form of aggregation, but it's very limited. I have several examples, and in the examples I'm going to use this data set. By the way, I tried all these examples on Zorba this afternoon on this particular data set, and they worked very nicely. The only problem I had was with cut and paste: if you cut and paste this data, you usually get into trouble with the apostrophes. For some reason they come out as some weird character, and then when you try to read the file as XML you get an error, so you have to go in and edit in the correct apostrophe. Okay, XPath. XPath thinks of XML as a tree, as we know and love. But it has a weird idiosyncrasy: it has two roots. The tree has the normal root, called the root element, but on top of the root element there is the so-called root. That's how XPath thinks about the XML document, and you'll see in a second why that's the case. XPath is modeled after the Unix directory structure. If you want to navigate down, you use slash. So here is the first XPath expression; this would be query one. It says navigate down to bib, then to book, then to year, and the result consists of all the years that XPath found in the document. Okay, notice this match. It seems trivial now, but later, when you get confused, come back to this slide and remember that the elements we got were exactly the last elements we navigated to. Of course, if you look for bib paper here, then we get the empty set, because there are no papers in that document. If we say slash bib, what do we get? What will you see on the screen if you ask the XPath interpreter, if you ask Zorba, to return slash bib? You see the whole thing, because the whole thing started with bib. That's what you see.
It always matches the element name. What happens if we just say slash? Sorry? Don't you also get everything? Yes, you also get everything, but behind the scenes it's different, because in the second case you actually get that root. In the first case you only got the root element, but you can't see the root on the screen. It only makes a difference when you compose XPath expressions. Okay. In XPath we have double slash. Double slash means navigate arbitrarily deep, but at least one step. So if we say slash slash author, then we get all the authors, and if we say slash bib slash slash first-name, then we get all the first names. It's very simple. Okay. Attributes. This is how you query attributes, with the @ sign; it's just a shorthand for give me the attribute. Interestingly, you can't print an attribute, because it doesn't have an existence in itself: an attribute can only exist as an attribute of an element. I tried to print one in Zorba, and I got an error that it couldn't print it. But if you use XQuery and embed it in some element, then you can print it. Any questions so far? It's very simple stuff. Okay. Wildcards. Star matches any element, but it doesn't match attributes, while @ star matches any attribute. Now, the question on my mind is whether star also matches text values. Does it match text nodes? It doesn't? Sorry? Oh, only elements. Thank you. So star is only for elements; if you want to match text values, there is a different wildcard, which is called text(). And there are other things: with node() you can match any node, including attributes, elements, and text. And then, for comparison purposes, you can also fetch the name of the current tag; ignore this for now. Okay. More about XPath: predicates.
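These navigation steps can be tried with Python's standard ElementTree, which implements a limited XPath subset; paths are relative to the element you call findall on, and // is written .// there. The bibliography document is made up:

```python
import xml.etree.ElementTree as ET

doc = """<bib>
  <book><title>Foo</title><author><first>Rick</first><last>Hull</last></author></book>
  <book><title>Bar</title><author><last>Ullman</last></author></book>
</bib>"""
bib = ET.fromstring(doc)

# '/bib/book/title' in XPath; ElementTree paths are relative to bib here.
print([t.text for t in bib.findall("book/title")])   # ['Foo', 'Bar']
# '//author' is written './/author': descend arbitrarily deep.
print(len(bib.findall(".//author")))                 # 2
# '*' matches any element, but not attributes or text nodes.
print([child.tag for child in bib.findall("*")])     # ['book', 'book']
```

As on the slides, the result always consists of the last elements navigated to, and a path that matches nothing, such as book/paper, returns the empty list.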
So here it gets slightly more interesting, and then that's about all of XPath. This XPath query you read as follows: you first ignore the square brackets and read only bib book author. Bib book author: return authors. But now you check the predicate, and the predicate says keep only those authors that have a first name. In our data we had multiple authors, but only one of them has a first name, so that is what you get in the answer to this query. Predicates are a way to further filter the data we are retrieving, in addition to the navigation. Here is a complicated one. How do we read it? You first ignore all the predicates, so you read it like this: bib book author last-name. What will it return? A bunch of last names; at least we understand that much. Okay. Now, which last names will survive this selection? Here we get more serious. Only those where, once we reach the author, that author has a first name, and it must also have an address, and that address must have a zip and a city. So this further filters the last names that you're returning. Okay. I describe this in some detail here: you first strip the entire expression of all the predicates and understand where it navigates; then you put the predicates back in and consider what additional filtering they do while navigating. Any questions so far? Yeah. I didn't explain dot yet, but I think I have it; dot means the current element. Let me get to that, because it's coming in the next slide or two. So, more predicates. You can have simple comparisons inside the predicates. Anything interesting here? This one checks if the price is less than 60. This one checks if the book has an author whose age is less than 25. By the way, what happens if we have a book in the second example that has several authors? Do all of them have to be under 25, or at least one of them?
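The first-name predicate can be sketched with ElementTree, which supports the [tag] form of predicate; again a made-up document:

```python
import xml.etree.ElementTree as ET

doc = """<bib>
  <book><author><first>Rick</first><last>Hull</last></author></book>
  <book><author><last>Ullman</last></author></book>
</bib>"""
bib = ET.fromstring(doc)

# 'bib/book/author[first]': navigate to authors, but keep only
# those that have a <first> child. The predicate filters; the
# result is still the author elements.
with_first = bib.findall("book/author[first]")
print(len(with_first))                    # 1
print(with_first[0].find("last").text)    # Hull
```

Reading it the way the slides suggest: strip the predicate to see that it navigates to authors, then put the predicate back to see which authors survive.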
At least one of them has to be under 25. XPath, like SQL, has existential semantics: every time we examine multiple variants in the data, at least one must be true. So this means at least one author must be under 25. What does the last expression do? It returns all books with what property? That they have at least one author whose content is text, as opposed to nested elements. Yeah. Can we have Boolean conditions here? Yes, and we can also combine them with and, not, or, and so on. We can also have some limited aggregate functions. And basically that's it. Okay, more XPath: axes, and here it gets more tricky. Dot means the current node. For example, if we navigate to bib book and then we say dot, this means continue from the book, and from the book you go deep and look for a review. Now, that's different from saying slash review. What would it mean if we said slash review? Yes? Yes, it means that you start over again from the root, which is completely different. So there is no other way to say look deep down from here, from the current node, than using dot. On the other hand, if you just use a single slash, then there is no need for the dot; you could simply say a book that has some review. And if you use dot between slashes, it's superfluous; you can simply drop it. Okay? Again, it's very simple. Double dot means navigate back to the parent. So what does this mean? Remember that every slash means go one step down. So if we reach this author and then say slash double dot, it seems we go one step down to a child and then one step back, and we are back at the same author, so it's like not doing anything; these two author steps refer to the same author. Here is a tricky one, and I actually tried it, and it worked as I wrote it here. This goes to an author, then checks that the author has a first name, then goes back up. Let me check whether it is consistent with the previous one.
It goes back up and then checks who has the last name; this would be the author, right? Sorry, there is no inconsistency, it's consistent with the previous one. So when we reach the first name, we have an author, we have a first name, and now we go back to the parent, back to the author, and from there we check for the last name. But this is the same as checking separately that the author has a first name and a last name. Which tells you that this navigation back to the parent is not really needed; you can program around it. It's only if you happen to have a pointer to a node that you really want to go back to the parent. If you start from the root anyway, there is no real need to go back to the parent. Yes? Are you asking about the second example or the first example? In the first example, are we saying go back to the parent of the author node? No, I was wrong, I described the first example incorrectly. The statement on the slide is correct, but I described it incorrectly. Let me go back to the first example. We reach book, we go down to an author, and now we say dot dot. So I said it incorrectly: slash dot dot means, from where you are, from this author, go to the parent. So what does the second author step do now? It chooses maybe the same author, maybe a sibling. This is equivalent; remember the homomorphism theorem? You can prove this through a homomorphism. Checking for two authors and keeping just one of them, since the two can be the same, is the same as checking for just one author. So this expression is equivalent to just navigating down to one single author, and to prove this formally, remember the homomorphism theorem. I described it incorrectly. So the rule, again: if you are at an element E and you say dot dot, it means go to the parent of E. You are not back at E; you are at its parent. Okay. So here is a summary of XPath, and this is actually the last slide I want to cover today. We have seen basic elements.
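The equivalence just argued, that navigating down and back up performs the same filtering as a predicate, can be checked with ElementTree, which supports a limited .. step in its paths; made-up data again:

```python
import xml.etree.ElementTree as ET

doc = """<bib>
  <book><author><first>Rick</first><last>Hull</last></author></book>
  <book><author><last>Ullman</last></author></book>
</bib>"""
bib = ET.fromstring(doc)

# './/first/..' navigates down to <first> and back up to its parent:
# exactly the authors that have a first name.
parents = bib.findall(".//first/..")
# The same filtering without the parent step, as a predicate.
via_predicate = bib.findall(".//author[first]")
print([p.tag for p in parents])               # ['author']
print(len(parents) == len(via_predicate))     # True
```

This is the "you can program around it" point: the parent step is only indispensable when you start from a pointer to some node rather than from the root.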
We have seen the wildcard that matches any element. The slash by itself means the root element; we can start navigating from the root. This is how we go from one element to its child. We can navigate deep down, and we can navigate deep down from the root. There is alternation: we can match either paper or book, a choice, like in a regular expression. And complicated qualifiers, or predicates. This simple core is what makes XPath fun; this is something they really got right. That was XPath 1.0. But what they did over the last few years is something horrible: they came up with XPath 2.0. I don't recommend that you go there; if you have to, try to avoid it. They essentially pushed inside XPath 2.0 everything that belongs in XQuery. They tried to let you do joins and all sorts of very complicated things by extending, somewhat unnecessarily, this simple abstraction of navigating up and down. We will not study XPath 2.0, and I don't have any slides for it. I never read that documentation, but I've seen several research papers that try, from a theoretical perspective, to make sense of it, and when I saw those, I decided it's not an interesting language. What we will do instead is study XQuery next week. I'm going to spend only about half an hour on it, and you have a very cute homework where you are asked to run, I think, ten XQuery queries on an XML document called Mondial, about geographical data of the world. Okay. Any comments, any questions? Then I'd like to stop here, and I'll see you next week.