Data analysis and decision support. As I told you earlier, this is a very high-level view, not getting into details. It is meant as an introduction so that you can go and read up more and maybe use tools in this area. Now, what are decision support systems? They are used to make business decisions based on the data which the online transaction processing systems have collected. So, you want to look for trends in order to predict what will happen. You want to look for patterns in your data. You want to look for segments of the market, for example, which have the least profitability or the most profitability. You may wish to tune insurance premiums based on this. You may wish to tune which category of people you advertise to based on this. Who is joining as a customer? You may want to look at the impact of ads after an ad campaign ran. What was the impact? So, there is a lot of information from the transaction processing system, plus external world information, which you can analyze to see the impact of various things and use to make business decisions.

Now, there are other people who generally make the business decisions, but you are the people who would probably write programs for them if this were in COBOL. For every analysis they need, they would have to come to you to write a program. The idea of decision support systems or data analysis systems is that you simplify this process. You do not actually have to write a program to get summary statistics; you just interact with an interface which gives you the statistics. Sort of like playing with a spreadsheet, but the data behind it is in a database; it is not the small amount of data that fits in a spreadsheet. So, in addition to sales data, you have customer profile data. This is very important for many people, to decide which customers to target and to see where their growth is coming from.

So, how do you do data analysis? Like I said, there are tools for this. There are report generation tools which can produce summaries in different forms. There are other tools which let you get different views of data, pivot data, much like you can play around with Excel and graphs. You get similar things, but with database data. For example, you want summaries like: for each product category in each region, what were the total sales in the last quarter and how do they compare with the same quarter last year? This is a very typical business question. How do you specify such a query and how do you execute it? You can have a specialized interface for specifying the query, but ultimately, if the data is in the database and it is a large volume of data, that system has two options. It can either load the data into memory and process it by itself, or it can issue an SQL query which the database executes. What SQL has done is provide certain add-on features targeted at such applications, so that they can issue a query on the database and not necessarily fetch all the data and process it themselves.
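To make this concrete, here is a sketch of how such a summary question might be posed in SQL, assuming a hypothetical sales table with columns category, region, quarter and amount; all these names are invented for illustration.

    -- total sales for each product category in each region, last quarter
    select category, region, sum(amount) as total_sales
    from sales
    where quarter = '2019-Q4'          -- placeholder for the last quarter
    group by category, region;

    -- the comparison figures: the same quarter of the previous year
    select category, region, sum(amount) as total_sales
    from sales
    where quarter = '2018-Q4'
    group by category, region;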
Then there is statistical analysis. And then there is a whole area called data mining, where you say: look for things which follow a certain pattern. A common kind of use could be: look at all this data I have on people who have committed insurance fraud and try to find patterns in it. Is there a pattern, maybe, that any claim which comes within three months of the policy being issued has a high probability of being a fraud? Now, this may be a conjecture that somebody has, but can you automatically look at the data and find patterns? There may be any number of patterns, a huge number of possible patterns. There can be idiotic patterns, like: if the name begins with S, then the probability of fraud is higher. Probably a wrong and idiotic pattern, but the point is that there are many possible patterns which could be predictors. So, what data mining tools do is search over a certain class of patterns, and they will come back and tell you: here is a way to predict the probability of fraud. Here are features which indicate that fraud is likely, and they will combine the features in some way and give a score. So, given a particular claim, they may give a score saying what the likelihood is that it is a fraud, so that you can then follow it up and pay more attention to those which look fraudulent and less to those which seem pretty normal. That is data mining. I am not getting into details here, but any organization has to use these tools at some point.

Again, some more terminology: online analytical processing is an area where you interactively analyze data, looking at different summaries of the data. Data analysts often do this. The query we had here, for each product category and each region what were the total sales in the last quarter and how do they compare, is a summary query. This is something the data analysts may come up with, and immediately after this query the analysts may issue a completely different query. Instead of product category, they may say product color, if it is clothing, or some other attribute. Now, each of these queries is potentially looking at a large amount of data and summarizing it. In what ways do you summarize it? We will see that in just a little bit, and then there is the issue of how to do this efficiently. So, there are two issues, and we will see both. This is called online analytical processing. Why online? It is online in the sense that you issue a query for data analysis which summarizes data, and the answer should come immediately. It should be an online experience. It should not be that the system tells you: well, I have to read a terabyte of data with my current disk, it will take me five hours, come back after five hours. You do not want that as an answer. How do you deal with it? We will see in a little bit.

So, you can have an arbitrary relational schema, but doing this kind of summary analysis on an arbitrary schema is difficult. What people do instead is use a very simple schema which turns out to be very powerful: even though the typical schema is simple, it is powerful. The reason is that a lot of analysis gets done on certain things called fact tables. What is a fact table? It is like sales, or the policies which were issued: individual items, small units, and there are many of them. Those are called fact tables. And then there are dimensions to each fact. So, if you sold an item, the measure attributes and the dimension attributes are different kinds of attributes. A measure is, for example, the number of units sold, or the cost of the item, or the price of the item. There are attributes like this which are measure attributes. And then there are dimensions: for example, the item name, color, size, the branch location, the age of the customer, the date.
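As an illustrative sketch (table and column names invented for this example), a fact table with its dimension and measure attributes might be declared like this:

    create table sales (
        item_name  varchar(30),    -- dimension
        color      varchar(20),    -- dimension
        item_size  varchar(10),    -- dimension ("size" is reserved in some systems)
        branch     varchar(30),    -- dimension
        cust_age   integer,        -- dimension
        sale_date  date,           -- dimension
        units_sold integer,        -- measure
        price      numeric(10,2)   -- measure
    );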
So, there are many different attributes which can form the dimension attributes. On a dimension attribute, you may aggregate; you may group by a dimension. We will see some examples. So, here are the sales, which have several dimensions. What are the dimensions shown here? Item name is one dimension, color is another. And there is a third dimension called size, which is shown at the top left corner with size saying all. So, we are not showing the values on that dimension. This two-dimensional display only allows us to see two of the dimensions at a time, and for each of the two dimensions shown, all possible values appear. This kind of display is called a cross-tab, also called a pivot table in Excel.

What is the difference between this and a regular relational table? In a regular relational table, you would probably model this differently. You would have four columns: item name, color, size and sales, where sales is, let us say, the number of units sold. Whereas here, the number of columns depends on how many colors there are. If there are three colors, there are three columns; if there are a hundred colors, there are a hundred columns. So, it is not a regular relational schema: the number of columns in this display depends on the data. It is not a fixed schema. A data analysis system will let you easily create a cross-tab like this. Excel allows you to do this: when you go to Excel reports, you can create a pivot table, which is basically this. And it lets you look at the data, eyeball it, and see if there is a pattern to it. You can obviously aggregate on different things and show different things here. So, that is how you look at data.

And here is a relational representation. This particular relational representation does not show the size, but item name, color and number are there. You will notice one more thing here: I had these totals. This is the total for each item, this is the sub-total for each color, and this is the grand total. Now, this data can also be represented relationally over here. See, it says skirt and these three combinations. Now, skirt with color all is 53; this is actually the sum of those three values, a pre-computed sum which is stored. Similarly, dress all is 35, all dark is 62 and all all is 164. So, what do we have in this table? We have the underlying data, and we also have a bunch of pre-computed data in the same table. This is basically what lets you get a quick response. If, on this table, I say give me the total sales by item name, never mind the color, then what do I have? I have item name skirt all 53, dress all 35, shirt all 49, and then all all 164. I just have to look up the pre-computed value. So, an OLAP system will pre-compute a lot of values like this in order to give a quick response. The pre-computation may take some time, but online, when you ask queries, it will give you a much faster response. That is the idea.
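A minimal sketch of that idea, again with invented names: a pre-computed summary table stores the ordinary groups plus rows whose dimension value is the literal 'all'.

    -- without pre-computation: aggregate the raw data on the fly
    select item_name, sum(units_sold)
    from sales
    group by item_name;

    -- with a hypothetical pre-computed table
    -- sales_summary(item_name, color, units_sold) that also holds the
    -- 'all' rows, the same answer becomes a cheap lookup
    select item_name, units_sold
    from sales_summary
    where color = 'all' and item_name <> 'all';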
A data cube is a multi-dimensional generalization of a cross-tab. It can have n dimensions in general, but that is very hard to visualize for large n, so we are just going from 2 to 3; you can picture logically how this extends. For 3 dimensions, what do we have? We have the size, the dimension we had hidden earlier: small, medium, large and all. Item name: skirt, dress, shirt, pant. And color: dark, pastel, white. But each of these also has an all: an all item and an all size. So, consider each cell of this cube. You cannot see the inner cells, but each block, including the hidden inner ones, has a value. Any cell which corresponds to all on some dimension is an aggregate; any cell which does not correspond to all on any dimension is an underlying raw value. All is a computed aggregate value.

Now, with 3 dimensions, how many combinations with all are there? If you have, for example, 3 different colors here, you can combine each of them with all on both of the other dimensions. So, looking at this column, the item name is all, the size is all, and we see 3 different combinations. In general, what is the size of this cube? If a dimension has n values, there are n plus 1 rows for it. So, the total number of cells in the cube is the product, over the dimensions, of the number of distinct values plus 1: for each dimension, add 1 for the all, and then multiply. And the value in each cell is the measure attribute. There can be multiple measures; we are just showing one for simplicity. Now, the underlying data may not actually have all combinations; otherwise the data would be very large. With a few dimensions, as here, the cube is small. But if you have more and more dimensions with more and more distinct values, many of the combinations may never occur in the actual database. If you pre-compute the whole cube, you have to pre-compute all the combinations, many of which may be empty, and you spend a lot of effort pre-computing them.

Anyway, the cube itself is there, but you cannot see the whole cube at a time. So, what you do is pivot, that is, select the dimensions to be shown in a cross-tab, in a pivot table. That is one way of looking at the data in a cube. You can slice, which is to create a cross-tab for a fixed value: if you take this cube and cut it horizontally at the pastel color, you are looking only at pastels, and for that slice you look at the other two dimensions. In general, you may have to cut on multiple dimensions to get down to two; that is called dicing. So, it is called slicing and dicing. You can do a roll-up, moving from finer granularity data to coarser granularity: for example, I may ignore the color details and just use the all value for color. Drill-down on color means I start looking at the individual color values again. So, an analyst may look at the overall picture and then say, let me see if there is any major difference by color. So far I ignored color and used all; now I drill down into color and see what is going on. That is called drill-down. Roll-up is the opposite: I am currently looking at the individual color values, and now I roll up, see all as the color, and look at some other attribute. These are the operations on the data cube which you use to visualize the data.
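As a rough illustration with the same invented sales table, these operations correspond to familiar SQL patterns:

    -- slice: fix color = 'pastel' and look at the other two dimensions
    select item_name, item_size, sum(units_sold)
    from sales
    where color = 'pastel'
    group by item_name, item_size;

    -- roll-up on color: ignore color, i.e. use its 'all' value
    select item_name, item_size, sum(units_sold)
    from sales
    group by item_name, item_size;

    -- drill down on color: bring the individual color values back
    select item_name, item_size, color, sum(units_sold)
    from sales
    group by item_name, item_size, color;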
Those are the basic OLAP operations, but there are also hierarchies in OLAP. For example, when you have sales, you actually have the date and time at which the item was sold, but you may want to analyze the sales by hour of the day: the company needs to know at what time of day the sales happened. That might help them decide how many employees to have in the morning shift, how many in the afternoon shift, and so on. There is also day of the week, month, quarter, year. What would those be useful for? Well, you may want to aggregate by quarter and see what the change was across quarters over the previous five years: in each of those years, look at the different quarters and then look at the change. Or you might aggregate up to the year and ask, across the five years, what was the change for the whole year. It is the same underlying data, but I am rolling it up on this dimension: instead of looking at the date and time, I am looking at just the quarter in which it occurred, or the year in which it occurred. So, I can aggregate the data on this hierarchy. Rolling up consists of going to a coarser granularity, towards year or day of the week, and drilling down looks at a finer granularity on the hierarchy. Another hierarchy here is location: city, state, country and region. In fact, within India you would probably look at city, state, region and then country, and if you are an international company, you may have an international region above that. What exactly it is depends on your organization.

Now, you can add hierarchies to a cross-tab. In this case, the items, which were skirt, dress, pant, shirt, have a hierarchy, which is category: this category is women's wear, that category is men's wear. So, in the same cross-tab, I have subtotals for women's wear and subtotals for men's wear. You can do this with Excel. In a data cube, you can have all of this pre-computed if required.

So, there are different implementations. The earliest ones were called multidimensional OLAP systems. They actually pre-computed the entire cube and stored it in an in-memory array. An in-memory array can be fairly compact if the number of dimensions is not too large. It actually works quite well, and those systems gave blindingly fast performance for these kinds of queries, so people started using them. Then people said, well, they have restrictions: at some point you run out of memory. If the number of dimensions, the number of distinct values, or the number of levels of hierarchy goes up, the cross product makes things blow up. Therefore, we will not pre-compute everything; we will go down to the database and fetch things as required. That is relational OLAP, but it is much slower. And then there were hybrids which pre-computed some things: if a value is available, great, use it from memory; otherwise, go to the database and fetch it. Even in the database you can pre-compute and store some things, so you fetch pre-computed results from the database if possible, and if that also fails, you get the raw data and aggregate on the fly. You can be fairly clever, pre-compute some amount of stuff, and then everything else becomes very fast. So, today, OLAP systems often implement a hybrid like this.

Now, to support such OLAP operations, I told you earlier that SQL has been extended. Among the extensions, in SQL you can say: select item name, color, size and the sum of the measure, and group by cube on item name, color, size. This means it will group by all combinations of these attributes: by (item name, color, size), by each pair, by each attribute alone, and by the empty grouping, which gives the overall total. So, group by cube creates all these combinations.
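Written out in full, the extension looks like this (a sketch on the invented sales table; the cube syntax comes from SQL:1999 and, as noted next, not every system supports it):

    select item_name, color, item_size, sum(units_sold)
    from sales
    group by cube (item_name, color, item_size);
    -- produces the union of the groupings (item_name, color, item_size),
    -- (item_name, color), (item_name, item_size), (color, item_size),
    -- (item_name), (color), (item_size) and the empty grouping ()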
If there are n attributes there, you will get 2 to the power n groupings. In fact, if you do a group by with hierarchies, it will be an even larger number of combinations. So, this is supported by SQL and some databases implement it. PostgreSQL does not, so do not try this on PostgreSQL, but some of the others do. This was introduced in SQL:1999, and SQL:2003 added other querying features: more grouping features, hierarchies, and so on.

Also ranking. What is ranking? If you have the marks of students in a class, ranking is deciding who is first, second, and so forth. Now, it turns out that doing that query in plain SQL is very, very painful. It is possible to express it, but it is very inefficient. So, if you really wanted to assign ranks to students, you would not have done it in SQL earlier; you would write a program which gets the sorted order and then steps through it, assigning ranks 1, 2, 3 in that order. It is very simple, actually, but it cannot be expressed efficiently in plain SQL. Now there are SQL extensions which let you generate ranks and so on. Again, I do not think PostgreSQL has any of these, but some others do. Then there is something called moving window aggregates. If you look at stock market data, it jumps up and down wildly. So, if you want to view it over a longer period to see trends, you may smooth it: take the average for every five-day period. Do not show the value on each day, but smooth it out using the previous two and the next two days, an average over a five-day window. That is called a moving average, and it gives a smoothed-out view of the data. You can now do that directly in SQL. So, there are a lot of features that have been added.
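Here are sketches of both features in SQL:2003 window-function syntax (table and column names invented; check your system's support before relying on them):

    -- ranking: assign each student a rank by marks
    select student_id, marks,
           rank() over (order by marks desc) as class_rank
    from results;

    -- a five-day moving average of a stock price
    select trade_date, price,
           avg(price) over (order by trade_date
                            rows between 2 preceding and 2 following)
               as smoothed_price
    from stock_prices;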
I will stop there with respect to the analysis features in SQL; there is a lot more, but I am going to stop here. Any questions related to data mining, decision support, OLAP? If there are no questions, we will move on to indexing and query processing. This is the last topic for today. Many of you have probably used indexes. How many of you have used indexing, have created indices or used them explicitly? All of you. How many of you know how indices are implemented? To some extent; some idea of how an index is implemented. And how many of you have heard of B-trees? Several of you, but not all. So, what I am going to do now is give a quick introduction to indexing and what goes on underneath. Far from enough detail to fully understand what is going on, but enough to get a good idea.

So, what is an index? It is just a structure which lets you map from some values to the records which contain those values, and to do the mapping efficiently. There is some terminology here. An index has a search key, the key on which the index is built. If you give a value for that search key, you can find the records containing that value reasonably efficiently. An index file consists of records with a search key and a pointer. What is a pointer? It could be a disk address, saying where on disk the record exists, or a file number and an offset within the file; in other words, a logical identifier of the record. So, here is an example of an index which has branch names and then a pointer to the records with that branch name. But actually, there is something interesting going on here. There is just one pointer per branch name, but there are multiple records with that branch name. When can you do this? Here there are duplicates: there are many records with Perryridge, but in the index I am storing just one pointer, not all of them. In general, if there are three Perryridge records, I should show three pointers in the index. If you take the index at the back of a book and take a word, it tells you all the pages that word appears on. But here I have stored just one pointer.

So, there are two possibilities. One is that the first record with that branch name contains a pointer to the next one, and so forth. This is typically done if the file itself is sorted. These are what are called index-sequential files: they are sequential, sorted on that search key. But there is an extra pointer here. This pointer, if you see, currently just points to the next record in the file. So, why do you need the pointer? It is basically to handle insertions. You first sort the file, so it is sorted. If you go to Perryridge in the index, it points to the first Perryridge record, and the remaining Perryridge records should be right after it in the file. However, once you do an insertion, there may not be space there, and you may store the new record somewhere else. In that case, the last Perryridge record will contain a pointer to the newly inserted Perryridge record, and that record's next pointer will come back to the following record, which in this case is Redwood. So, it is not exactly sequential anymore, but it is mostly sequential, with a few out-of-order records which the pointers handle. This is the typical structure of an index-sequential file, which you have all used; internally, this is what happens. The file is sequential, and then there is a separate index here which refers into it.

Now, this index should be much smaller than the file for it to be useful. The index at the back of a book is a few pages for a book with hundreds of pages. But that alone is not enough: even with a few pages, if we had to search sequentially through the index to find a word, it would take a long time. What actually helps is that the index is sorted, so I can quickly narrow in on the page I want. I do not have to read every word in there; I can look at the top of the page to see what range of words is on that page. And I do not have to go one page at a time: I can guess where the page is likely to be and skip a large number of pages. If I have gone too far to the right, I next look to the left; otherwise, I continue looking further to the right. This is what is called binary search, and it is quite efficient. So, the idea is that the index is small and sorted, and I can do a binary search on the index to find the pointer to the first record with a given value, then go to the file and fetch it. This is a very simple scheme. It is not a B-tree yet; it is what was implemented in the beginning. Of course, this works only if the file is sorted in this order. If it is not sorted, you can still do some things. So, the key points are: index files are typically much smaller than the original file, and searching in the index is much faster than searching in the original. These are the main properties, and the index is sorted on the search key. Another important property is this: index files are updated automatically when the relations are updated. The database takes care of this. You insert a new record, it goes into the index. You update a record and change the search key value, and the index gets updated correspondingly. So, index files are automatically kept up to date.
So, you can be confident that if you use the index to search for a record, you will get the right set of records. It will not be some stale data which is no longer current; that will not happen. So, indexes are very useful for queries. But they can potentially be an overhead for inserts and deletes. Why? Because you not only have to update the record, you have to go and update the index as well. However, some indexes are still very important for checking integrity constraints, even during updates. For example, to enforce primary key and foreign key constraints, you need to check whether a primary key value is already present, or whether a foreign key value is present in the other table, or, if you delete from that table, which records refer to the deleted one. You should be able to track these things, and an index can help you do this very fast. So, even inserts, deletes and updates can be sped up in many cases by an appropriate index.

Now, the most commonly used kind of index, which several people mentioned, is the B-tree. But there are others too: R-trees are used for spatial data, and bitmap indices for certain kinds of data warehousing workloads. So, here is an example of a B-tree. If you look at the bottom layer of the B-tree, called the leaf layer, it looks much like the index file we saw before. It has one entry for each key value and then a pointer to the first record with that key. Let us for the moment assume that the file here is sorted on the same attribute; if it is not, the structure will be a little different, but for now assume that. So, for each value here, there is a corresponding pointer. The pointer is actually to the left: for Brighton, you follow the pointer on its left to get to the Brighton record; for Downtown, you follow the pointer on its left and go here. What do I mean by left? If you are not very familiar with pointers, this is just a record, like a COBOL record which contains multiple fields. This field is a disk address or a file offset, an integer value. This is a string. This is again an integer offset, and this is again a string; they just alternate inside the node.

So, this is the leaf level. Now, to search the leaf level, if it were all contiguous in memory, you could use binary search, but the leaf level might be quite big. If you have 30 lakh records, even the leaf level of a B-tree can be quite substantial, and you do not want to keep it all in memory. So, the idea is that since the leaf level itself is sorted, why not build an index on it? That is what is called a sparse index: it will not contain every key, only a subset of them. In this case, there is Mianus here and Redwood here, only two values, whereas down here there are six values. And here, at the next level up, there is just one value. So, if I come in and search for Brighton, I compare it with Perryridge. Is it less or more? It is less, so I follow the left pointer: I look up the integer offset there, go to that offset in the index file, and look at the node there. In that node, there is only one value, Mianus. I compare with that; it is again less, so I go to its left and land up here. Now, I find Brighton there, and I am done. If I search for, let us say, Perryridge, I will find it here.
So, then I take the right pointer from that node. Compared to Redwood, it is less, so I take the left pointer, and I am home. So, I can essentially walk down the tree to find what I want. Each time I walk down a level of the tree, I may have to fetch a page from disk, and reading data from disk takes, even today, on the order of 5 to 10 milliseconds. That means in a second maybe 100 to 200 random fetches can be done from disk. So, if the tree is very tall, there will be more disk fetches and querying will be slow. If you are familiar with binary trees, they are essentially tall trees, whereas B-trees are very fat. Although the examples here show nodes with 2 or 3 values, in reality each node would have hundreds of values, really fat nodes, and the tree is very short: typically only 3 or 4 levels. I have shown 3 levels here. Even with very large data it is typically 3 or 4 levels, but the nodes are very fat, so there are lots of children. Say the top level has 100 children; the next level has 10,000; the next level has 1 million; the next level has 100 million. So, with 4 or 5 levels you have reached 100 million records, which is pretty good, and the next level would hold 100 times that, which is bigger than any database I know of. So, these trees are very short. That is the idea of a B-tree: you can quickly walk down the tree and find information with a few disk accesses. You do not have to search through the entire file.

I assumed the file is sorted in the same order as the index, but if it is not, it is not a big deal. All we have to do is keep multiple pointers: corresponding to Brighton, I store 10 pointers, one for each of the 10 Brighton records, if there are that many. Here there is just one. For Downtown, there are two, so I store two pointers, one for each of the Downtown records. It is not very difficult to do; we will not worry about the details. So, I can build an index on any file. However, there is a small difference. An index on the attribute on which the file is sorted is called a primary or clustered index, whereas an index on some other attribute, different from the sort order of the file, is called a secondary index. There are performance differences between the two. Secondary indices are slower for accessing data, because each time you have to follow a new pointer. With a primary index, you follow a pointer and immediately you have many records, potentially all with the same search key value, so less IO is required. Of course, once the relation's order is chosen, you can have essentially one primary index; the rest will all be secondary.

So, how do you tell the database to create such indices? It is not actually part of standard SQL, but every database implements such extensions, which look like this: create index, then an index name, on a relation name and a list of attributes. It generally builds a B-tree, but in certain cases you can tell it what type of index to build. So, that creates an index, and you can drop the index later if you want.
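For example, a sketch, with the index name invented and the table and attribute from the running bank example:

    create index branch_idx on account (branch_name);

    -- and later, if it is no longer useful:
    drop index branch_idx;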
In the create index statement, you give the list of attributes. Now, which indices should you create, and what indices are there by default? The answer is: the moment you declare a primary key for a relation, pretty much every database will create an index on the primary key. Why? Because it has to check for uniqueness. You cannot have two records with the same primary key value, and how does the database know whether there is another record with the same value? It needs an index to do that efficiently; otherwise, it has to scan the whole table. So, the moment you declare a primary key, an index is created. Extra indices you create because you have queries which need to search for records with specific search key values. If you find some operation is very slow on a large table, even though it fetches only a few records, then you should probably be creating an index. In fact, there are tools which will help you decide what indices to create; I will talk a little about them at the end of the session.

The syntax is exactly the same whether the index is primary or secondary: that is not in the syntax, it depends on the sort order of the underlying relation. And how do you control that sort order? That again depends on the database; there are different ways of doing it. In PostgreSQL, you can tell it to cluster a table on an index, and it sorts the table on the attributes of that index. It does not actually keep the table sorted: if you insert, it may again go out of order, so you have to periodically re-cluster.
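A sketch of the PostgreSQL flavour of this (index and table names as in the bank example): you create an index first, then cluster the table using it.

    create index account_branch_idx on account (branch_name);
    cluster account using account_branch_idx;
    -- the table is rewritten in index order; later inserts may again go
    -- out of order, so the command has to be re-run periodically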
Other databases these days have what is called a B+-tree file organization. So far, I distinguished between the index and the actual file containing the records. A B+-tree file organization stores the records right there at the bottom level of the B+-tree; there is no separate file containing the records. The index and the file are combined. Many databases support this these days: PostgreSQL does not, but Oracle, DB2 and SQL Server do. That avoids some of the problems we just saw. You can specify: use this file organization for this relation on these attributes. Then the relation is sorted on those attributes and stored as part of a B+-tree on those attributes. A question from the audience: assuming a table has been created with a primary key, can secondary indices be added subsequently? Yes, you can. You can create any number of indices on whatever attributes you want; there may be some system-defined limit, but it is quite large. They can be on different attributes. Only one of them can follow the sort order of the relation; the others are not in sort order, they are all secondary. And how do they perform? With the primary index, the sort order matches, so once you reach the leaf, fetching is faster. With a secondary index, the order is different. If there is only one record with the search key value, it really does not matter. But if there are many records with that value, they may be scattered all over the file, so you may have to do a lot of IOs to fetch them; the efficiency will definitely be lower. Okay.

So, now let us move on to query processing. When you submit an SQL query to a database, what happens? The database somehow magically executes it and gives you a result. But internally, something more is happening, and it is the following. The query comes in. It is first parsed and translated into some form of relational algebra; it may not be exactly the relational algebra version you saw, but some variant of it. Then there is an optimizer which runs on this and generates an execution plan for the query. The execution plan is what is actually executed. Now, there are many different ways to execute a query. So, let us look at this; I will come back to the previous slide. Take this simple query: you are joining account and depositor on the account number field, and then checking that the customer name is Fartuck and the branch name is IIT Powai. Let us look at a few strategies for evaluating this query.

One strategy: use an index on depositor to find customers with the name Fartuck; you find a few of them. Then find the accounts that are at IIT Powai using the branch name. So, you have two subsets. Then join them, using one of the many join algorithms, on the account number condition, and you get the final result. That is one option. Another: first use an index on depositor to find the records with customer name Fartuck, then use an index on the account number of the account table to find all of Fartuck's accounts. This is maybe how you would do it in COBOL: you have indexed files, so you fetch from the file using the index. Then, among the accounts you retrieved, check whether the account is at IIT Powai; if so, output it, otherwise throw it away. And there are many more strategies. But just take these two alternatives: what is their relative cost? It depends. Suppose the name is not Fartuck, which is uncommon, but a very common name, say Shah. You get lots and lots of records with Shah. If you fetch all of them and then check which are at IIT Powai, you can waste a lot of time. In contrast, if the IIT Powai branch has fewer customers than there are Shahs, the best option might be to go the other way: find all the accounts at IIT Powai, then use depositor to find which of them belong to Shah. That may be more efficient. And the first strategy may be the most efficient in a third, intermediate situation, where there is a reasonable number of Shahs and a reasonable number of accounts at IIT Powai. So, depending on the underlying data, each of these strategies may be the best. In fact, the difference between a good strategy and a bad strategy can be enormous: a good strategy might finish in a few seconds, while a bad one runs for hours on exactly the same query and the same data.

In COBOL, you depend on the programmer to come up with a reasonable strategy, and generally programmers are good at this if they know what indices exist. But if they do not, they come up with some strategy and do not realize there is a much better one because a certain index is there. Or tomorrow, if an index is dropped, this beautifully written code which was working very well suddenly becomes very inefficient, and the programmer has to go and rewrite it. That is a lot of effort. With SQL, you do not do any of this; it is all up to the optimizer. The optimizer gets statistics about the data from the database, some of them pre-computed and stored. It looks at a variety of execution plans and, using the statistics, estimates what the cost of each alternative will be. Then it picks the one with the least estimated cost.
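As a concrete sketch, the query under discussion could be written as below, and PostgreSQL-style systems let you inspect the plan the optimizer picked with explain (table and column names follow the running example):

    select customer_name
    from depositor natural join account
    where customer_name = 'Fartuck'
      and branch_name = 'IIT Powai';

    -- prefix with explain to see the chosen plan without running the query
    explain
    select customer_name
    from depositor natural join account
    where customer_name = 'Fartuck'
      and branch_name = 'IIT Powai';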
These are estimates; they may be wrong, but they are estimates. So, the optimizer finds the plan which it believes is the cheapest and executes it by handing it to the evaluation engine. The evaluation engine gets the actual data from the database, executes the plan that came from the optimizer, and gives you the output. This is basically what happens, and this component is what frees you from worrying about how to evaluate a query. You can blithely write a query in SQL, never even thinking about how it is executed, and still things work out fine.

A question from the audience: does the evaluation engine give any feedback to the optimizer? That is an excellent question; you should be a database researcher. It turns out that most databases today do not actually do a whole lot, except that in some limited situations, if the evaluation engine finds that the optimizer thought a table like account had 100 rows when it actually had 100,000, it might go back and update the statistics. There are some systems which are a little more clever, but many are very dumb: they do not do any of this. The engine does not talk back; it is completely one-way, and it is all up to you to update the statistics. In fact, in PostgreSQL this can happen: if you just load the data, the statistics may be completely wrong, and it can produce a horrible query plan which runs very badly. So, you have to tell PostgreSQL to update the statistics, and the command for that is called analyze. Oracle has a similar command, also called analyze; other databases have different commands. You run analyze and it recomputes the statistics. In particular, if you make a big change to the database, load a new relation, or add a lot of records to an existing relation, you should run analyze. Other databases, like SQL Server, try to do a better job: during execution, if they find something is very wrong, they automatically run their equivalent of analyze without even telling you. The query that exposed the wrong plan may run slow that one time, but the statistics get updated, and next time it will improve.
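In PostgreSQL, the command just mentioned is simply this (Oracle's analogous command has its own syntax):

    analyze;            -- recompute statistics for the whole database
    analyze depositor;  -- or for just one table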
Query optimization is the process of finding the lowest-cost plan among all the equivalent evaluation plans. What do I mean by equivalent? They should all return the same result, the query which I asked in the first place. Among the plans which give the same result, find the cheapest one. And how is the cost estimated? You cannot actually run the query to find the cost; you have to estimate it. The cost estimation is based on things like the number of tuples in each relation, the size of the tuples, and the number of tuples expected to satisfy the where clause condition on each relation, which in turn can be estimated in different ways. You may keep some statistics called histograms, which let you estimate the likely number of tuples satisfying a particular condition. So, in COBOL, you decide the plan: you decide to first access this file, then do index-sequential access on the next file, and so forth. In SQL, your job is greatly simplified, but you still have to do some things. You still have to figure out what indices may be useful for the typical queries you get, and create those indices. It may not be you as a programmer; it may be the database administrator who does this. So, how do you make this choice? How do you decide what indices to create?

That may not be an easy thing for a human to do, because a human does not even know exactly what the optimizer is doing. You can look at a query and say, okay, I believe this index may be useful for this query; create the index, run the query, and see if it became faster. If it made no difference, well, that guess did not work out, so try something else. That is the traditional way: somebody looks at the queries which run slowly in particular, and sees whether an index can be created which will speed up the very slow queries without seriously affecting the other updates going on in the system. Now, there are index tuning wizards for many database systems which automate this process. The first step for the index tuning wizard is to know what queries ran on the database system. How does it know this? You can turn on something which logs all the queries that are executed. You turn it on for some time, say a day, and log all the queries that ran that day. Or, if any 15 minutes of the day is typical of the whole day, run it for 15 minutes. Either way, you have a log of all the queries and updates that ran. The index tuning wizard takes this input, looks at various options for indexes to create, and figures out which indexes are worth creating to give the best speed for the overall workload it saw, all the queries and updates. It figures out the impact of each possible index: how much will it speed up the queries, how much will it slow down or speed up the updates. Then it picks a good set of indices, not just one. It may tell you: create these indices. And then you can say, okay, I accept the recommendation, go ahead and create them. Or you may say, no, I think you are a stupid fool, but I will look at your recommendation, pick and choose from it, and create indices myself. You have that control. PostgreSQL does not have this, but Oracle, SQL Server and DB2 all have a tuning package. They all started off with index tuning; the other kind of tuning, related to materialized views, I will come to shortly.

So, the last part of the talk looks more at optimization techniques, and we are going to look at two specific things: nested subqueries, and a concept called materialized views. Let us start with nested subqueries. Here is an example we have seen before: select customer name from borrower where there exists a depositor record with the same customer name. What does this query give you? It gives you borrowers for whom there is a record in depositor with the same name, that is, customers who have both a loan and an account. And how many times does each customer name appear in the result? It appears as many times as that customer has loans, that is, as many times as there are borrower records with that name. That is the result of this query.
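Written out, the query is:

    select customer_name
    from borrower
    where exists (select *
                  from depositor
                  where depositor.customer_name = borrower.customer_name);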
Note that borrower.customer_name actually comes from the outer query; it is a variable from the outer query, and such variables are called correlation variables. Now, conceptually, the way this is executed is that you take each borrower record and then execute the subquery. When you execute the subquery, borrower.customer_name is fixed, so you are executing a subquery with a fixed customer name. And how do you find depositor records with that customer name? If you have an index on customer name, you could use it; otherwise, you scan the relation to find records with that name. And you do this once for every record in borrower. If you actually execute it this way, it can be very, very slow. If you have an index, there is still the cost of looking up the index each time. If you do not have an index, you are going to scan sequentially through the entire depositor relation, which will be very slow: if you have a million borrowers and a million depositors, the time taken will be roughly a million times a million. This is horrible. So, you do not want to execute the query this way; you want to do it differently. Logically, this is what the query means, and you should get the same result, but you can execute it in a different way.

So, what way can you execute it? There are a lot of techniques, but to use them you first have to turn this nested query into a different form which uses a join. So, what does the optimizer often do? First, some terminology: correlated evaluation is what I just described, where the nested subquery is executed once for each tuple in the cross product generated by the outer from clause. Here there is just one table in the outer from clause, so the nested query is executed once for each of its tuples. As I said, correlated evaluation can be quite inefficient, since you make a lot of calls to the nested query, with a lot of unnecessary IO. If you do not have an index, you are totally doomed; even if you have one, maybe each index lookup takes 10 milliseconds, and with 10 million outer tuples, each taking 10 milliseconds, you are going to take 100,000 seconds. That is a lot. But a good implementation of this query might be able to do it in a few seconds, in a different way.

How could you write this query differently? Suppose you do not care about the number of duplicates. Forget duplicates: how can you rewrite this query? If you were a programmer, how would you write it differently? You can use a join, or an intersect. Now, how does the database process an intersect operation? It turns out there is a very efficient way of doing it: you sort the two relations and then step through them in synchronization. Some of you might have written such code. How many of you have written code which merges two relations by stepping through them in sync? You have done a merge of two sorted relations, a very standard thing, and a very efficient one too, because you are just reading them sequentially. A relation with 10 million tuples, each of 100 bytes, is about 1 gigabyte. You can read 1 gigabyte in maybe 70 or 80 seconds, or even less today, and that merge will take literally a minute or two, as opposed to several hours the other way. So, if the database realizes that what you want is an intersection, or a join on some attribute, or a union with duplicate elimination, or an except, it can use a merge algorithm like this. In case you have not used the merge algorithm, let me explain it for a minute. You have two tables for which you want the intersection; sort them. Now, compare the first two records: if they are equal, the value is in the intersection, so output it. If not, find which one is smaller and move to the next record in that relation. So, suppose you have a, b, c here and a, c, d here. First, a and a: a match, so it is output. Now you move on; you have b and c. Which is smaller? b is smaller.
So, you move past b to c. Now c and c match, so output it and move on. If you have duplicates, it is a little more complicated, but merging can be done quite efficiently. So, when you give a query with a join, the database generally has two options. One is to use an index to do the join: if you are joining two relations r and s, for each tuple of r it uses an index to fetch the matching tuples of s, or the other way around. The other is to sort r and s and do this merge. So, merge is used even for joins. This is what is going on inside, and these methods can be very efficient. On the other hand, a nested query, if evaluated in a correlated way, can be very, very inefficient. So, it is very important to convert a correlated nested query into something which looks more like a join, so that it can be implemented efficiently.

Now, who does this? Most SQL optimizers do try to transform nested queries into joins internally; pretty much every SQL optimizer will try. How well it does, and how complex a query it can handle, differs. For the one we saw, you could turn it into a join or an intersect. As I said, we are ignoring the issue of duplicates; it can be handled, and there are standard ways of handling it. But asking the programmer to do this is quite difficult: for the programmer to preserve the correct number of duplicates while rewriting the query is troublesome, whereas for the database it is easy. So, it should really be left to the database. Unfortunately, some nested queries which are more complex, with aggregation and so forth, are difficult to turn into joins. Many optimizers give up on these: they simply do not do it. In particular, PostgreSQL can be very bad here. For simple queries it can take a nested query, un-nest it and run it efficiently; the moment the nested query becomes more complex, PostgreSQL does a very bad job. It falls back to correlated evaluation, which is very slow. We have seen this happen. In contrast, Oracle is better, and SQL Server is better still: it can evaluate pretty complex subqueries very efficiently. So, if you use subqueries, beware of this. Which database you run on can make a very significant difference to the performance of the subquery. If the performance turns out to be bad, you may have to rewrite the query in ways like this, but carefully. Most of the time, the issue of duplicates is somewhat artificial: usually you do not want any duplicates. In that case the rewritten query is equally good: just add a select distinct. Make the original query a select distinct, make this one a select distinct, and you are in business. So, the translation is not too hard to do manually. If you find the system's performance is very bad on a nested query, see if you can translate it to a non-nested query.
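If duplicates do not matter, the rewrite might look like this (a sketch; both forms return each qualifying name once):

    -- as an intersection
    select customer_name from borrower
    intersect
    select customer_name from depositor;

    -- or as a join with duplicate elimination
    select distinct b.customer_name
    from borrower b
    join depositor d on b.customer_name = d.customer_name;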
Here it is easy; there are other cases where it is quite difficult. In fact, people have written papers on how to do this translation automatically, and the papers had mistakes: they did not take some SQL quirks into account. So, you should be careful about these. Anyway, the process of replacing a nested query with a join, possibly with a temporary relation, is called decorrelation. And as we saw, it is more complicated if there is aggregation, if NOT EXISTS is the linking clause, and so forth.

Now, we will move to the last topic of the day, which is materialized views. A materialized view is a view whose contents are computed and stored, which means you can access them immediately. You do not have to recompute the view every time you use it. These are most useful when the view is an aggregate view: to compute the view, you have to scan a lot of tuples, but the aggregate result itself is much smaller. So, if you take the total policy amount by branch of your LIC, the original policy table is huge, but the number of branches is, what, a few thousand. That is very small. So, the aggregate result is small, and if you materialize and store it, you can use it in your queries instead of aggregating over the entire raw data every time. That is called a materialized view. So, here is a view: create view branch_total_loan(branch_name, total_loan) as select the branch name and the sum of amounts from loan, grouped by branch name. This is just a view; we can tell the database to materialize it, so that when you use it, it uses the stored tuples.

Now, what are the potential problems with materialization? The biggest problem is that the moment you make any change to the underlying loan relation, the view is out of date. If you use the view now, you get old data, which can cause all sorts of problems. So, the key point is that every time loan changes, you must update the materialized view. How can you do this? Luckily, you, the programmer, do not have to handle it all yourself: SQL already provides a mechanism, namely triggers. You can use triggers so that whenever loan is updated, you go and update the stored view, updating the corresponding tuple in it. There are some tricky cases. If you add a loan at a brand-new branch, the update you do is different from adding a loan at an existing branch; and if you delete the last loan at a branch, so that there are no more loans there, that branch should vanish from the view entirely. So, handling every case is not completely trivial, but if you do not care about special cases like these, it is not so hard: every time you add or remove a loan, just update the value there using a trigger.

This task of keeping a view up to date is called materialized view maintenance. A naive way of maintaining it is to recompute: every time there is a change, compute the entire view result, running the query all over again. That is obviously very expensive if the data is large. What you want is incremental view maintenance, and this can be done using triggers. Every time a loan is added, just update the total loan amount for that branch; you do not have to touch the entries for any other branch, you just update one tuple in the materialized result. If a loan is deleted, you decrement the total. If the loan amount is updated, you need to know the old and the new value: if it went from 10,000 to 20,000, you take the difference and add it to the computed aggregate. All of these cases have to be handled. It is not too hard to figure out what to do; however, it still takes work to write these triggers, and you can make mistakes, so be careful. It works, and it works reasonably efficiently. But you still have to use the view explicitly. If you want the efficiency, you have to write your queries against the materialized view. If instead you write a query which exactly matches the view definition, select branch name and the sum of amounts from loan grouped by branch name, the system will run the whole query on loan all over again. So, you should know which materialized views exist and use them, rather than running the original query. It is all up to you.
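As a sketch of both pieces, using the example's names (the exact materialized view and trigger syntax varies by system, and the new-branch and last-loan-deleted cases are deliberately not handled here):

    create materialized view branch_total_loan as
    select branch_name, sum(amount) as total_loan
    from loan
    group by branch_name;

    -- hand-written incremental maintenance, for inserts only
    create trigger loan_insert after insert on loan
    referencing new row as nrow
    for each row
        update branch_total_loan
        set total_loan = total_loan + nrow.amount
        where branch_name = nrow.branch_name;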
Another issue is when the update is done. It can be done immediately, with triggers. Or sometimes people say: I do not need the very latest value, it is okay if I see the branch balance as of yesterday; so what if 5 loans were added today, I do not care. In that case, you may be willing to do periodic recomputation: every evening, the view is refreshed or recomputed from scratch. Oracle supports such policies.

Now, there are two further issues. One: wouldn't it be nice if the following happened. You submit a query: select star from r, s, t where r.b = s.b and s.c = t.c. This is a join. And a materialized view v is available, defined as select star from r, s where r.b = s.b. Wouldn't it be nice if the database system realized that v is available and joined v with t? That query is equivalent to this one, given that v is materialized. Actually, it is not clear which of the two is cheaper; it depends on the indices and on the query. But wouldn't it be nice if the optimizer knew that this materialized view is available and considered it as an option? Think of it as another query plan for the query; well, not a plan exactly, but from it you can derive a plan. So, the optimizer can choose the cheapest option. PostgreSQL will not do this today, but SQL Server and Oracle do: if you create a materialized view and just run the original query, they will consider this option and use it if it is cheaper. That is one direction. Sometimes the converse is also useful. If you write a query like select star from v where a = 10, and v does not have an index on a, you would have to scan all of v. It may be faster to replace v by r join s and then run the query. So, they will do this too. Regardless of how you write it, they will figure out the best way of running the query. Again, PostgreSQL does not even have materialized views, so all this is irrelevant for PostgreSQL; but for Oracle, DB2 and SQL Server, it is relevant.
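Spelled out as a sketch, assuming schemas r(a, b), s(b, c) and t(c, d), with the column lists glossed over as in the example:

    -- the materialized view
    create materialized view v as
    select r.a, r.b, s.c
    from r, s
    where r.b = s.b;

    -- the query as submitted
    select * from r, s, t where r.b = s.b and s.c = t.c;

    -- the rewriting the optimizer may consider instead
    select * from v, t where v.c = t.c;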
So, the last slide for today concerns how to use these materialized views. What views should I select to materialize? This is not an easy question to answer. You need to know what queries run. If there are a lot of aggregate queries, you may want to pre-compute the aggregates for them. But to do that, you need to know what is going on in the database, and you, the programmer, may not have a very good idea of that. This is actually pretty much the same problem as index selection: what is the best set of indices to create? An index is conceptually much like a materialized view: it gives you fast access to something. A materialized view gives you fast access to the result of a query; an index gives you fast access to the tuples satisfying some condition. So, the materialized view selection problem is similar to index selection, but harder; index selection is much simpler. Today, there are tuning wizards which will also suggest materialized views. They take your workload, which we talked about before, look at it, and suggest both indices and materialized views to create. Running such a wizard on a database is one of the steps in database tuning. There are other things you can do to speed up processing, but this is an important part of it.

There are constraints under which the wizard runs. There are sometimes space constraints, if your materialized views are too big. Then there are constraints on the time taken for certain critical updates and queries in particular. If you create lots of materialized views, you pay a price: every time one of the underlying relations is updated, the views have to be maintained. So, you may tell the wizard: look, my customer transactions should run in less than half a second, that is the limit; if you come up with a set of materialized views which slows them down beyond that, forget it, it is not acceptable to me. The tuning wizard can then come up with a set of recommendations subject to constraints like this. So, a database administrator who is knowledgeable about the application and the workload can run these tools to optimize things. Many of you may not end up doing database administration, but I assume at least some of you will be involved in the tuning process eventually. This matters when you have large centralized databases. For a small branch database, none of this may matter; everything may work perfectly fine. But the moment you go to a large centralized database, everything becomes critical. In fact, if you tune a system well, you can sometimes improve its performance so much that you completely avoid buying very expensive hardware. So, the monetary impact of tuning can be tremendous if you do it right. There is a lot more on tuning in the book, in one of the later chapters, I think chapter 21 or somewhere there, with a discussion of other kinds of tuning. So, if you ever need to do tuning, you can find information there.