First of all, many of them call themselves distributed databases. As a database person, I would shudder to call them databases, because to me a database should have quite a bit of functionality, including at least some integrity constraints and a query language, and these systems have nothing like that. But they do allow you to store and retrieve data, so I would call them distributed data stores, and others call them that too. So what are these distributed data stores, and why do we need them? Relational databases have obviously been used for a long time, and they can handle a good deal of load. There was a time in the early 90s, before the web came about, when famous computer scientists such as Stonebraker - I do not know if you have heard of him; he is well known for building the first relational database outside of IBM, long back, and he has been CTO of various companies - so he has had a lot of foresight, but he has also really messed up on some of his predictions. Even the smartest people can mess up. He went around saying that we have reached the limit of the number of transactions per second that will ever be required from a database: stop wasting your time on transaction processing, let us move on to other, more interesting topics. He said this just before the web. Soon after he said this, web systems came up, and a scale which was unimaginable before the web suddenly became normal.

So suppose you have a web application which uses a database and there is a spike in the load. Why would there be a spike? Let us say there is an Olympics website: for a few days during the Olympics it has a huge load. What about other things? Say you have a newspaper website: if some major event happens, everybody is reading the news, and there is a sudden spike. Now what has this got to do with transaction processing? It turns out all these websites today are not static; they are actually customizing everything. If you log in, you see something different from what somebody else sees. So you get customized views, and all of this requires a lot of access to the database: they need to read a lot of data from it.

The first approach to dealing with this kind of scale was to say, let us have many application servers to handle the load - hundreds or thousands of application servers - but keep just one database. Now if it is a newspaper database, the application servers read a lot from the database but do not update it much. So it turns out you can use a caching mechanism: each application server caches some data from the database, updates the cache periodically, and then uses its local cache to do whatever processing is required. So even though there is just one central database, which may itself be a modestly parallel machine, you can have thousands of application servers and things work. This is enough for certain applications, but if the data set is really big, putting it in one database is not feasible; if it is petabytes, it becomes a big problem. What if you have a very high transaction rate? If it is just reads, you can cache, so that most of the reads do not even go to the database - they are handled directly at the application server cache - and such caching mechanisms are in fact widely used today. But certain applications hit the limits even so. For those, you cannot have a single machine, even a 10-way parallel machine, running your database. You really need high degrees of parallelism.
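To make the idea concrete, here is a minimal sketch of such a per-application-server cache, assuming a simple time-to-live refresh policy. All names here are illustrative, not any particular product's API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// A minimal sketch of the per-application-server cache described above:
// reads are served from a local copy and refreshed only when a time-to-live
// expires, so most reads never reach the central database.
public class ReadThroughCache<K, V> {
    private static class Entry<V> {
        final V value;
        final long loadedAt;
        Entry(V value, long loadedAt) { this.value = value; this.loadedAt = loadedAt; }
    }

    private final Map<K, Entry<V>> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loadFromDatabase; // the only path that hits the DB
    private final long ttlMillis;

    public ReadThroughCache(Function<K, V> loadFromDatabase, long ttlMillis) {
        this.loadFromDatabase = loadFromDatabase;
        this.ttlMillis = ttlMillis;
    }

    public V get(K key) {
        Entry<V> e = cache.get(key);
        if (e == null || System.currentTimeMillis() - e.loadedAt > ttlMillis) {
            // Cache miss or stale entry: go to the central database once,
            // then serve this key locally until the TTL expires again.
            V fresh = loadFromDatabase.apply(key);
            e = new Entry<>(fresh, System.currentTimeMillis());
            cache.put(key, e);
        }
        return e.value;
    }
}
```

With a TTL of, say, a minute, an article read a thousand times a minute costs the central database one read; that is exactly why this works for mostly-read workloads and breaks down under heavy updates.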
So what about using a parallel database, say from Teradata? Well, these are expensive, so people wanted to build their own. In fact, many of these distributed data stores are open source projects, because there are many websites which want a cheap solution; they do not want to pay a lot. Moreover, most of these parallel databases were motivated by data analysis - by large queries, not by update load. So you cannot just buy a Teradata database and say, now I will handle update rates of 1000 updates per second. It will run into trouble; it won't handle it.

The next idea is, instead of caching, you keep full copies of the database - a master-slave kind of architecture. Again, this works if there are only a few updates and mostly reads. So instead of the cache being just an in-memory cache, it can be a complete database copy. But again, this cannot scale if you have a lot of updates, because a lot of update data has to be sent to the slaves. It can be used - I don't know of anyone actually using it in a large-scale system, but potentially it can be used.

Now, the approach which works and is widely used, if you can't afford to buy a parallel database, is: build your own, in some sense. The basic idea is you partition your data amongst many different machines, and each machine runs its own copy of the database. If you bought a parallel database from Aster or Teradata, you could submit your query at one place and the query would be routed or parallelized across all the machines. But (a) like I said, those are tuned for decision support and large queries, and (b) they are expensive. So instead, what many of these websites did is the following. They have many machines, each running a database, typically MySQL or PostgreSQL. When a piece of data arrives, the application decides - there is a policy which says - where that data should be stored. The application uses this policy and stores the data in one of those hundreds of copies of the database. So the data is now partitioned across all of them, but it's not transparent: if the application wants to store data, it has to know where to go and store it; if it wants to retrieve data, it has to know where to go and retrieve it. One way is to have a hash function, computed on the key, which decides where the data goes and where to retrieve it from. There are problems with this. What if you need to add a new server? How do you update the hash function? It can be done, but there are a lot of tricky details. So you can grow these systems, but it's not transparent: the application has to deal with all of this, which puts more burden on the programmer. You can of course build a library which hides the details of the partitioning; that is, in effect, a cheap parallel database. But the most important limitation is this: each database can run transactions locally if you want, but there is no question of running transactions across databases, and no question of doing joins across databases, because each database doesn't know about the existence of the others. The only way to combine data from two databases is to send requests to both, get the results back, and have the application do the combining. So there are a lot of limitations with this kind of setup.
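As a concrete illustration, here is a minimal sketch, under assumed names, of the hash-based routing policy just described, together with the resharding weakness it suffers from.

```java
import java.util.List;

// A minimal sketch of application-level sharding: a fixed hash function maps
// each key to one of N database servers. The shard list would hold whatever
// JDBC URLs the application actually connects to; all names are illustrative.
public class ShardRouter {
    private final List<String> shardUrls; // e.g. JDBC URLs of N MySQL/PostgreSQL instances

    public ShardRouter(List<String> shardUrls) {
        this.shardUrls = shardUrls;
    }

    // The policy: key -> shard. Note the weakness discussed above: if a
    // server is added, shardUrls.size() changes and almost every key now
    // maps to a different shard, so data must be migrated, or the function
    // made cleverer (consistent hashing is one standard fix).
    public String shardFor(String key) {
        int h = key.hashCode();
        int index = Math.floorMod(h, shardUrls.size());
        return shardUrls.get(index);
    }
}
```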
So what happened after that is people said, let us build a system which is completely transparent, which scales enormously, and which also gives features that relational databases don't give. In particular, they are going to have flexible schemas. It turns out that if you are going to partition like this, certain kinds of normalization are not feasible; they would be too inefficient. So instead of normalizing, you use somewhat denormalized schemas, where updates are rare, and for the data you store at each machine it actually makes sense to use something which is not in first normal form - it's more efficient that way. Relational databases will not normally allow that. There are object-relational extensions, but they are not very good at this point. So people said, let us store data with flexible schemas across many machines: no fixed schema, and nothing like joins. That's why I call them data stores, not databases. Moreover, there are no ACID properties. You cannot do two updates and assume that the system will make them atomic: you do an update in partition A, you do an update in partition B, and if there is a failure, A may get committed and B may not. Tough luck; the application has to deal with it.

So essentially what happened is there was a need for scale, and another need for availability - I will tell you more about that - which meant that just using multiple databases was not good enough, so people built special-purpose systems. Then somebody coined the term NoSQL, and many people have gone around saying SQL sucks: throw away SQL, come use our wonderful new technology which is not based on SQL. Some people sell it that way. But more sober heads say this is nonsense. Obviously SQL has a lot of powerful features and is very widely used; how can you claim that your puny little NoSQL system can ever depose SQL? What saner heads point out is that these systems are never going to depose SQL; they are a supplement. There are certain applications which don't require all the SQL functionality, as of now, but need certain other functionality urgently. So these systems have sacrificed some things to gain other properties. And in the long run, many of the systems which called themselves NoSQL are slowly adding SQL-ish features back. They may not quite add full SQL, but at least some of them are moving a little back towards it.

So let's get down to - well, there is a little bit more on why now. These systems have kind of exploded on the scene; there are so many projects which are now widely used. The reason is all these web-based applications: in the past 10 years, there have been so many new web-based applications which were not there 10 years back. Facebook is new. LinkedIn - it's been called the Facebook for the over-40 crowd, the professionals. Then all these web mail systems offer features which were not there a while ago. And there are so many other sites on the web, all of which have tremendous data demands which did not exist 10 years ago. All of these need a storage system. Then there is a cloud-based solution from Amazon called S3, which stands for Simple Storage Service. It offers something like an enormous file system which you can grow and shrink on demand, and you don't set up anything: you pay Amazon a fee and they maintain it for you; they set it up for you.
It's already set up; you just start using it. So all of this meant that people wanted to build applications which can scale easily, even if they give up SQL. And the solution to all of these data problems is what I call a distributed key-value data store. What these do is very simple in terms of functionality: you have a key, you have a value. That value is not just an integer; it's typically a big thing. Certain systems treat it as a binary object which they don't look inside at all. Certain other systems say that the value is really a structured object of some kind, and they can look inside it and do things to it because it's structured. How is the structure defined? It's done in different ways; I'll come to it.

So there are many such systems. Google's Bigtable was the big one which started all of this off. What happened is that the Google File System came first. When the Google File System became well known, everybody started building their own clone of it. Meanwhile, Google moved on and realized that for storing pieces of data which are not really files, storing them as files is crazy: you end up creating way too many files. The file system is the wrong abstraction. Even for storing a web crawl, if you store a billion different pages in a billion different files, the file system's overhead for file metadata blows up. So they soon realized this is a ridiculous way of doing things. What they needed was a way of storing pieces of data where the per-piece overhead is very small. A page retrieved by the web crawl is a value, and there is a key, which could be the URL of that page, and they want to store the pair somewhere. Initially they stored pages in GFS, maybe with multiple pages clumped together in one file, and other hacks. They soon realized this doesn't make sense: they want to store either the original document, or metadata produced by analyzing the document as it's crawled, all together as a key and value. So that model became very popular. It is widely used in Google, and others have cloned the same thing: Yahoo, Amazon, everybody.

So what do these systems give? First of all, there's a single interface: you call a library function to store a value, retrieve a value, plus maybe some other functionality. Where is the value stored? How is it retrieved? All of that is transparent to the user. And they guarantee very high availability: they do replication and so on, so that the system is up all the time and always gives you data. You could build your own version of this from your application by what is called sharding. Sharding is the term used when you run your own set of MySQL or PostgreSQL instances on 100 machines and you partition the data and retrieve it yourself. If you do that, you are in charge of dealing with failures: what if this copy fails? How do I deal with it? These systems deal with all of that for you. In fact, it's interesting that at least one of them, Yahoo's Sherpa/PNUTS, at least in its current version, runs on top of MySQL. They may change that, but they are actually storing the data in MySQL. So what they are offering is something like sharding, but not quite the same thing; it's a simpler interface. Whether you shard or use a key-value store, what you lose is joins, except within a single partition - they offer some limited forms. No referential integrity across partitions. No primary key constraints across partitions, unless they are on the partitioning key, and so forth.
So there are a lot of limitations. How do you access data in all of these? There is usually a simple API. You can give a key and say, get the value stored with that key. So if I had stored a crawled page, the key is the URL; I want to retrieve the page, I give the URL, and I get the page data back. Of course, it could be more complex. If I had stored, let's say, a list of all pages containing a word, the key could be the word and the value a list of IDs of documents containing that word. Having saved that earlier, I can now do a get with the word, and it retrieves the corresponding list of document IDs. How do you update? You do a put with a key and value, which stores the value. The value can be big - it need not be a few bytes; it can be megabytes or even bigger - and it can be structured in certain cases. Delete removes the key and the value. Many of these systems also provide a way to execute a particular operation on a given key with some parameters, so you don't have to completely replace the value: you can update the value in place, and you can get parts of the value back by executing operations rather than fetching the whole thing. So that's a typical API. It's very, very simple.

Well, of course, that's not quite all; there are a few more things. For example, Bigtable can do a get with a key range: you give a range of keys, and it returns all the key-value pairs where the key is within that range. It turns out that this is a very powerful primitive, based on how they construct the key - they actually construct a kind of hierarchical key. I won't get into details, but it's a really powerful primitive and you can do a lot of cool stuff with it.

What about the data which is stored? I said that it can have a flexible data model. Here's one example, which I think is from a talk on Cassandra, but Bigtable from Google has a similar model. First of all, the value associated with a key can have multiple columns, and the columns are flexible: not every key needs to have every column. Moreover, the columns are grouped into column families. In this case the column family is called Rockets. A column family associates certain columns with a key; you can have another column family which associates something else. You can think of a column family roughly as a relation. Here is a relation called Rockets: with the key 1, there are four attributes - name, toon, inventory quantity and brakes - and their values are given. Note that instead of having four separate positional columns, each attribute is actually listed as a name-value pair. This is where the flexible schema comes in. Now if you look at the last row, with key 3, it has name, toon, inventory quantity and wheels; it does not have brakes. So the last row has a slightly different schema from the rows with keys 1 and 2. The schema is flexible; it is basically stored as name-value pairs. Of course, some optimization is possible: instead of storing the whole attribute name, a system may store an integer ID or something. There are various things which can be done, but the bottom line is that the data model is quite flexible. You can even have set values and nested values and so on, and the system can handle it. In fact, many of them go one step further - Bigtable, for example, also has sets and so forth.
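To summarize the interface, here is a sketch of what the client-side API of such a store typically looks like. The names are illustrative, not any particular product's API; real systems (Bigtable, PNUTS, Cassandra and so on) differ in detail, and the range scan exists only in ordered-key stores like Bigtable.

```java
import java.util.Map;

// A sketch of the typical key-value store API described above.
public interface KeyValueStore {
    byte[] get(String key);              // fetch the whole value for a key
    void put(String key, byte[] value);  // store or replace the value
    void delete(String key);             // remove the key and its value

    // Run a named operation on the value in place, so the client need not
    // fetch and rewrite megabytes just to change one field.
    byte[] execute(String key, String operation, byte[] params);

    // Only meaningful when keys are stored in sorted order: return all
    // key-value pairs with startKey <= key < endKey.
    Map<String, byte[]> scan(String startKey, String endKey);
}
```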
What some of the other systems, like Sherpa/PNUTS from Yahoo, do is say: look, the value which you store can be in JSON. What is JSON? All of you know JavaScript, or have at least heard of it. JSON is basically a way to represent objects in JavaScript. JavaScript is an object-oriented scripting language, and its objects are very flexible: they have flexible schemas. That is a very nice property which exactly matches what these data stores want. So what many of them started doing is saying: we will store JavaScript objects. We are not going to create a new data model; the data model is the object model that JavaScript uses. And that is very convenient, because if you are fetching some data and then shipping it to a web browser, and it is already in JavaScript object form, it is very easy to ship it straight to the browser. So there are some benefits, and that approach is increasingly used now. MongoDB is another system which does this. Now, I mentioned something about ordered keys, which enable some features. Some of these systems do not have ordered keys - they compute a hash value - but Bigtable uses ordered keys. It uses something like a mega B+-tree. It is not quite a B+-tree; there are some very cool implementation alternatives which they have worked out. But conceptually you can think of it as a mega B+-tree spanning tens of thousands of machines. Practically, it is quite different.

Let us look at the architecture of one of these systems, PNUTS. If you remember the HDFS architecture diagram from the morning, this looks kind of similar, but there are some differences. First of all, there are all these servers called tablet servers. What is a tablet? Not the medicine you swallow; rather, it is the diminutive, as in cigar and cigarette: a tablet is a little table, a piece of a table. What do we mean by a piece of a table? The table is partitioned into subsets of rows: some of the rows are in one tablet, some are in another. Why do you partition a table? Well, a table in Bigtable can be enormous - it can be petabytes - so obviously it cannot sit on one machine. You have to break up the table, and the unit for breaking it up is called a tablet. This looks very much like the chunks in HDFS. There the chunks were 128 megabytes - in fact that magic number 128 megabytes is kind of common - and the tablets in PNUTS are of similar size, many megabytes.

So when a request comes in to insert a row, there is a router which has a partitioning function; there are multiple routers, in fact. How do you partition - that is, which rows go to which tablet? Since the table is split, given a key value I need to know which tablet should store that row. That information is maintained by what is called the master or tablet controller, which has a table giving this mapping of key values to tablets. It keeps the master copy, and a replica of it sits in each of the routers. When a request comes in, it can go to any one of the routers; there can be hundreds of them. The router looks at the key value of the request; whether it is a get or a put does not matter.
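A minimal sketch of the lookup such a router might perform is below. The two maps are replicas of the tablet controller's state, and all names here are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// A sketch of the two-level mapping a PNUTS-style router holds: an interval
// map from key ranges to tablets, and a map from tablets to the servers
// currently holding them.
public class Router {
    // First level: a tablet covers the key range starting at its start key;
    // assumes some tablet covers the smallest possible key.
    private final NavigableMap<String, String> tabletByStartKey = new TreeMap<>();
    // Second level: which tablet server currently holds each tablet.
    private final Map<String, String> serverByTablet = new HashMap<>();

    public String serverFor(String key) {
        String tablet = tabletByStartKey.floorEntry(key).getValue(); // key -> tablet
        return serverByTablet.get(tablet);                           // tablet -> server
    }

    // Moving a tablet to a new server only touches the second map, which is
    // one reason tablets can migrate while the system keeps running.
    public void moveTablet(String tablet, String newServer) {
        serverByTablet.put(tablet, newServer);
    }
}
```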
Based on the key value, the router knows which tablet has that key and which tablet server has that tablet - it is actually a two-level mapping: key to tablet, tablet to server. So the router figures that out and sends the request to the tablet server, and the tablet server looks up the key, stores the value, or runs the operation. If you have more data, you add a few more tablet servers and move some tablets from overloaded servers onto the new ones. The system never shuts down while this is going on; there are some cool mechanisms to make sure a request will not fail just because some data moved - it is transparently re-executed at the right place. So a lot of cool machinery goes in which lets you add or remove servers as required. If a server fails, well, obviously replicas are maintained, so a replica can take over. All of that is built in. So this is basically what PNUTS does: it is a distributed data store which can run in a whole data center with thousands of nodes, highly available, fault tolerant, and so forth.

Now let us come to the issue of transactions and ACID properties. There are two issues here. One part of it is that the moment you have multiple machines, there is no such thing as a single log file, so our whole recovery mechanism, which was based on a single log for all transactions, doesn't work in a parallel system. If you use a parallel database, it works slightly differently: each machine has its own log, but the concurrency control still has to be global. If you really want full database properties, there has to be, in effect, a single concurrency control regime for the whole system. It turns out that for many of these systems, adding concurrency control in a way which spans multiple nodes has a very, very high overhead which they don't want to pay. One reason they don't want to pay it is: what happens if there is a network partition? If somebody in the other partition holds a lock, you may have a replica of the data, but maybe you can't update that replica because somebody you can't reach has a lock on it. I am maybe not being very precise here, but basically, doing concurrency control across tens of thousands of machines is not very scalable, and as a result these systems don't provide concurrency control or transactions. There is a way to do transactions across machines - two-phase commit and three-phase commit; the distributed database chapter of the book has material on this, read it if you want - but again, they view that as too heavy for their target applications.

Most of the applications these systems target want very, very high availability but don't care so much about consistency. Now, what do we mean by consistency? There are two notions. One is the database view of consistency: satisfying integrity constraints. These systems don't have any integrity constraints, so that is not even relevant. What is relevant is that they have replicas; they must have replicas for availability. Consistency here means all the replicas of a data item have the same value. Suppose you update one replica and don't update the other two: they are inconsistent. That is the consistency property here. The availability property says that even if parts of the system have failed, the system can continue running.
That is high availability. Partition tolerance is not exactly a property but a fact of life: in any large distributed system, the network can break into parts which are both working but cannot talk to each other. This is called a network partition. Both sides are running; they just can't communicate. You can't prevent it: a router or a switch may fail, splitting the system into two parts which can't talk to each other. For a long time people had suspected the consequence of this, but Eric Brewer made it kind of semi-official by stating the so-called CAP theorem, which says that between consistency, availability and partition tolerance, you can have only two of the three. Of course, partitions are not something you can control - they will happen - so a better way to look at it is: in any large-scale distributed system you can have either availability or consistency, not both. When you design your system, pick one of the two.

What does this mean? A database chooses consistency; that is required for traditional database applications. It means that if there is a partition, the system cannot work: it will block, at least for some parts of the data, and a user will be told, sorry, your data is not available. But for many businesses this is a disaster. If they ever tell users, sorry, our system is down, they lose business and people are unhappy. What they would prefer is to paper over the fact that there is a problem and let the user continue to do something. Maybe a very careful user would notice that something is wrong, but most users will not, and availability is guaranteed at the cost of consistency. So it may be that if a user accessed their email account from India and from the US at the same time, they would see different sets of emails, because the replica in India and the replica in the US have different copies of the mailbox. Consistency is gone - but users rarely do this; they usually look from only one place. This is the kind of trick systems play to give very high availability.

So all these distributed data stores have replicas, and if something goes wrong, they allow the replicas to go out of sync. They will even let you read a replica which is out of sync, and they will even let you update it. So what has happened? Two updates have happened in parallel on the two copies of a data item. What is the data item? Let us say my mailbox: one mail is delivered on this side of the partition, one mail is delivered on the other side. Now, when the network joins back, the copies have to be brought in sync. How do you do that? That turns out to be application specific. If it is a mailbox, the mails which came in on the two sides are simply both added. But what if you deleted a mail on one side? That delete has to be applied on the other. So there are various application-level steps which are done to bring the replicas back in sync; that is not done by the underlying system, which throws the problem back at the application. So: availability has to be there in the face of failures, we cannot afford very expensive IBM mainframes or Sun servers and so on, and we have to live with network failures and keep processing in spite of them.
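As one concrete illustration of such application-level reconciliation, here is a sketch of the mailbox merge just described, under the assumption that each replica also tracks which mails were deleted on its side (a tombstone set). This is illustrative, not any system's actual code.

```java
import java.util.HashSet;
import java.util.Set;

// A sketch of application-specific reconciliation for a mailbox replicated
// on two sides of a network partition: a mail delivered on either side is
// kept, and a delete on either side wins.
public class MailboxMerge {
    public static Set<String> merge(Set<String> mailsOnA, Set<String> deletedOnA,
                                    Set<String> mailsOnB, Set<String> deletedOnB) {
        Set<String> merged = new HashSet<>(mailsOnA);
        merged.addAll(mailsOnB);       // a mail delivered anywhere is kept
        merged.removeAll(deletedOnA);  // a delete anywhere is applied everywhere
        merged.removeAll(deletedOnB);
        return merged;
    }
}
```

The point is that only the application knows that "union the additions, apply the deletes" is the right merge rule for a mailbox; for a bank balance or a shopping cart the right rule would be different, which is why the underlying store cannot do this for you.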
So, what these data stores provide is something called an eventual consistency model, which means: right now the copies may be inconsistent, but when the network joins up again, the system will at least detect that something is wrong. In some cases you can specify a policy to restore consistency; in other cases the system calls a user-defined function which deals with it. And eventually, if no further updates happen to an item, all its copies will come back in sync. That is eventual consistency. Now, for this kind of system, which favors availability over consistency, people had some fun. They said ACID is what databases want; all of you remember your chemistry - what is the opposite of an acid? A base. So they had fun cooking up an acronym, BASE: Basically Available, Soft state, Eventual consistency. It is slightly contrived. Why "basically" available? Because it is highly available - anyway, it is available. Soft state means copies may be inconsistent, and eventually they become consistent. Many systems follow this principle now.

So, if you use these data stores: they are cheap, many are open source, and they have a lot of replication. Nodes can fail and be replaced, and the system takes care of all of that. Most of them have no single point of failure, so they are very resilient. They do not require a schema; in other words, as your application evolves, you can keep changing the schema without doing much, because they are already prepared to deal with flexible schemas. You add a field - you do not have to change anything. You drop a field for a particular record - you do not have to do anything. So those are the nice things. What they do not provide are joins and group-by. They do realize that people need these features, so there are some neat tricks: PNUTS uses materialized views to effectively pre-compute joins. It does not compute them on the fly - that is too expensive - but if the join is pre-computed, you can read the join result. No ACID, of course. No SQL, and so forth.

So, coming back: should you go and start using one of these distributed data stores slash NoSQL databases? The answer is that there are certain applications where they make a lot of sense: when you have very, very large semi-structured data to deal with, with high update or insert rates. And by the way, you can link these up to your MapReduce systems. Hadoop, for example, has adapters which link it to PNUTS or to other distributed data stores, so instead of Hadoop getting data from the file system, it gets data from the data store and puts results back into the data store. That makes a lot of sense: if the data is not naturally a file, do not try to shoehorn it into a file; use this instead. So it makes sense for companies running large websites or doing lots of analysis on very large data, and there are many such companies, even in India today, which are using these technologies. But if you are running a database for your college or university, even a medium-sized one like IIT, these make absolutely no sense. Do not even touch them. Stick to normal relational databases, because you get all the other functionality which you lose if you move to these. So what I am saying, in other words, is: there is a lot of buzz, a lot of people have heard of this, and you want to know what all this stuff is.
And I have told you what it is, but I am also telling you that this is probably not what most of you want to be using. You need to keep track of what is happening, but that does not mean you go and start using it tomorrow. You can, however, play around with it as a learning exercise for you and your students: these are open source, you can download them, try them out, and run MapReduce on them. So there are things your students can do projects on to learn about these new technologies, and some of them will definitely go on to companies which actually use them. It may not be practical for your immediate needs, but for student projects at least, this is a very good idea. So by all means encourage more advanced students to play around with these. I am going to stop here. There is a slide on further reading; I will not get into it, you can read it offline. So let us go back to questions from users. Please raise your flag if you have a question. Right now I am going over to Valchanth.

Sir, in the morning session we discussed the Hadoop architecture. For that one, I would like to know: is that a distributed kind of database architecture? In that, how does the data actually get distributed, and how is it retrieved? How is it different from other distributed databases? Over to you, sir.

The question was: we saw Hadoop in the morning; is it a distributed database, and how can it be distinguished from other distributed databases? Well, first of all, Hadoop is not a database; it is a MapReduce system. HDFS is the file system, a distributed file system. And there are a few data storage systems built on top of HDFS which can be used from Hadoop. One is called HBase. The other one, PNUTS slash Sherpa - it has two names - I am not sure it is open source yet, but they plan to make it open source; that is from Yahoo and is used inside Yahoo. So you would use one of these: HBase, or one of the others such as Cassandra, which implements a Bigtable-like model, and so on. Those are the distributed data stores; Hadoop is just the MapReduce part.

So, if distributed data stores store the data, how do you run queries on them? You cannot use SQL. How do you run queries in parallel on a distributed data store? That is where MapReduce comes in; these two go hand in hand. You first store the data in a distributed data store using Bigtable - sorry, HBase - Cassandra, PNUTS or one of the others. Then you use a MapReduce infrastructure to query the data if you want to do complex queries. If your only goal is to store and fetch data, then the get and put API is fine. The get/put API is fine when you store or fetch one record at a time. So if you have a website where, when a user logs in, you fetch the user's profile and display it, that is a get operation; the user updates the profile and stores it, that is a put operation. These are small operations - for one user, the amount of data is small - and for that, the distributed data store has everything you need. But on the same distributed data store, if you want to do a big operation - you want to compute PageRank, you want to do some data analysis - you cannot do that by individual gets and puts. That is not parallel; you would have to start up processes on many servers yourself.
That does not make sense; what makes sense is to use Hadoop or another MapReduce infrastructure. And in fact there are adapters. Hadoop is MapReduce, and how does it communicate with HDFS or with other stores? There are adapters by which your map or reduce functions can fetch data from HDFS or Cassandra or any other file system or distributed data store, and then run the map or reduce logic. I hope that answers your question in part. The other part was: is it a distributed database? It is a kind of homegrown distributed database. It does have data storage. It does have querying - not using SQL, but using the MapReduce framework or the get/put framework. Is it a full-fledged database in the traditional sense? No. It does not have SQL, it does not have ACID, it does not have many things, but it provides very, very high availability and very, very high scalability, which the traditional databases did not provide. So yes, you can certainly think of it as a distributed database with some bare-bones interfaces which require programming and are not very declarative. But yes, it is a distributed database. Over to you if you have any questions. Just one second. Over to you. Okay. Thank you, sir. Over to you. Okay. Good.

Anna has its flag up; let us go back to Anna University. Are there any recent research topics regarding distributed databases, or any relational algebra expressions which relate purely to research?

Okay, the question is: are there any interesting research topics related to what we have covered? Related to relational algebra, probably not. But related to distributed data storage systems, there are certainly some very interesting issues. The one slight problem I have with it is that these are systems areas; some of the theory underlying them was worked out a while ago, so a lot of the research today here is systems research. So you need access to decently parallel hardware. Maybe you can build your own with the computers sitting in your lab, as long as they are not very slow old machines; if they are decent computers, you can build your own distributed data storage and analysis framework and then do systems work on it. In terms of research areas, there are some more detailed topics which I had some students look at - I am not currently doing anything on this. One of them was how to build a system on top of an infrastructure which does not provide ACID properties, only a data store with the BASE properties. The BASE properties are fine for many parts of an application, but there are other parts which cannot live with them; they need some degree of consistency. So, is it possible to build a layer on top of these systems which gives you the consistency you need, for exactly those parts of the application which need it? That is a very high-level statement of the problem. If you take a specific application, you can analyze which parts of it need high availability and which parts need consistency, and then give each part what it needs by building a layer. There has been some work from a research group in Germany on building a database on top of Amazon's infrastructure - a database which actually supports ACID properties where you require them and doesn't where you don't.
So, that is an interesting direction, and I think there is more work which can be done on it. For that matter, these systems don't do SQL, but people like SQL and there are tools built around it. So can you build an SQL interpreter on top of these systems, so that they form the data store but SQL continues to be the interface? Building that, and figuring out how to optimize queries in this environment - there is a lot of potential scope here. Mike Carey gave a talk at the VLDB conference in September, and one of the things he noted is that all of these NoSQL systems have given us a brand new execution engine, quite different from the traditional centralized or modestly parallel execution engines we are familiar with in the database world. Now it is up to the database people to build all the traditional layers of the database which go on top - to figure out how to provide those in the context of these distributed data stores slash MapReduce systems. That is an interesting challenge, and I think there is research to be done there. I hope that answered your question. Over to you.

Can you give us a brief idea about the project, and about preparing questions for the question bank, anything else? Thank you.

That was a useful question; I intended to talk about it but forgot, so thank you for asking. The question is about the question bank, the projects and so on. If you have seen the exercise on Moodle, the question bank we want to create has questions of different types, and we want you to provide at least one question in each of those categories, two questions in some of them. It is up to you to come up with those questions. Please be inventive: don't just take a question from a textbook and put it in there; try to come up with your own questions based on your understanding. In fact, one great way of creating a question, which I have found, is to take some misunderstanding which either you or your students had: you thought something was one way, and after some time you realized you had not understood it properly, it was slightly different. More commonly, you find students who had such misconceptions - they thought something was true when it was not - and you can then ask a question which checks for that misconception. These are nice questions because they help separate the students who have understood from those who have not, and if you use them in your internal exams, students actually learn a lot from the exams: they learn that they don't know something. In fact, we usually have a session after the exam which discusses the answers to the questions, so the exam becomes not just a means of evaluation but a means of learning. Unfortunately, for board exams and their equivalents in universities, it is harder to do this, but for your internal exams you certainly can.

So what I want is questions which test a deeper level of understanding - not just memorization but understanding of a concept - and a lot of these questions will require the student to take an example problem and do something with it: write an SQL query, find a normal form, come up with an ER diagram, figure out what happens with a particular concurrency control protocol on a particular sequence of actions, and so forth. Or query processing: what happens under certain circumstances if you use hash join versus index nested loops join?
With all these kinds of questions, if you have understood what the algorithms are doing, you can answer them; if you have not understood, you can't. Those are the kinds of thinking questions which it is very, very important that we ask our students, to force them to think. As all of you know, there are many students who have come through a school curriculum where they could get away with just mugging everything up and reproducing it. We need to make it clear to those students that this doesn't work in the real world. Maybe they can get away with it in university exams, but when they go to work in a company, it doesn't work: companies do not hire them to mug something up and reproduce it. So it is important that they learn to think, and I hope that the question bank you create can be used by all of us to ensure that our students are forced to think; that will help them in the future. The project is the final assignment. So, back to you. One minute. Over to you.

Thank you, sir. One other question. Let us assume the entire data is on a different server at a different location, and we are developing only the processing side, the algorithms, at my university. Do you recommend a distributed environment for this, or do you recommend some other technology? Because if I end up using distributed databases, the processing over it is more and the time taken is also more. And we have to keep in mind that this kind of application needs a lot of security also. So, what do you recommend?

Okay, thanks, that was another good question. So, you have something which you are developing for somebody else: should you use a distributed database or not? The answer, as I said at the end of the NoSQL talk, is that these new-generation distributed data storage systems are not appropriate for most traditional database applications. Most of those applications don't have the kind of scale these things are designed to handle. And if you don't have that scale, why would you sacrifice all the nice properties that a database gives? Integrity constraints, indices - many of these systems don't even have a secondary index, and even if they have one, it is not guaranteed to be up to date: you may insert a tuple and not see it in the secondary index for a while. Programming against all these constraints is actually not easy; the programmer has to really understand what is going on behind the scenes. So the moral of the story is: if you don't need this kind of scale, don't even touch these systems. Stick to a normal relational database, and if a single-processor machine is not sufficient, well, you can get parallel versions of Oracle with shared-disk parallelism. That is the next step up. Shared-memory parallelism, of course, you get for free: all the databases already give it to you, you don't have to do anything - but that is with a few processors, two, four, eight. If you want to go beyond that, shared disk. If you want to go even beyond that, shared nothing. But like I said, the standard databases - Oracle, SQL Server and so on - do not give you shared-nothing editions which you can just use out of the box. So that is one level of complexity up, which you would take on only if your application really needs that kind of scale or performance. And if you go even beyond that and need thousands of processors, well, then you should be using these NoSQL systems.
I hope that answered your question; over to you. Yes, sir, it was nicely answered. Thank you. So, thank you, Anna University, for several good questions. The next one is Samrat Ashok, Vidisha.

My question is: in PostgreSQL, which commands are used for partitioning of the database? Is it similar to the commands we use in SQL?

Okay, so the question was: what are the commands for partitioning the database in PostgreSQL? Now, PostgreSQL does not actually offer, as far as I know, any partitioning features, unless they added something in a recent release which I am not aware of. So first of all, what does partitioning mean? One aspect of partitioning is what we have been looking at: partitioning the data in a distributed database. PostgreSQL does not know anything about distributed databases; it is just a single centralized database. So when I said that these people are building parallel databases on top of MySQL or PostgreSQL, what do I mean? Each machine in that system runs a copy of MySQL or PostgreSQL, and that database doesn't know anything about anybody else's existence. They build a layer of software on top which partitions the data amongst these different instances of PostgreSQL or MySQL. So PostgreSQL intrinsically does not have any support for partitioning, as far as I know.

Now, there are other databases, like Oracle and SQL Server, which talk about something called partitioning that actually means something else. That partitioning is usually used for breaking up the way the files are stored. To give you a motivation: suppose I have an application which keeps data, and it is required to delete all data older than three months. This is a legal requirement, not a matter of space: companies are legally required to keep 90 days of data, and in certain cases they don't want to keep anything more than 90 days, because somebody may use it to sue them. So every day, or every few days, you have to delete all the data older than three months. One way companies do this is to partition the data by date - by week, or by some range of dates. What happens is that the storage is broken up: it is still on one machine, maybe on one disk system, but a different part of the file system is used for storing this week's data, next week's data and so on. The benefit is that I can delete all the old data in one go. Another benefit of partitioning: many times I have a query which says, find me all records for this customer from the last one year, but I also have much older records for that customer. If I store that relation clustered by customer, I can find all the records, but the relation is huge if I keep all the old records. So what they do is partition by year, let us say, and within a year they partition or cluster by customer ID. What this means is that the old partitions never get updated - you don't have to touch them; all newly inserted data goes to just the current partition, and those files are smaller, the indices are smaller, and so on. Most queries touch only the current partition, so they don't have to look at huge parts of an index which are irrelevant to the current data.
So, there are many performance reasons why you want to partition the data into different storage areas, where each storage area has its own relation structures, its own indices, its own everything. That is what SQL Server, Oracle and others mean by partitioning today. You could place these partitions on different machines, but the goal of this partitioning was not to parallelize; it is for administrative and indexing efficiency reasons. I hope that answers your question; back to you if you have a follow-up question. Over to you. Thank you. Thank you, sir. Over to you, sir.

Okay, thanks. The next question is from NIT Warangal. NIT Warangal, let me select you, and over to you. I have basically two questions, sir. The first question is: time synchronization in distributed databases is a problem even now; are there any mechanisms to handle this time synchronization problem? And number two, how is concurrency control maintained in distributed databases? If you could throw some light on this...

Okay, there were two questions. One was time synchronization in distributed databases, and the other was related to concurrency in distributed databases. The first part: how do you keep time synchronized in a distributed database? Textbooks have some discussion of this. It is most relevant if you are using timestamp protocols, or various other schemes where timestamps are used to decide something. If you are not using timestamps - and none of the distributed data stores we are talking about actually use timestamps for anything very important - the problem goes away. Wherever they need a sequence number, instead of using timestamps they maintain counters, version counters, along with individual tuples. If somebody updates a tuple, and then somebody else further updates it, the version number stored with the tuple keeps getting incremented; it is not a separate logical timestamp. Yes, if you use logical timestamps, keeping them synchronized is an issue; therefore people don't use them wherever possible. That is the solution. For those of you not familiar with the problem: the clocks of different machines cannot be exactly synchronized, so one machine's timestamp may be permanently ahead of another's, perhaps by a significant gap, and if you use timestamp protocols for anything, this becomes a big issue. So the solution is: don't use timestamp protocols, and then synchronization is not an issue. Now, you do need some synchronization - if you keep a log of what happened, you want some idea of time - and these days all computers use NTP, the network time protocol, and they are all fairly well in sync with each other: typically not more than a second out of sync, and a few milliseconds is usually the worst. So if you just want the time as a timestamp in log records, this is more than good enough. That problem is essentially done away with by not relying on timestamps for critical actions.

The second part: concurrency control in distributed database systems. Again, most distributed systems don't do concurrency control; they simply say tough luck. There is a lot of work on distributed concurrency control - in the theoretical domain there has been a fair amount - but practically, the systems do nothing. In fact, the only true distributed database implementations which exist today are the distributed data stores I talked about; the relational vendors don't provide anything like that.
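Returning for a moment to the version counters mentioned earlier in this answer, here is a minimal sketch of the idea: each stored tuple carries a counter, every update increments it, and a write based on a stale version is rejected, so no clock synchronization is needed to order the updates to a single tuple. All names are illustrative.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// A sketch of per-tuple version counters used in place of timestamps.
public class VersionedStore {
    public static class Versioned {
        public final byte[] value;
        public final long version;
        Versioned(byte[] value, long version) { this.value = value; this.version = version; }
    }

    private final Map<String, Versioned> store = new ConcurrentHashMap<>();

    public Versioned get(String key) { return store.get(key); }

    // Succeeds only if the caller saw the latest version; otherwise the
    // caller must re-read and retry. The counter lives with the tuple, so
    // no cross-machine clock agreement is involved.
    public synchronized boolean putIfVersion(String key, byte[] newValue, long expectedVersion) {
        Versioned current = store.get(key);
        long currentVersion = (current == null) ? 0 : current.version;
        if (currentVersion != expectedVersion) return false;
        store.put(key, new Versioned(newValue, currentVersion + 1));
        return true;
    }
}
```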
What the relational vendors provide are parallel databases, which have full concurrency control and so on; as products, they are complete. In terms of distributed databases, they don't really provide much. What they do provide is the ability to connect to a remote database and fetch tuples into the local database, but that doesn't give you concurrency control per se. So if you are doing something with multiple databases, you are responsible for transactions and concurrency control yourself. Now, the other side of this question: if each database is doing strict two-phase locking, you don't need much extra distributed concurrency control; you only need distributed deadlock detection, because that is the part of concurrency control which now spans sites. But again, nobody implements distributed deadlock detection. So if you are doing transactions across multiple databases, you are on your own. That is the state of the world today; I don't know of any product which does more than this as of now. In the context of the distributed data stores, well, they don't do concurrency control, therefore there is no deadlock - the problem is wished away. I don't know if that answered your question satisfactorily; back to you if you have any follow-up.

Actually, we wanted to know about the final assignment; it is still not on Moodle. When are we going to get it?

Okay. I thought I had put up the final assignment on Moodle; perhaps it was hidden, so I will go back and make it unhidden right away. It is very much there on Moodle. Your coordinators should already be able to see it, because coordinators can see hidden items; maybe participants are not able to see it. If so, that is an oversight - it was supposed to be visible, and I will make it visible right away. If there is any other question, go ahead and ask. Over to you. Thank you, sir. Over to you.

Okay, we'll take a last question from Amrita, Bangalore. Amrita, I am selecting you. And I will wrap up with a few questions which came by chat. The first question is from Anna: in Java, I am able to use NLP support for Tamil; can the same be done in Hadoop, or will I need any other package to be installed additionally? First of all, Hadoop is built on Java: it is a Java library plus some infrastructure. So nothing prevents you - if you have already figured out how to handle Tamil characters in Java, you are done. Hadoop is Java; you don't have to do anything special.

The second question: is there any limit on the number of tables and databases in a server, and if so, how much? Each database system typically has some limit, but these limits are usually fairly big. You will not usually run into them unless you build an application which goes on creating tables like crazy. The limit, I am sure, is at least tens of thousands of tables at a minimum, probably millions or more. So as far as I know, you will not run into the limit as long as you are creating the schemas manually; it is pretty much impossible. If you have an application which is creating tables on the fly, you should really rethink it - applications should not normally be creating tables on the fly; it is a very, very bad idea. So for all practical purposes, there is no limit on tables. Similarly, for databases in a server, the limits are probably tighter; you can't have too many databases in a server. But again, you want to create separate databases only in some circumstances: when the two databases do not need to talk to each other and have nothing in common.
Then they are completely different applications, and they can be different databases. But if there is a relation here which you need to join with a relation there, do not put them in different databases; put them in the same database. You can't do cross-database joins. So you should not create separate databases except for something which somebody else is using and which you will not be accessing.

Next: thanks for naming GUI tools, like VB; will you please write down those names so we can note them down? I actually didn't give too many names, but let me write down whatever I did tell you, as far as I can recall. One option, if you are in the Microsoft world, is Visual Studio, which already has whatever you need. In the rest of the world, NetBeans had a package called Visual Web. It is an add-on package which at one point was integrated with NetBeans; today it is kept somewhat separate, for the reason I told you: it has been fairly buggy and the design had some flaws. So it is not great. Now, if you don't want to use that, there are two kinds of things which are very useful. First, there are several frameworks which people use. There is something called Struts, for one. There are a few frameworks based on JSP, tag libraries and so on, which make it easier for you to create pages. There is also something called JSF, Java Server Faces - I think Visual Web was based on JSF, if I remember right. Think of these as libraries which make it easier to build user interface elements in your web pages. Unfortunately, in my opinion, they are not that well designed, and there is a fairly steep learning curve; we have had programmers try to use them and get very confused about what to do. So I will not recommend them very strongly, although I know that Struts is used quite widely in industry. Then there is another class of things which are JavaScript based, such as YUI, which stands for the Yahoo User Interface library. This is a JavaScript library with a number of functions you can call to create things in the user interface, including support for tables which you can click on to re-sort by different columns. They have functions which can make Ajax calls back to the server. They have various kinds of buttons. Basically, you can build a user interface in JavaScript using these tools, and the library ensures that it will work on all the standard browsers. If you code raw JavaScript, it will surely fail on one browser or another; they have made sure that if you use standard JavaScript language constructs, and for anything to do with the DOM tree or the user interface you use only the YUI library functions, it will work seamlessly. So that is a very nice library which I would encourage you to use. There was one more in the Microsoft world called Iron Speed, but it is not free - you will have to pay for it - although I think it is good. There are others; I do not have an exhaustive list here.

Last two questions. How large is your university database? Are you maintaining the data of students who have passed out? Do you use the Hadoop file system for your growing university database? That is a good question. How large is our university database? It is embarrassingly small. If you just take the academic records, we have the records of every student who graduated from, I think, 1989 or so onwards. Some of the older data is also there, but its schema was different.
We have not got around to putting that older data in the online database, but from 1989 up to 2007 or so, the last time I checked the size - 17, 18 years of data, not including photos, just the academic records - it was embarrassingly small: something like 600 megabytes. It is that small. All the names, the courses students took, who taught which course, the grades they received - everything was 600 megabytes. I think it is a little bigger now; we have added a lot more information. Then our whole financial database - all the accounts our administration maintains, all the payments, student fees received, scholarships paid to students - everything put together used to be about 4 gigabytes; I think it is now probably of the order of 8 gigabytes. Again, this has everything from around the same time frame, 1990 or so; we have not dropped a single thing since then. Everything put together, over 20 years, is of the order of 8 gigabytes. Overall, when you also add photos and so on - everybody's photos for our security systems - everything is probably of the order of 16 gigabytes. You really do not need the Hadoop file system for something like this; a single server is probably more than adequate for all our needs. We actually have two or three servers, but even if you put everything on one current-generation server, it would be more than adequate. That is why I said you do not want to use Hadoop, or the Hadoop file system, or any of those distributed data stores: they are completely inappropriate for a university. IIT is a medium-size university; there are universities in the US which have maybe 40,000 or 50,000 students - about 8 times our size - but assuming their records are similar, that is 16 gigabytes times 8, which is still very small these days, again fitting very comfortably in a relational database. But if you run a website which has millions of users, then you are talking: your scale is much bigger, and that is where you need all these things. You do not need them within an organization. So, I will stop here. Thank you and goodbye.