Good afternoon. So, today I am going to be talking about the new generation of massively parallel data storage slash database systems. Now, this is an area which was really motivated by the web. Before the web, most organizations, even the largest organizations, had fairly small databases. There were only a few companies in the world which really had extremely large databases, and those were things you could count on the fingers of your hands. Maybe the phone company in the U.S.; Walmart, which you keep hearing about, which has huge volumes, had huge volumes even back then; and a few other companies. And there were a few parallel database systems which were designed to meet the needs of these companies. But with the growth of the web, things have changed dramatically. A small web company can start and in one year it can be serving hundreds of millions of customers. All of you must have heard about WhatsApp, which was bought by Facebook recently for, I do not know what, 11 or 17 billion dollars, some large number of billions of dollars. And what does it do? It takes messages from people and forwards them to other people, and it stores them. So, a big component of what it does is data management: storing, forwarding and so on. And the key thing to note is that here is a company which probably started with a few users and in the space of a year or two grew to hundreds of millions of users. How on earth can a company which grows so fast manage its data? I do not know exactly what WhatsApp did specifically, but I am going to be talking about the general principles of how you can manage extremely large amounts of data in a fairly convenient manner. And these techniques have been developed more or less in the last 10 years, and I will give a little bit of history of what has happened.

So, that is the motivation. And here is a slide which says the same thing again. You have many such companies. WhatsApp was the example I used, but you have Facebook, you have Twitter, you have any number of companies out there. These are the big ones, but there are any number of smaller companies which may not go to 100 million users, but they may well go to 100,000 users very quickly, or a million users. And traditional database systems, centralized database systems, do not really work for them. Now, there are two aspects to this. One is how much data they store. Some of these people store petabytes of data. But the second, more important aspect, which we will focus on today, is accessing that data. One kind of data generation is through logs which are generated by web servers. The web servers of the world today generate literally petabytes of logs per day, but the way these logs are used is mainly to do some kind of analysis. There is a program run periodically which goes through the logs of the day and looks for patterns, makes decisions and so on. However, there is this other class of things like WhatsApp, which I told you about, which may have petabytes, but the way they access and retrieve the data is different. They are what would generally be called OLTP, online transaction processing, systems. When you send a message, a point update is made in WhatsApp. When you receive a message, another update is made, and so forth. So you have very high demands on not only storage, but the systems also have to be scalable. What does scalability mean? Today you have 10 users, tomorrow you have 1000, day after tomorrow you have 10,000 and so on.
And your system should be able to grow to meet this kind of growth of the user base without having to reimplement anything, without having to redo anything. Of course, you cannot handle all those millions of users on a single machine. So scalability is achieved by adding more machines to the system. And this addition has to be smooth, incremental. Bit by bit you add more machines as you grow, and the system should be able to absorb them and improve its performance steadily.

The second aspect is availability. Now, all of you who have ever used email from your local college email servers would have certain expectations of email availability. In the early days you assumed that email would only be accessible 9 to 5. Then the expectations grew that it would mostly be available 24 hours, but you know maybe after midnight there is a good chance it will fail. And then your expectation came up to: it is available 24 hours, but you still expect a couple of days, two, three days of downtime a year, minimum. But the expectations on the current generation of web systems are far more stringent. If Facebook were unavailable for, let us say, 15 minutes, people are going to notice. If it is unavailable for one hour, it is going to be mentioned in the newspaper that Facebook was down for an hour. And down for a day? Well, productivity in offices and colleges would improve drastically, people would start paying attention to lectures; maybe that would be the good thing. But for Facebook itself there would be a large loss in revenue because they could not serve ads and so on. So availability is very important for all these applications, very high availability.

So all of these needs are basically satisfied by parallel and distributed data storage. And of course this idea is very old. Soon after the first relational database system, System R, came, within 4, 5 years the R* (R star) distributed database project started. And they worked with tens of nodes. This kind of thing continued for a while, but there was not a very big market for it. There was work on distributed databases, data interoperability and so on, which stayed at tens of nodes. But then there were systems like Teradata which would collect large amounts of records and then allowed decision support queries on those. They were not good at OLTP, but they handled very large data volumes with very high throughput. The next step happened about 15 years back, when Google and others started building distributed file systems with thousands of nodes. Now, distributed file systems with hundreds of nodes actually are very, very old. Back in 83 I remember hearing a talk from a person called Satya, who was at CMU, about a distributed file system called Coda that they were building. So 83 is now 30 years ago. But those did not have very high throughput requirements. The current generation not only have large numbers of nodes, but very high throughput. So I will tell you a bit about the Google file system and other distributed file systems; GFS was publicly described around 2004. And the next step after that was distributed data storage with not a large number of files, but really hundreds of billions of small objects, each object small meaning kilobytes to megabytes. That was described around 2006. This is called Bigtable. I will talk about that also. More recently there are distributed database systems which go beyond just storing and retrieving data, but can actually run SQL queries.
So I will talk about that at the end of my talk. So, I have already been alluding to these different types of data. There are large objects, video, large images, web logs and so on, which are written once and read many times. And distributed file systems are very good for storing this kind of data. These data are never updated; they are append-only data. Then there is transactional data, which I have already mentioned: Facebook, Twitter, updates in Facebook, friend lists, likes of pages. And there is also email, which is often stored in a data store as opposed to a file system. And all of these have billions, even trillions of objects. A third category is indices. When you search on Google, Google is using keyword indices to answer your keyword search. So the question is how are these built, how are these stored, and so on. In the old days a file system was fine for storing these indices, but Google itself has moved towards indices which are updated very frequently, and they need something in between daily updates and millisecond updates, somewhere in between.

So, coming back, parallel databases, like I said, have been around since the 80s. They have indeed been used for transaction processing. In fact, the Indian Railways reservation system was one of the first online systems, OLTP systems, in India. And from day one it was built as a parallel architecture on the old Digital Equipment Corporation machines. And there were parallel databases designed in that era for tens to hundreds of processors. Hundreds was more for decision support, tens was more for transaction processing. Now, at this scale the likelihood of failure is relatively small. And if a failure occurs, the key goal was to recover within a short time. If you are down for 5, 10 minutes, 1 hour, it's acceptable. You cannot be down for a day though. But now the needs are much more stringent. You can't be down for hours.

The next kind of thing which has happened is that there's geographically distributed data. All of us use email, social networks and so on. And you could be anywhere in the world while using these. Now, if your data is stored far away, let's say in the US, there is just the round trip time: you click a mouse button here, and it takes maybe 500, 600 milliseconds for this click to go all the way to the US, something to happen, and then come back. Just the delay on the network. And that will turn into noticeable lag. So companies want to keep data and applications near you. So if an Indian clicks, their clicks are sent to an application server in India or near India, maybe Singapore or something, and so the response time is much faster. So what you want is lower latency via geographic distribution. But geographic distribution also comes with certain problems. The first is what is called network partition. I'll come back to this later, where the cable between, let's say, India and Singapore gets cut. And now you can't update your data. But maybe you have a local copy in India, in an India server. But the India server and Singapore server can't talk to each other. That's a network partition. Some terminology which you might have heard already yesterday, I guess, or today morning. One of the terms is replication. You keep copies of the data. So you may want to keep a copy in Singapore, a copy in the US, and a copy in India. So if a network cable is cut, the India copy may still be accessible to you even though the Singapore copy is not accessible. The second is data partitioning.
If you have a lot of such data centers, or if you have a lot of nodes in a data center, you can't keep a copy of the full data everywhere. You're going to partition the data. So there will be smaller units. And I will talk a lot more about partitioning data. And of course, you combine replication and partitioning. So the goal of replication primarily is availability. But it can also help in, well, not just availability. I should say availability and latency, because the data is available nearby. The other two benefits are parallelism, because you have multiple copies, you can access them in parallel, and reduced data transfer. If you have a copy locally at your application server, there may be no network traffic when you do some lookup, because it's available locally. But the cost is, of course, that every update which happens has to be sent to all the replicas, and you need protocols to make sure that the replicas stay in sync. The problem is, if you have two copies, one copy is updated and the other is not, then there is a problem. Now, this is a well-known problem in databases, where the whole theory of normalization deals with it. The thing which is drilled into you is that you normalize your schema, you avoid replication of data. But now I'm saying replication is a good idea. But this is a different kind of replication. This is system-controlled replication, and it's the system's job to make sure the replicas are up to date. If you do it at the schema level, it becomes the programmer's job, and programmers goof up and will forget to update replicas, leading to inconsistency. But the system also has to deal with updates on copies which are alive while another copy is inaccessible. So it may go out of sync. Then how do you detect it? How do you catch up? And so on. Okay, so that's the background.

So how did people deal with the demands of the web era? The most common thing in the initial days was what is called sharding, which is basically build-your-own parallel database. And the idea is you divide your data.

Professor, just a second, there's a query. Professor, there's just a query. Sir, you are discussing replication, sir. So yesterday, while we were discussing with Ramnathan from TCS, you told us that there are different ways to update the replicas, like batch-wise or otherwise in an incremental mode. So which is well suited for updating the replicas, sir?

Okay, so there are two kinds of things. In the traditional distributed databases with replication, in some cases it would make sense to do the replication at night. The central database updates certain tables, and the next day the other sites need to access that data, which is going to remain the same for the day. For those kinds of things, batch replication is good. Or even the other way around: the sites are processing transactions all day, but they don't push the transactions to the central database till maybe the night, in which case the replication happens at night. But for the class of applications I am talking about, that is usually not an option. As updates happen, they have to be sent to all the replicas. If a replica is down and inaccessible, well, it can't be updated immediately, but the protocol has to ensure it gets updated at the earliest point when it comes back online. Does that answer your question? Okay, sir. Thank you. Thank you, sir.

Okay, so coming back: among the approaches to getting high transaction throughput, the standard approach is called sharding, where you have a lot of cheap databases in the web context.
In the bank context, they need not necessarily be cheap databases. They may be expensive Oracle databases, but the basic idea is the same. You divide data amongst many databases. So in the bank context, they say accounts from this number to this number are on database one, from this other number to that number are in database two, and so on. And if more accounts come, they start putting them in more databases. The thing though is that an application has to decide which database to go to. It will look at the account number and say, okay, I need to go to database one or two or three or whatever. So now it's the job of the application to do this. Now what if a new database is added? Well, all the applications have to deal with this. Maybe the partitioning is coded into the application code, in which case the application has to be stopped, the code updated, and then restarted. Very intrusive. So that's the limitation. The benefit is it's cheap and it is good for a certain class of applications. In fact, it was very widely used. It is still used quite a bit, but the limitations are actually a huge problem. For example, take Google's ads. Whenever you view a page in Google Search, ads are displayed. When you view somebody else's page who is using Google Ads, an ad is displayed by Google. Many of these things require money to be paid, transactions have to be recorded in a database, and all of this was done by sharding at Google. But the big problem they had is that if you change the schema, or not just the schema, if you decide to add a few new nodes to the parallel system, changing the partitioning is a big pain. Moving data is another big pain. So you really don't want to do all this. You want the system to manage all this transparently. If you add a node, the system should ideally move data from the existing nodes to the new node in a smooth way without overloading any node or the network. So what you want is transparent scalability.

So that is where the parallel and distributed key value data stores come in. And that is going to be a major focus of this talk. The basic functionality provided by a key value store is to store a key value pair. You give an ID, a customer ID or a roll number or employee ID or something, along with data, and you store that data in the system. And when you want to retrieve the data, you give that same ID back, and the system retrieves the data which you stored earlier. You can also update it. Now, a lot of things can be done with a fairly simple interface like this. Although real systems actually provide much more functionality, this is the starting point. So there has been an explosion of systems which support this plus a lot more functionality. Starting with Google Bigtable, and there is a clone of it called Apache HBase, which is becoming very popular these days. Then Yahoo got into the game and built a system called PNUTS. In parallel, Amazon built a system called Dynamo, which was then cloned, actually by Facebook, into a system called Cassandra, which is now an Apache thing. So HBase and Cassandra are kind of competitors. And this is just a few. There are actually a lot more. Here is a small list: MongoDB, CouchDB, Couchbase, Neo4j, HyperTable, blah, blah, blah. I am not even going to read all of them. And this is, mind you, like one-tenth of the number of systems that are out there in this space today. Of course, they are not all the same.
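To make the contrast concrete, here is a rough sketch in Python of what application-managed sharding looks like versus the key value interface I just described. This is my own illustrative code with hypothetical names, not any particular system's API.

```python
# Hypothetical sketch: application-managed sharding vs. a key-value store interface.
# None of these names come from a real system; they only illustrate the idea.

# --- Sharding: the application itself decides which database holds an account ---
SHARDS = [
    ("db1.example.com", 0, 499_999),        # accounts 0..499999 live on database one
    ("db2.example.com", 500_000, 999_999),  # accounts 500000..999999 live on database two
]

def shard_for(account_no):
    for host, lo, hi in SHARDS:
        if lo <= account_no <= hi:
            return host
    raise KeyError("no shard for this account")
# Adding a new database means editing SHARDS in every application and redeploying it.

# --- Key-value store: the client just says put/get; the store routes internally ---
class KeyValueStore:
    def __init__(self):
        self._data = {}           # stand-in for many partitioned, replicated nodes

    def put(self, key, value):    # store (or update) the value for a key
        self._data[key] = value

    def get(self, key):           # retrieve whatever was stored under the key
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("customer:42", {"name": "Asha", "city": "Chennai"})
print(store.get("customer:42"))
```

The point of the sketch is only the division of responsibility: with sharding, the routing table lives in the application; with a key value store, the client sees put and get, and partitioning, replication and rebalancing are the system's job.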
They offer different kinds of functionality, but they are all in the same space: scalable systems where you store and retrieve data. For example, Neo4j over here is used for storing graphs. MongoDB and CouchDB are used for storing what they call documents, but what one might call objects, and so forth. What all of them provide in common is partitioning, replication, high availability, and it is mostly transparent to applications, or fully transparent. But they are not full-fledged database systems. They do not support SQL, first of all, but that is only a small part of it. They also do not support full relational storage. What do I mean by this? What does a relational storage system with SQL DDL give you? You have relations with columns, types, and so on. That, these systems also provide, or provide variations or even extensions of. But the other things that come with a relational system are things like primary key constraints, foreign key constraints, and other such constraints. These systems do not provide them. Then, database systems provide an SQL interface. These systems do not provide it. So, you give up something in order to get your massively parallel scalable system. So, the focus of the talk is going to be such systems, and then how they have been adding functionality and turning into full-fledged databases slowly, bit by bit. They are not there yet, but they are getting there.

So, what is the basic API which these provide? They do not support SQL. They provide an API. In this API, you can get a key, that is, extract the value given a key; but of course, to get it, you have to first put a value, so you say put key, value. You can delete a key, and most such systems also allow you to execute some operation on a given key with some given parameters. Of course, the operation has to be registered with the system, like a stored procedure or something, which you can then execute. And then there are many more extensions to this basic functionality, version numbering, and so forth.

So, this class of systems is also called NoSQL systems, or no-SQL as people also call it. Some people realized that SQL is actually quite useful, and they started adding SQL back to some of these systems. At this point, they started calling themselves "not only SQL" systems, where the "no" is not for "no" but for "not only". Anyway, all that is playing with words, but the key thing is they are scalable storage systems, which may provide some functionality of relations, constraints, SQL and so forth. So, a lot of people call these NoSQL systems, but we do not like that, because you do not define something by what it is not. You can say that I am not a football player, I am not a singer, I am not this. That does not help you to understand what I am. So, you want to say what you are, and I would prefer them to be called parallel or distributed data storage systems, or even scalable data storage systems, which reflects better what they do. So, now, they are different from databases, because they do not support SQL. As a consequence, they do not support joins, they do not support integrity constraints, they do not support transactions. They sometimes support a kind of mini transactions. That is, you can do multiple updates on a single node of the system in an atomic fashion, but that is it. And no SQL means no query optimization, nothing. So, if you want to write a query, you have to code a program yourself, which is a pain. But they do provide some other features, including flexible schema and multi-versioning and so on. So, these systems can be classified as below.
This particular classification was done by somebody called Rick Cattell. So, key value stores are at the bottom; in this figure they are shown at the top, but they are actually the simplest version, where you have a key and a value, and the value is completely uninterpreted. Amazon's Dynamo is a very widely used thing within Amazon, and it does not look inside the values at all. The next level of systems starts looking into the values in a limited way. So, CouchDB and MongoDB assume that there is a key and there is a value, and the value is in a representation called JSON, which the system can look into, so it can build indices and do other simple stuff with it. There are several databases that support that. The next level up in this hierarchy is extensible record stores, which have a notion of rows and columns, and those are very important for many tasks, including indexing and so forth. And if you want to retrieve only some columns, not all columns, well, columns are quite useful. So, the first one in this class was Bigtable, then there were many others which came up, and I am going to be talking about all of these systems today. The final level up in this hierarchy is a distributed, scalable, relational database management system, and Google F1 is probably the best in its class, described about a year ago. They probably started using it about two years back. But there are also other systems which may not scale quite as well as Google F1.

Now, I have been telling you what I am going to do, but let me also characterize what I am not going to do. A lot of you might have heard of the terms map reduce and Hadoop and so forth. This talk is not about map reduce. So, what is map reduce? It is about querying relatively static big data, where the data is broken into large files. It is append only, no updates. And you may add more files. You may add data at the end of a file. But what you do with that data is query it in large volumes. You read a lot of data, aggregate it and get some information out of it. And for that, there is a framework called map reduce, which many of you might be familiar with. And that is very widely used, but that is not our focus at all. Our focus is on transactional data. I hope that it will be covered elsewhere in this workshop, or you might have seen it before.

So, here is the outline of the rest of the talk. I am going to give a very short background on distributed transactions and concurrency control. It is going to be very brief. Then I am going to talk about distributed file systems and then the data storage systems. And the last part of the talk will be about availability versus consistency, the CAP theorem and so forth. So, this part is on distributed transactions, which Professor Gopal has already covered here. Let me quickly go over a few of these slides. So, a distributed database system has a number of nodes, shown as computer 1 through n. Each computer has its own transaction manager, which manages the data stored on that node. And then there are these things called transaction coordinators, where an application will send a request and the coordinator will forward the request to whichever node has the data, retrieve it and send it back. That is the basic architecture. And the coordinator is responsible for many things. For example, for ensuring that if a particular transaction updates data at nodes 1, 3 and 7, then all of them get updated or none of them get updated. So, that is atomicity.
Now, when you have a distributed system like this, there are many failure modes unique to distributed systems, which are not there in a centralized system. The first is failure of a site while other sites are still up. The second is loss of messages, which is handled by network protocols. I will not get into that. The third is failure of a single communication link. Again, that is typically handled by network protocols by having multiple routes. I will not get into that. The critical new failure mode, though, is network partition, where links fail and as a result you have nodes on the two sides of the system which are not able to talk to each other. So, a network is said to be partitioned when it has been split into two or more subsystems that lack any connection between them. Now, a special case of this is that a node and the communication links out of that node are cut. So, as far as the rest of the system is concerned, this node appears to be dead, but the node itself is actually alive. So, it is actually a partition. So, distinguishing between partitioning and site failures is essentially impossible. You can make a best effort by having multiple links and so on, but in the limiting case you cannot distinguish them. So, whatever protocols we use should ideally work whether it is a partition or a site failure, although people do use protocols which may not quite respect this.

Now, given that you have a distributed system, you want to ensure atomic commit. You have commit protocols, of which the two-phase commit protocol is very widely used. I assume that this was covered yesterday, and it is not very critical to know the details of two-phase commit. So, I am going to skip these slides unless anybody has a question. I will just flash the slides. So, there is a first phase where you prepare to commit and then a second phase where a decision is made and then the commit happens. Are there any questions about two-phase commit or shall I move on? I can move on then.

Now, the alternative to two-phase commit is what is called persistent messaging, and this is very widely used for distributed transactions. I would say this is even more widely used than two-phase commit, and the notion of two-phase commit is actually inappropriate for transactions that cross organizational boundaries. If I want to transfer money from Canara Bank to State Bank, I cannot insist on a two-phase commit between the two databases. However, I can create a draft or an electronic fund transfer request at Canara Bank and tell them to send it to State Bank. When State Bank receives the message, it puts the money into that account, and so that is the persistent messaging model. Why persistent messaging? The message has to persist. You cannot lose it. Once I say transfer money, you cannot lose the message and deduct my account and forget to credit the other guy. That is a problem. So, persistent messaging systems guarantee transactional properties for messages. Once a message is sent, it will be delivered exactly once. Maybe you saw this earlier; I will skip this. The thing with persistent messaging, though, is that errors have to be dealt with by the programmer. In two-phase commit, what happens is, supposing you are transferring money and there is a problem with the other account. The account does not exist. It has been closed or whatever. Then the transaction can be rolled back immediately and the money credited back. But with persistent messaging, you send a message, and the message cannot be processed at the other side.
What happens now? What should happen is that the money should be credited back into my account. But that cannot be done by the recovery system of a database. It has to be dealt with by some code which says, okay, I could not deliver the message because the recipient account does not exist, therefore the message is sent back as failed, at which point the money has to be credited back into the account. So, all this logic to recognize and deal with failure has to be coded by the application programmer. That is the price you pay for the benefit of persistent messaging. In fact, this is a recurring theme: in many cases in distributed systems, if you want a particular feature, you usually pay for it by foregoing something else.

Now, coming to replicated data, the key issue is that all replicas should ideally have the same value, which you can ensure if all replicas are up: update all of them, no problem at all. But what if one is not available? Disconnected, failed or whatever. What if there are two transactions running concurrently? One of them updates copy 1 to 5 and is waiting to update copy 2. The second one updates copy 2 to 20 and is waiting to update copy 1. They have done inconsistent updates to the two copies. So, you need some form of distributed concurrency control. A very simple way to do distributed concurrency control is through a primary copy. You choose one replica to be the primary copy, and whenever you want to update that item, the update has to be sent to the primary site. It cannot be sent to other sites. So, if two people try to update the same item, both of them will land up at the primary, and concurrency control at the primary site will ensure that only one of them succeeds at a time. But you may have many different primary sites. For item A, the primary site may be one. For item B, the primary site may be two. Not a problem at all. So, locking can be done by the primary site, and when you lock at the primary site, you implicitly have a lock on all replicas. You do not have to explicitly lock them. So, concurrency control is much simpler. And in fact, this is widely used in these storage systems which we will be talking about. But there is a drawback. If the primary site fails, the data item may be inaccessible even though other sites which have a replica might actually be accessible. So, you need a protocol to transfer ownership of that data item, the primary ownership, to some other site. So, that has to be dealt with. But it is still a very simple and inexpensive protocol, so it is very widely used. And the other thing is, when you perform an update, it is done at the primary and subsequently it has to be replicated to all the copies. Usually, it is sent immediately. But if a site is down, you can log the message, and when the site comes back up, it is sent to that site. And all the updates to a single item are serialized through the primary copy. And I will just mention that the read one, write all protocol does not work. Again, this must have been covered.

I am going to then move on to replication with weak consistency. The problem with the majority protocol is that if multiple replicas are down, you may not be able to even update a data item, even though some replicas are up. So, availability can become a problem. The other issue is, when you read a data item, if you have to read from a majority, it is expensive; you might want a cheaper read, which could fetch you stale data. So, to get better speed for the application, you may live with weak consistency.
So, the thing with weak consistency is, hopefully nothing will go wrong, but there is always a possibility. If it goes wrong, if two people concurrently update a data item, you need to detect that a problem happened and then you have to reconcile it. I will come back to this in the context of Dynamo later in the talk.

So, now, let us move on to new stuff, which is the distributed file system. How many of you have heard at least a little bit about distributed file systems? Raise your hand if you have heard about them. Good. Many of you have heard about them. How many of you know a little bit about what goes on inside a distributed file system? So, you have heard about it, but do not know much about what goes on inside. How many of you have heard about Big... yeah, sorry, please speak up. Sir, we just know what distributed file systems are, but not in detail, sir. Okay, I will talk a little bit about those things. Next poll: how many of you are familiar with Bigtable, HBase, Cassandra, any of those other systems which I mentioned? How many of you have heard of any of those? Similarly, sir, we have heard about them. We did not work with HBase or the big data systems. No, not worked with them; do you know the principles behind these, how they work? Have you heard of those things, the principles at least? No, sir, we just know a little bit about NoSQL, sir. That the data is not relational, it is a flat file, and that it is divided column-wise. That is all we know. Okay. So, you have heard a bit here and a bit there and so on. So, I will talk a lot more about it here. Okay, good. So, then this part of the talk is not wasted with you already knowing about it. Good.

So, let me start with distributed file systems. Like I told you, there are very old ones such as Coda, but the one which popularized the idea in recent times is the Google file system, and the Hadoop file system is an open source copy of GFS. So, the idea is that you store files across large numbers of nodes, hundreds of machines, thousands of machines, and you want to be able to access a file. You give a file name, and the system should return the file to you. How does it return the file? Well, it tells you these are the blocks in that file and here is where the blocks are located, and then you can go to that node and say, give me this block of this file, and that node will return that block. So, by reading one block after another, you can read the whole file. That is the core idea of a distributed file system.

So, how do you build a distributed file system? How do you architect it? Well, there are basically two kinds of nodes. There is a master, and there may be replicas of the master, a backup master and so on. The master is responsible for metadata. What is the metadata? The metadata includes the directory system, the file system directory. What are the files? What are the directories? It also keeps track, for each file, of what the data blocks in that file are. Those data blocks do not reside at the master node. The data blocks reside in the chunk servers, and each block of data is usually fairly big. It might be of the order of megabytes or even tens of megabytes, in some cases hundreds of megabytes. Blocks can be that big. And the master will tell you, if you give it a file name, that this file contains this block at this server and that block at the other server, and the chunk servers store the blocks and will give you the blocks.
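As a rough sketch of that read path, here is some toy code of my own. The names are hypothetical and this is not the real GFS or HDFS API, which uses RPCs and a lot more machinery; it only shows the master handing out metadata while the chunk servers hand out the actual bytes.

```python
# Toy sketch of a distributed file system read path: a master holds only
# metadata (file -> block IDs -> which chunk servers hold each block),
# and clients fetch the actual bytes from the chunk servers.
# Hypothetical names for illustration, not the real GFS/HDFS interfaces.

class Master:
    def __init__(self):
        # file name -> list of (block_id, [chunk servers holding a replica])
        self.metadata = {
            "/logs/2014-06-25": [("blk_1", ["cs3", "cs7", "cs9"]),
                                 ("blk_2", ["cs1", "cs3", "cs5"])],
        }

    def locate(self, filename):
        return self.metadata[filename]

class ChunkServer:
    def __init__(self, blocks=None):
        self.blocks = blocks or {}               # block_id -> bytes stored on this node

    def read_block(self, block_id):
        return self.blocks.get(block_id)         # None simulates "don't have it / I'm down"

def read_file(master, chunk_servers, filename):
    data = b""
    for block_id, replicas in master.locate(filename):   # one metadata lookup at the master
        for server in replicas:                           # try replicas in turn
            block = chunk_servers[server].read_block(block_id)
            if block is not None:                         # got it; otherwise try the next replica
                data += block
                break
    return data

# Tiny usage example: only cs3 actually has the blocks, the others are empty.
chunk_servers = {name: ChunkServer() for name in ["cs1", "cs3", "cs5", "cs7", "cs9"]}
chunk_servers["cs3"].blocks = {"blk_1": b"first block ", "blk_2": b"second block"}
print(read_file(Master(), chunk_servers, "/logs/2014-06-25"))
```

Notice that the client falls back to the next replica when one chunk server does not respond; that is exactly how these systems tolerate individual node failures, which is what I turn to next.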
Now, the problem with building any large system with hundreds or thousands of nodes is that failures happen. The chance of one of the nodes in a set of 100 being down at any time is fairly high with cheap systems. With high quality systems the chance is less, but it will happen a few times in a year. So, if you lose data at that point, you are in trouble. So, you have to deal with it by replication. So, all the distributed file systems by default will keep 3 to 5 replicas of every block, and the master keeps track of where each particular data block is replicated. And all of these distributed file systems are typically within a single data center. What do I mean by this? Any large company like Google or Yahoo or Microsoft has many data centers. Each data center is in one location, and the network within the data center is very fast. So, if you access data from some machine in that data center, you will get it within a fraction of a millisecond. Typically in hundreds of microseconds you will get a reply. Whereas the moment you go from a data center in, say, Chennai or in Madurai, to a data center nearby, say on the outskirts of Madurai, or some other town, say Tanjavur, it may be 100 kilometers, and you are already looking at a significant delay to go from here to there. So, you are talking of millisecond delays perhaps, and that can add up and become very significant. So, all of these systems are designed to work within a data center. It is not just the delay; you may also have network partitions. Within a data center, network partitions are very rare, so you do not typically have to worry about them. Across data centers, they are much more common.

So, here is a picture which shows how HDFS is architected. There is a master, which is called the name node in HDFS, and there is a backup master, the secondary, here. A client which wants to access a file sends the file name to the name node, and what it gets back is a set of block ID, data node records. For each block ID, it says which data nodes hold it, the chunk servers or data nodes. So, the master tells you which data nodes have a copy of this particular block ID. The client can read it from one of those data nodes. What if the data node is down? Well, the client will wait for some time. If it does not get a response, it will go to the next data node which has a copy of that block and get it from that. That is how you deal with failures. If you want to update a data item, similarly, you will ask the master to create a new block. The master will tell you where all the replicas are and the client will write to all of them. So, GFS was described in a paper in the OSDI conference in 2004 and it was soon copied in HDFS.

Now, GFS and HDFS are very nice for storing petabytes of files across thousands of nodes. But there are big limits. The first limit is that if you have things which read files repeatedly, the central master becomes a bottleneck. It becomes a bottleneck in two ways. One is the number of reads that come to it. The second is that if you ask for the metadata for a file and that metadata is not in memory, it has to do a disk IO. The moment you do disk IO, the number of requests you can serve from a single hard disk is about 100 per second. So, the poor master can only serve hundreds of requests per second across multiple disks, which is very low. You want not hundreds, you want hundreds of thousands of requests per second. So, this is not at all scalable to that kind of a thing.
And then, for every file, there are very significant file system directory overheads, which is absolutely a non-issue if the file is very big. If you have a file which is 100 megabytes, a few kilobytes of overhead is nothing. But if you have a file which corresponds to a small record, a single record in a relation, that record may be 20, 30, 40 bytes. If you have a 1 kilobyte overhead for 40 bytes, you are doomed. Your storage overheads are ridiculous. So, they are not suitable for small data items. The third thing is that they are not designed for updates. They are only designed for append-only data. So, there are absolutely no consistency guarantees; they cache blocks locally and so forth. So, they are not suitable for databases. What is interesting, though, is that you can use them as the underlying layer on top of which you can build a data storage system. In fact, Bigtable from Google uses GFS underneath. So that, in the course of four slides, is what a distributed file system is.

Now, let us move to data storage systems. The fundamental API, as I already told you, is get key, put key value, delete, execute and so forth. Now, what is the data type? It could be an uninterpreted key value, like Amazon Dynamo, or S3, which is more of a distributed file system that is more scalable. Or you can have a flexible schema, JSON. Or you can have keys which have an order, with records and columns and so on; Bigtable, HBase and so on support that. And then you can have documents. I have already covered this slide earlier, so I am going to skip that.

Now, many of the applications for big data want a flexible data model, because they do not want to do normalization and then do joins. It is too expensive in a system with thousands of nodes; if you have to join data from two different nodes, it is expensive. So, many of them have a flexible data model which can have nested tables, nested arrays and so forth. That is something which all of these systems support. So, here you have a key, and within that key the value could be a set of name value pairs, and what is shown here is three different products. The name value pairs of those three may differ. All of them have a name. One of them has memory, because it is an iPad. Another has wheels, because it is some kind of vehicle. Another has screen size, because it is a laptop, and so forth. So, this is a flexible schema. There are column names, but each record may have its own column names, and you can add or delete columns. So, that is an overview of those storage systems in terms of the API and data types they support.

Now, let us get into the architecture of these systems. We are going to talk about Bigtable, PNUTS and Megastore here, and as I said, there are many, many more which have been developed. We are not going to talk about those, except for Dynamo, which we will come back to. So, what is Bigtable? It is a massively parallel data storage system, as I have already said. It supports a key value store. It supports flexible schema, like the example I showed you. It is designed to work within a single data center. It does not address distributed consistency; it does not address keeping a local copy and so on, because it is only within one data center. It is built on top of GFS and something else called Chubby, which I will not get into. It is becoming very important for people these days.
It was important because it was the first one of its kind, but now it is becoming even more important because HBase is basically an open source clone of Bigtable, and HBase is becoming quite popular these days. Many people are using HBase. So, how does Bigtable work? Think of having a table which is extremely large, petabytes of data. You cannot store it on a single node, so you have to distribute it across many nodes. How do you do this distribution? The key idea is to split the table into many tablets. How many tablets? Lots. You may have 100 nodes in your system, but the number of tablets may be 10,000, maybe far more than the number of nodes. In fact, what happens in Bigtable is that each tablet is of the order of 100 to 200 megabytes. That is kind of their target. If a tablet becomes bigger, they will split it. Now, each tablet server manages multiple tablets. 100 megabytes is small. I mean, your machine may have disks of, say, a terabyte, which means about 5,000 tablets can be stored on one machine. So, each tablet is controlled by just one server, and if a tablet becomes too big, it gets split. Now, there is also a master node. So, there are two types of nodes again. There are tablet servers, which manage the actual data, and then there is a master, which controls the whole system. It does load balancing, fault tolerance and so on. What do I mean by fault tolerance? If a tablet server dies, it was responsible for certain tablets. Now, somebody has to take over and manage those tablets. So, the master is responsible for telling someone to do that.

Now, where is the data stored in Bigtable? It is stored in GFS. Why GFS? GFS already supports replication and fault tolerance. Bigtable could have built its own replication and fault tolerance infrastructure, but instead of doing it from scratch, they decided to leverage GFS. It is a good engineering decision, but there are a few compromises because of that, which we will see. Now, the key thing is, because the data is in GFS, it is not associated with a particular node. So, if a tablet server goes down, that does not mean its data is inaccessible. The data is replicated and is available somewhere else. So, all that you have to do is tell somebody else to take over. In addition to data, logs are also replicated.

So, here is how Bigtable is architected. The key is to layer it on top of a file system. The file system does not allow files to be updated. So, how on earth do you build a storage system which allows updates? You cannot update a file. GFS provides no consistency guarantees if you update it. It may update one copy, and if something happens, the other copy may not get updated. So, you really do not want to do updates in GFS. So, how do you build a storage system on top of it? The key thing is that you have files which are write once. They are written once. They may get appended to, but they are never updated. So, how do you do this trick? That is the magic of Bigtable. Any questions?

Is there any guideline available for the user to interpret the data that is coming out of any of these systems? There are multiple choices. The technology is fascinating and your elucidation is very, very brilliant. Is there a guideline or a set of rules as to how the user can interpret the data in any of these file systems?

So, in all of these systems, there is a range of how the data is interpreted. Some of these systems tell the user: store any bytes you want in here; what is inside it is your headache, I do not care at all.
Other systems say that, look, whatever is stored over here, the data should be in JSON format, which gives the system an ability to fetch particular fields out of that record. If you are not familiar with JSON, JSON is a format somewhat like XML, but a simpler version, which can be used to store fields and objects and so forth. Others, like Bigtable, give you a sort of relational model. They have columns, and you can fetch a particular column of a particular table, and they also support versioning. Bigtable allows you to keep multiple versions of a particular data item, or a particular column value can have multiple versions, which is very useful in certain situations. So, how to interpret the data, and which version to use, is all left to the programmer.

More specifically, the point I wish to make is, say for example, right now I am in a very fortunate position that I can move out of the studio, unlike you in this case. And I can go down; I went to the library on the ground floor, you are still very audible, your lecture is audible, and I was just going through the stacks. To my pleasant surprise, I found a wonderful collection, The Collected Works of Mahatma Gandhi. This was published by the Ministry of Information and Broadcasting. The copyright is with the Navajivan Trust, Ahmedabad. In '66 the copyright; in '79 the Government of India took over. Randomly I opened it; it was a speech at Chapra. I found it very interesting, because right now Chapra is in the news for all the wrong reasons. The Dibrugarh Rajdhani Express derailed, it is at Chapra, and they are saying sabotage and all that. His speech at Chapra was on December 6, 1920. And the first line was very interesting. "Mahatma Gandhi began his address after sitting in a chair", that is how it starts, the first line. God only knows for what reasons it is like that. Now I halt here. If we add personalization in some manner, as we track the user today, all that you are telling us means that as I keep accessing your system, the system will get to know more about me. The implication of all this is that the system will get to know more about me, and it can tune the retrieval according to what I have been doing for some time, my track record. So if I halt here and go, in the sense that this is the narrative, I have read a lot of things into this simple snippet. What are the chances that the trackers will track my further transactions for what I have done with this piece of information too? Am I making myself clear on that?

I think you are asking a lot of questions, which range from philosophy to privacy and so on. I am not really qualified to answer any of those. A doctor of philosophy degree does not qualify you to necessarily talk about philosophy. You know more about it than I do, so I will not try to answer that part. But to take up the last part, the connection is that systems do track what you do. So every single thing which you do could either turn into an update of some record, which maybe tracks the last few things you did, or it could turn into an append to some kind of a log. But in either case, that information will eventually be used in some way, either immediately or later, and the way it is used typically is to target ads. So whoever displays these ads needs to get information about you, including the last search you did, the last thing you clicked on, and this is one of the major motivations. There are many motivations, but this is one of the motivations for high performance scalable transaction processing.
That is the motivation, but let us come back. So the key, as I said, is to have files which are written once, no updates, but still to allow updates in the storage system, and the way that is done is as follows. The database consists of multiple tablets. Each tablet actually has many files called SS tables. What is an SS table? You can think of an SS table kind of like a B plus tree of some form, a complete B plus tree, which looks kind of like this. It is not really a B plus tree, it is more of a flat index, but you can think of it as somewhat like a B plus tree. It has data broken up into 64 kilobyte blocks, and there is an index on that. So think of these as the leaf levels where the data is stored, and the data is sorted in order. So you can build an index on the sorted data. That is an SS table.

So the idea is, as data is inserted... it is coming up in the next slide. I am just going to skip two slides and then come back here. So here we go. As you insert data, there is going to be a particular table, called the memtable, which is in memory; there is one such in-memory table for each tablet. So here is a tablet. Its memtable is in memory; all its other SS tables are on disk. The idea is, as inserts come in, they will all be done on the in-memory memtable. They will not go to the disk tables yet. In addition, they will be logged to a tablet log, which appends log records to a log file. That is something which GFS supports, appending records to a file. The records can be small, a few bytes, but you can append them to a file. So every insert which comes to this tablet will result in an insert into the memtable. But you will also notice that in the log there are deletes. If a delete comes, what would normally be done by a database is to go to the SS tables for that tablet and, wherever that record is present, delete it. That is what you would normally think of too. But as I said, these SS tables are files, and files in GFS really are not suitable for updating. So you are not allowed to update this file. So how on earth do you delete a record which is present there? The key is to keep deletion entries. If you do an update, you delete an old record and add a new record, or delete a field and add a field. So the memtable would have these deletion entries, which say that a particular record from this SS table or that SS table should be deleted.

Now what happens as you get more inserts or deletes? The memtable fills up. Once memory is full, what do you do? You cannot keep adding more memory. So you are going to write this memtable out onto disk in the form of a new SS table. So this tablet, which had two SS tables, now has a third SS table. So far so good. What if you do a look-up on this tablet? You come in here, you look up in the memtable. But the data you are looking for may not be in the memtable. It may be in one of these SS tables. In fact, you have to go to every one of the SS tables, look into the SS table and use the index. As we saw, each SS table has an index, so you can look at that index and fetch the relevant data from that SS table. But there may be another SS table with the data; in fact, any of the SS tables could have the data. So you go to every SS table, look up the index and see if the record is there. So what has happened? Inserts are relatively cheap, but look-ups are somewhat more expensive. Why? Because they have to look at every single SS table in that tablet. So if there are a lot of SS tables, there are a lot of IO operations. That is a very high price to pay.
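Just to make that write and look-up path concrete, here is a toy sketch of my own. The names are hypothetical and this is not actual Bigtable or HBase code; it ignores the tablet log, block indices and versions, but it shows writes going to a memtable, flushes producing immutable SS tables, deletes becoming deletion entries, and a look-up having to consult the memtable plus every SS table.

```python
# Toy sketch of one tablet's write/look-up path: new writes go to an in-memory
# memtable, full memtables are flushed as immutable SS tables, and a look-up
# may have to consult the memtable plus every SS table of the tablet.
# Illustrative only; real systems add a tablet log, indices, versions, compaction.

TOMBSTONE = object()          # a deletion entry: marks a key as deleted

class Tablet:
    def __init__(self, memtable_limit=4):
        self.memtable = {}            # in memory, mutable
        self.sstables = []            # "on disk", immutable; newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def delete(self, key):
        self.put(key, TOMBSTONE)      # never touch old SS tables; just record a deletion

    def _flush(self):
        # write the memtable out as a new immutable, sorted SS table
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # check the memtable first, then every SS table from newest to oldest
        if key in self.memtable:
            value = self.memtable[key]
            return None if value is TOMBSTONE else value
        for sstable in reversed(self.sstables):
            for k, v in sstable:      # a real SS table would use its index here
                if k == key:
                    return None if v is TOMBSTONE else v
        return None

t = Tablet()
for i in range(10):
    t.put("row%d" % i, "value%d" % i)    # forces a couple of flushes
t.delete("row3")
print(t.get("row1"), t.get("row3"))       # value1 None
```

You can see the cost I mentioned: a get may have to walk through every SS table of the tablet, which is exactly why the number of SS tables has to be kept small.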
So the key idea is that as you add more and more SS tables, you will start merging these SS tables into bigger SS tables, so that the total number of SS tables which you have is not too large. Now, I also said there is a tablet. What is a tablet? A tablet has data in a particular range. So supposing we are storing strings, this tablet has strings, key values, which start from aardvark and end at apple. Anything greater than apple or less than aardvark is in some other tablet. So, you have to talk to the master and find out, when you look up a particular key: if I look for the key Gopal, which tablet contains the key Gopal? You have to find out. You cannot have two tablets containing the key Gopal; it can only be one tablet. So you have to find out which that one tablet is and go there. In that tablet, there may be many SS tables. You have to look up each of those SS tables to find Gopal. So what has happened is, tablets are the unit of partitioning. They contain a range of rows based on the key. So the range here is aardvark to apple. Now, across tablets, ranges cannot overlap, but within a tablet, this SS table and this SS table can both have overlapping data. That is why you have to search all of them.

So there is this word immutable. What does immutable mean? Cannot be updated. Once an SS table is created, it is never going to be updated. What is instead going to happen is, as you get more and more SS tables for a particular tablet, you will merge all of them into one single SS table. Now, in an SS table, the data in these blocks is sorted. When you merge, you create a single sorted file with multiple blocks, a larger file, with an index on that. Once I have merged, let us say, 3 or 4 SS tables into one new SS table, I can throw out all the old ones and keep only the new one. And henceforth, I do not have to look up 5 SS tables. If I do a look-up, I go to the merged SS table and look up only that one. That is the key idea. But of course, life is not so simple. I just did the merge; meanwhile, new inserts have come, and a new SS table is being formed. So after some time, more SS tables get created. Again, I have to merge. So it is a process which is ongoing. And the fact that SS tables are immutable simplifies caching, sharing across GFS and so on. Consistency is not an issue. There is no need for concurrency control at the level of SS tables. There may be a need for concurrency control at a different level, at the tablet level, but not here.

Now, how do you know what are all the SS tables corresponding to a particular tablet? Note, an SS table is a file in GFS. So I need to know which files in GFS correspond to a particular tablet. That itself is stored in another table called the metadata table. Now, whenever I do a merge, the old SS tables can be discarded, and they are garbage collected by the master. That is one of the jobs of the master. Now, what happens is, as I get data into a tablet, I keep adding more SS tables and merging SS tables. Eventually, one of two things may happen. One is that a tablet has so much data that you really do not want to keep such a big single tablet. I want to split it because the tablet has become too big. Maybe I want tablets to be 100 to 200 megabytes, and with a lot of inserts it is going to 500 megabytes. I want to split it. The other reason could be that there are a lot of look-ups or inserts on a particular tablet. The load is very high, and it all goes to one machine. Now, if I split that tablet, maybe I can divide the load amongst two machines or three machines.
So I may split to balance the load. Both of these can happen, and it is the job of the master to make these decisions. Now, only the memtable has concurrent reads and writes, so you have to have some concurrency control on the memtable. I will not get into the details. So now, when you want to delete or update an entry, you log what is called a mutation. A mutation could be a deletion or an update. You log it in the tablet log and you record it in the memtable. Now, when you do a look-up, I am going to look up this SS table and this SS table, but I am also going to look up the memtable, and I will find that, hey, I was looking up Gopal, and Gopal was deleted. There is a deletion entry here, and I also find the Gopal record in this SS table, but there is a delete Gopal here. They cancel each other, and Gopal is deleted. But now suppose there is an update on Gopal. So, there is a fresh insert of Gopal here, and I will find that and I will return that. So, that is how that works.

So, a Bigtable table has multiple tablets, and each tablet is assigned to a server. Replication is not an issue for Bigtable, because it is handled by GFS. All updates and look-ups on a tablet are routed through the tablet server. So, in effect, this is a primary copy. What do I mean by that? The point of a primary copy scheme was that you do not want two different nodes where updates are happening on the same data item. Because there is only one tablet server managing a particular tablet at any given point in time, all updates go to that tablet server. So, you cannot have inconsistent updates at two different nodes for the same tablet. The last part of Bigtable is how to find tablets, and this slide basically shows there is another system called Chubby which helps you find the root tablet. From the root tablet, you will find metadata tablets, and then go to the actual user table and find the data in there. I am going to skip the details over here. So, the last thing about Bigtable is that it really does not support any meaningful transactions, except very simple single row transactions. You can have atomic read-modify-writes on a single row, but no transactions across rows. It does not support secondary indices at all. So, that is a limitation.

Now, there is a parallel system called PNUTS which takes a different approach. It does not layer on top of GFS; it has databases at each node, and it handles distribution, replication, all that by itself. And a lot of its focus is on geographic replication. So, let me explain a little bit more about that. The architecture is like this. You have clients, and they connect through an HTTP API to routers. The router knows where to send any request. There are many storage units; just like Bigtable, there are tablets in PNUTS also, which are stored in the storage units. There is a tablet controller, like the master in Bigtable; it is very similar. But there is one extra thing here, which is a message broker, which is not there in Bigtable. And the idea of the message broker is, when you do updates here, I want the updates to be replicated to some other site which is somewhere else. And the way that is done is by sending a message to the broker, who will then forward it to the other sites. And this guy will do persistent messaging, essentially. That is the job of the message broker. And note a few things here. You can have many routers. You can have many storage units.
You can also have many message broker copies. These copies work together to provide a persistent messaging service, and it is scalable: you can add more message brokers to handle a larger load. That is the key idea here. Now, this is the remote part. This is one region, the same as what we saw before. You can have another remote region which looks identical, and yet another remote region which also looks identical. Any update done here might be sent to this region and to that region through the broker, which is called the Yahoo Message Broker, YMB. This YMB will send updates to that YMB and to that YMB, which in turn will apply them to the local system, which updates the tablets over there. The Yahoo Message Broker is what is called a publish-subscribe service; I will not get into the details for lack of time. But now, let me focus on what is different in PNUTS compared to Bigtable. Bigtable does nothing about concurrency. It does not care; it is the programmer's job to deal with anything that happens. In PNUTS, on the other hand, there is built-in support for a form of version numbering, which programmers can use to build some primitive concurrency control of their own. The way this whole thing works is that updates happen at the primary copy and are then sent lazily to the replicas. So, a replica may not have received the latest update yet; maybe it has reached replica 1, but not replica 2. If you read replica 2, it might give you an old value. A fully consistent system would not let you read old data. PNUTS will, and the reason it lets you do that is that there are applications which may not care about ACID properties; they just want a reasonably recent value. If you are doing a train booking and you want to know how many seats are available, you know very well that the number of seats changes very frequently. If you get a value which is 3 minutes old, it may not matter; except if you are booking a Tatkal ticket, 3 minutes is a huge amount of time, but for everything else it really does not matter. So, many applications can live with slightly old data. Now, a read can provide a version number; it can also be a timestamp. You can say, give me data which is newer than this version number, and if the data at the replica is older, the system will go back to the master and fetch the most recent version. So, essentially, what you have got is the ability to read old versions, and you have to program the concurrency control management yourself. There are a few slides here on the consistency model, where you have different versions v1, v2, v3 of a particular data item, and a transaction can say, give me v7, and it will get v7. It may say, give me a version greater than or equal to v7, and whichever node it goes to, that node will return what it has if that is v7 or newer; if what it has is older, the request will go to the primary and get it. So, you can have reads which fail. You can also say, read the latest version, which will always go to the primary. I am going to skip some details. There is a final useful thing called test-and-set with a required version. What this does is the following. Suppose I read the record at version 7, I have updated a field, I added a value, and I want to save it back. The PNUTS API allows me to say: test-and-set this data item, provided the version number is still 7, and this goes to the primary copy, which has the latest version. If the primary realizes that the latest version it has is 8, and this client is trying to do a write based on version 7 which was read earlier, that write will be rejected. So, the test-and-set will be rejected if the version is 8. On the other hand, if the latest version is still version 7, the test-and-set with version 7 will succeed and you will have a write. The write will, of course, create a new version.
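Here is a toy sketch of both of these interfaces: a read with a minimum required version, and a test-and-set conditioned on a version number. The function names (read_at_least, test_and_set) and the structure are my own illustration, not the actual PNUTS API.

```python
# Toy sketch of version-constrained reads and test-and-set, in the spirit of PNUTS.
# Names and structure are illustrative only, not the real PNUTS API.

class Copy:
    def __init__(self, value, version):
        self.value, self.version = value, version

def read_any(replica):
    # Cheap read: whatever the local replica has, possibly stale.
    return replica.value, replica.version

def read_at_least(replica, primary, min_version):
    # Read that demands a version >= min_version; go to the primary if the replica is too old.
    if replica.version >= min_version:
        return replica.value, replica.version
    return primary.value, primary.version

def test_and_set(primary, new_value, expected_version):
    # Write that succeeds only if nobody has written since the reader saw expected_version.
    if primary.version != expected_version:
        return False                        # rejected: a newer version already exists
    primary.value = new_value
    primary.version += 1                    # the write creates a new version
    return True

primary = Copy("seats=42", version=9)
replica = Copy("seats=45", version=7)       # lazily replicated, slightly behind

print(read_any(replica))                    # ('seats=45', 7): fast, possibly stale
print(read_at_least(replica, primary, 7))   # served locally: version 7 is good enough
print(read_at_least(replica, primary, 9))   # too old locally, so the primary answers

print(test_and_set(primary, "seats=41", expected_version=9))   # True: version becomes 10
print(test_and_set(primary, "seats=40", expected_version=9))   # False: stale write is rejected
```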
There are some more details about record and tablet mastership, which allow the primary copy to be moved. So, if a particular site dies (let me just finish this slide and then I will answer your question), any record for which that site was the primary will now have to be shifted to another site. There is a transfer of mastership, and there are protocols for all of this, which I am going to skip. Now, let me take your question. Question: I discussed the CAP theorem, and I was mentioning that consistency can be relaxed a little bit, gradual consistency; your earlier three or four slides were pointing to that, and I wish the participants have caught that point. There are stale versions, there are other versions, and the coordinator was also mentioning that the replicas are there; we do not have to wait for all replicas to get consistent data, gradually it can evolve. Naveeta Devi was also saying this. Answer: Yes, indeed, I do have slides on the CAP theorem and eventual consistency coming up; I will talk about it. So, the next system after PNUTS was Megastore from Google, which was built as a layer on top of Bigtable. The key thing it added to Bigtable is geographic replication, but there is another very important notion it added, which neither Bigtable nor PNUTS had, something called an entity group, and I will tell you what that is. In fact, Megastore provides ACID properties for transactions that access only a single entity group. So, what is this entity group and what is going on? An entity group is a set of related rows. For example, I have a user and there are many records for that user. If I am Gmail, or let us say I am WhatsApp, I am storing data about the user, including the messages sent by that user and so forth. Now, there are certain situations where I have to perform transactions on the data of a particular user, and I want all the records for that user to be co-located, so that I can efficiently access all of them together. If I have a transaction on one entity group, that entity group will reside on only one machine, so it is relatively easy to support transactions within that one machine. In fact, Megastore supports ACID transactions within an entity group.
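To make the entity-group idea concrete, here is an illustrative sketch of the usual trick: give every row of a group a key that starts with the group's identifier, so that in a range-partitioned store all of the group's rows fall in one tablet, and hence on one machine. The key layout shown is my own example, not Megastore's actual schema.

```python
# Illustrative sketch: co-locating an entity group by key prefix.
# The key layout is my own example, not Megastore's actual schema.

rows = {
    "user:alice/profile":  {"name": "Alice"},
    "user:alice/msg/0001": {"text": "hi"},
    "user:alice/msg/0002": {"text": "lunch?"},
    "user:bob/profile":    {"name": "Bob"},
}

def entity_group(user_id):
    # All rows for one user share a key prefix, so in a range-partitioned store
    # (tablets hold contiguous key ranges) they end up in the same tablet / machine.
    prefix = f"user:{user_id}/"
    return {k: v for k, v in sorted(rows.items()) if k.startswith(prefix)}

# A transaction that touches only Alice's rows touches one entity group,
# hence one machine, so full ACID guarantees are cheap to provide for it.
print(entity_group("alice"))
```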
But not all transactions can be done within a single entity group, so what do I do then? I have various means: I can use two-phase commit, or I can use persistent messaging, which is a nice building block for transactions that span multiple nodes. So, the traditional solutions are in fact supported by Megastore. Neither of these was supported by Bigtable or PNUTS, but Megastore supports 2PC and persistent messaging. The reason it did this is that many programmers realized that, yes, it is very nice to have the flexibility of reading old data and so forth, but the price you pay is that you have to manage consistency yourself. It is a lot easier if the system provides a direct way to ensure atomic commit through 2PC. Megastore also supports secondary indices and so forth. I am going to skip this entity-group slide and show this quick picture of Megastore. These vertical cylinders here represent data centers. They may be far apart from each other; they may be on different continents. Within a particular data center, you have multiple entity groups, but the same entity group might be replicated in different data centers. So, if you perform an update on an entity group here, its replicas at these two centers also have to be updated. And using either 2PC or persistent messaging, you can get ACID semantics within an entity group; sorry, you do not even need 2PC: Megastore directly supports ACID properties for transactions within an entity group. 2PC and persistent messaging are for transactions that span two different entity groups. And the data is stored in what the slide calls a NoSQL database; it is actually Bigtable which is used underneath Megastore. So, this picture shows two-phase commit as one way to do cross-entity-group transactions; or you have a persistent queuing system here, which lets you send a message which is delivered later. I am going to skip the other details of Megastore. It has multiversioning and snapshot isolation; I am going to skip that and also skip this slide in the interest of time. So, what we have so far is that people started very simple with Bigtable, which gave only a few features, and there were a lot of applications that really needed scalability across thousands of nodes and were willing to live with the minimal features that Bigtable provided. But as things evolved, people started asking for more and more from Bigtable, and the result was Megastore. People continued to ask for more and more from Megastore, and Megastore itself had some architectural limitations. So, the next thing which came from Google is something called Spanner, which is a descendant of Bigtable and a successor to Megastore. What Spanner does is, it has what are called directories, which are somewhat like entity groups but with some more flexibility; that is at the level of data representation. At the level of transactions, it is a multiversion database, so you can create multiple versions of data, and it supports general-purpose transactions, that is, ACID properties. Sorry, there is a mistake on this slide: Spanner does not support SQL directly, but it supports some DML aspects of SQL. The full SQL query language is in a layer on top of Spanner called F1, which is coming up. The focus of Spanner, though, is on managing cross-data-center replication; global distribution is key to Spanner. It was there in Megastore, but Spanner put a lot more effort into doing it in a nice way. Amongst the issues are synchronous data-center replication, so that if you update in one data center, the update will also happen in the other centers; it is scalable, partitioning is transparent, and so on. But the key new things in Spanner are on this slide. The first key new thing is that Spanner provides what are called externally consistent reads and writes. The idea is this. When you have multiversioning, there is a notion of a version number; if you have looked at multiversion concurrency control, there is always a version number. That version number is usually an artificial counter in the database, and in a centralized database such a counter is not an issue.
But when you have a distributed database, a distributed counter becomes very problematic. How do you know which counter is less and which is more? It is all a pain. What Spanner decided is to use timestamps instead of these artificial counters. The benefit is that a timestamp is a well-recognized thing. If I say, give me this data item as of this timestamp, that means something in the external world: I have a wall-clock time, I know when updates happened, so I can talk of this data item at this time. So now, each transaction has a timestamp corresponding to actual time, not an artificial counter. Of course, you have to ensure uniqueness, so you add a site ID and a few other things. But the key problem with timestamps, which is the reason they were not used very widely before, goes back to work in the 70s, where people recognized that computers' clocks run at different speeds. If you take two different machines, their clocks are not going to run at the same speed. This is well known; the clocks are not that accurate. So, you might think that right now it is 4:31, while I might think it is 4:33. If you commit a transaction at 4:32 as per your time, and I run a transaction at timestamp 4:33, I should see your update. But the problem is that it is already 4:33 for me, and I do my read at 4:33 while your update has not even happened yet. When you do the update later, you will set its timestamp to 4:32, which is before my read. We have a problem; this should never happen. So, how do you deal with it? There are two parts to the solution. The first part is to keep computers reasonably in sync; these kinds of large, minutes-apart differences are not acceptable. So, what is a good source of very accurate time? It turns out that GPS, which you use in your mobile phones these days, is based on extremely accurate clocks, and if you just attach a GPS unit to your computer, you can get time from the GPS satellites, which is very accurate. You do not actually need one of these per computer; you would have one per data center and get time from GPS. Google also worries about what happens if at some point there is no GPS satellite available overhead to get the current exact time, so they also have atomic clocks, which are very accurate. The second part is that, even with all of this, you cannot keep checking the time from GPS for every single lookup; that is too frequent. So, what you do is, maybe every few microseconds or every millisecond, something like that, you go and read the GPS clock to get the current time. But after you read it, you have a local clock which counts time from there. So, I got the time as 4:31, and now my watch is ticking: 1 second, 2 seconds, 3 seconds and so on. If I synchronized the watch at 4:31, it was pretty accurate then, but by the end of a minute it may be 0.1 seconds off if the watch is not quite accurate. So, by the end of the minute, I am thinking the time is 4:31:59 and a half, while somebody else thinks it is 4:32. Now, this notion of precise time may not be a big deal for footballers. How many of you have been watching the World Cup? Anybody? Nobody? A few of you surely do, or at least read about it. The interesting thing about time in football matches is that it is decided by the referee. The referee keeps the watch, and he knows the exact time since the match started.
But the referee is going to add a few seconds for injury time or slow substitutions or whatever else caused delays; it is all in the mind of the referee. This is like cricket in the old days, before we had all these decision review systems and the third umpire and all that: primitive technology. Football is very primitive today. So, the referee decides how much time to add. You do not want to live like that in the context of computers; you want something more precise. So, the idea is that you keep correcting your watch very frequently, so that you are never too far off from the correct time. And the TrueTime API will not only tell you the time, it will also tell you how accurate that time is. It will say that right now the time is 4:37 and 27 seconds, plus or minus half a second. So, it has an idea of how far from exact it could be, and it gives a pretty tight bound; it is very unlikely that the time is more than that much away. That uncertainty is now known to the system, and you can use it in the following way. Suppose I want to commit a transaction at 4:38, and the time now may be 4:37:57 and a half, or 4:37:58 and a half; I am not sure which. The trick is that I wait a few seconds. After that wait, I know that, even with the uncertainty, the time has definitely crossed 4:38; that is guaranteed. Therefore, if I just hold the locks until then, the transaction's timestamp can safely be assigned as 4:38, because by the time I release the locks, the real time in the world has definitely crossed 4:38. I may have held the locks a little longer than required; that is okay, but I will never release the locks too early.
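Here is a toy sketch of that commit-wait idea: a clock that reports an uncertainty interval, and a commit that holds its locks until even the earliest possible current time has passed the chosen timestamp. The class and method names are my own simplification of the idea, not the real Spanner or TrueTime API.

```python
# Toy sketch of commit wait with an uncertainty-aware clock, in the spirit of TrueTime.
# Class and method names are illustrative, not the real Spanner/TrueTime API.
import time

class UncertainClock:
    def __init__(self, epsilon):
        self.epsilon = epsilon                        # bound on clock error, in seconds

    def now(self):
        t = time.time()
        return (t - self.epsilon, t + self.epsilon)   # [earliest, latest] possible true time

def commit(clock):
    # Choose the commit timestamp as the latest possible current time...
    _, commit_ts = clock.now()
    # ...then hold locks until even the earliest possible time has passed commit_ts,
    # so that when the locks are released, every correct clock agrees the timestamp is in the past.
    while clock.now()[0] < commit_ts:
        time.sleep(clock.epsilon / 10)
    return commit_ts                                  # locks can now be released safely

clock = UncertainClock(epsilon=0.005)                 # 5 ms uncertainty, chosen just as an example
ts = commit(clock)
print("committed at timestamp", ts)
```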
With all that, you can get snapshot reads: you can say, give me the data as of this time across all data centers, and I get a nice transaction-consistent, cross-data-center read, which is very nice. There are, of course, a lot of details which go into this; there is a nice paper which describes it. It is not as simple as you would think; it is a fairly hairy paper, but it is possible. And now, with timestamps, you can also do multiversion concurrency control; you can get serializable transactions if you want. So, Spanner is a very nice new addition which gives you this kind of real-time consistency using the clock. Spanner was actually used in a database system called F1, which replaced the earlier sharded MySQL system that was being used for the ads business. What is very nice about it is that it is a full-fledged relational database system which supports SQL, constraints and so on, but it is also completely scalable: you can keep adding nodes to Spanner, and F1 performance will keep improving incrementally. You also get availability, consistency and so on. There are some other details which I will skip, but this is how the world is evolving: towards relational databases which support full transactions. The price you pay is high commit latency. If you want to commit, you may not be able to commit instantly; it may take up to a second for the commit to happen. But the trick is that most of the time you have a very large number of transactions that need to commit, and they are unrelated to each other. I have 10,000, or maybe 1 lakh, people booking tickets on IRCTC at Tatkal time. Now, if it takes one second to commit a particular user's transaction, and the next person has to wait a further one second, and the next person one more second after that, we are in deep trouble: in one hour, I can only commit 3,600 transactions. That is horrible. But if all these commits can happen in parallel, and each one takes a second, it is okay; I can handle all 1 lakh transactions in the hour, not a problem. That is the key trick here, and that is where the world is headed today for many applications. But now, let us come back to the CAP theorem and consistency, which Professor Gopal was alluding to. I am going to talk about the basic principles. There is a notion of consistency in databases which is with respect to integrity constraints, and ACID properties in general, for a centralized database; anyone who has done a database course would be familiar with this. The key new issues which come up in a distributed setting have to do with replication. All this talk of consistency in the CAP theorem and so on has nothing to do with the traditional database notion of consistency; rather, it is purely about replication. The key thing is that replicas must be kept consistent with each other. Now, there is a notion of strong consistency, where you are allowed to read from a replica or write to a replica in any order, subject to the constraint that everything you did, and everything others did concurrently with you, can be put into a single serial schedule. This is similar to the notion of serializability of transactions in databases, but here it is with respect to a single data item, not a transaction spanning data items. For a single data item, all the operations which happened can be viewed as happening on a single copy; this is also called single-copy consistency. Even though different clients were reading different replicas concurrently, the values they got back would be equivalent to some serial schedule on a single copy of the data item. Strong consistency can be ensured, but there is a cost, so many systems go with weak consistency. The key issue here is availability. In a distributed system, if a single node goes down, you do not want the system to stop; it should continue to be available, because with thousands of nodes there is a very high chance that some node is down at any given time, and the system should continue anyway. Now, there are distributed consensus algorithms which allow you to update the replicas in a consistent manner; I will not get into their details. They work nicely when one or two sites are down; you can continue updating. The majority protocol, for example, will work quite fine if one or two sites are down. But suppose there is a partition: you needed to write to a majority of the sites, but the system is partitioned three ways and no partition has a majority. Then writes cannot happen in any partition. Even if it is partitioned into two parts, only one part has a majority; if a transaction happens on the other side, it cannot commit. So, what happens is that transactions get stuck.
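As a tiny illustration of why the majority protocol sacrifices availability under a partition, here is a sketch of a write that commits only if a majority of the replicas can be reached. The names, and the crude way the partition is modelled, are my own simplification.

```python
# Sketch of a majority-quorum write: it cannot commit if no majority is reachable.
# Replica handling and partition modelling are heavily simplified; names are illustrative.

class Replica:
    def __init__(self, name):
        self.name, self.data = name, {}

def quorum_write(replicas, reachable, key, value):
    majority = len(replicas) // 2 + 1
    acked = [r for r in replicas if r in reachable]    # replicas on our side of the partition
    if len(acked) < majority:
        return False                                   # minority partition: the write must wait
    for r in acked:
        r.data[key] = value                            # apply the write on the reachable majority
    return True

r1, r2, r3 = Replica("r1"), Replica("r2"), Replica("r3")
replicas = [r1, r2, r3]
print(quorum_write(replicas, reachable=[r1, r2], key="x", value=5))   # True: 2 of 3 is a majority
print(quorum_write(replicas, reachable=[r3], key="x", value=9))       # False: minority side cannot commit
```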
Is that a good or a bad thing? It depends. If you are a bank, it might be a good thing, because you do not want people to withdraw more money than they have. But it turns out that even banks are not that worried about people withdrawing a little bit of extra money, because they feel they can go back and recover it from them. They know who their customers are: we have KYC norms, we have an introducer, and the banks presumably know who we are. If you take 10,000 rupees extra because your ATM was cut off from the network, that is a risk they are willing to take. So, what they are doing is giving you availability at the expense of consistency. There is an inherent tradeoff here, which was made visible by Eric Brewer in a talk he gave. The way he put it is that there are three properties of the system. There is consistency, which ensures that all copies have the same value. There is availability, which means the system can keep running even if several parts have failed, even if there is a partition; the idea is that you use replication for this. And the third thing is that partitions can occur: you cannot prevent partitions, because the network can fail. What Brewer's so-called CAP theorem says is that you can have at most two of these three properties. That is actually a confusing way to look at it. A better way to understand Brewer's theorem is to say that any large system can partition at some time, and given that, when a partition happens, you can choose either consistency or availability. If you look at it that way, the idea is fairly intuitive. If you want consistency, you only allow updates in the partition which has the majority of the copies; the other partitions wait. So, they do not have availability; they sacrifice availability for consistency. Conversely, if you want updates to happen even during a partition, you may allow two partitions to update the same data item. That means consistency is gone, but the benefit you get is availability. So, you have to choose between these. Which one do you choose? Sometimes it is decided at the level of the whole application. There are many web applications which say: look, I really care about availability, I do not care about consistency that much, and I will go with availability, except maybe for some key parts. If I place an order, I expect the money to be charged and the item to be delivered. If I lack consistency, meaning the amount is charged but the item is not delivered, I will get upset; if the amount is not charged but the item is delivered, the company will be upset. So, there you want consistency, and availability can be sacrificed in that situation. You want the application to be able to make this choice. So, many systems allow replication with weak consistency, which has two forms. One is that you allow writes to happen even though you do not have access to a majority. The second is that you allow reading of old data, because you may be in a minority partition with an old version of the data, and you may read it. But there is another reason for reading old data, which is not to do with availability per se, but with latency. What do I mean by this? It may take tens of milliseconds to read the current version of the data from a remote site. There are many applications which do not want to pay this price and are willing to read a local copy, even if it is not quite up to date. They are sacrificing consistency in order to improve latency. All of these trade-offs are made by applications. Traditional databases do not really allow you to make such trade-offs; that is not entirely true, but at one level they do not let these things happen. These new-generation systems let you make these choices in order to get good performance. But the key issue is that once you allow inconsistency to happen, what do you do next? I have two copies of a data item which have two different values. What do I do? A very crude way of dealing with it is to not even detect it: if you read one copy, you may read the value 5.
If you read the other copy, you may read the value 10. Good luck, I do not care. That is usually not acceptable. So, at the least, you want to detect the inconsistency at the earliest possible time, when the network comes back into shape. And once you detect it, you also need to resolve it in some way. Again, a crude way of resolving it is to pick one of the two versions and keep it, but that is not necessarily the right way. A better way may be to somehow merge the updates and get one new version; we will look at this in a minute. Now, given that you have mechanisms for detecting and resolving inconsistency, the idea is this: the system was partitioned and became inconsistent, and what you would like is that after some time all of this is detected and corrected, and the system becomes consistent again. This is called eventual consistency. For a period of time, the system may be inconsistent, with two replicas having two different values, but eventually this will be detected and the system brought back to consistency; that is eventual consistency. This approach, which provides availability at the cost of consistency, but with the goal of eventual consistency, and which is based on a notion called soft state, meaning items are allowed to become inconsistent but the updates will eventually be merged, is collectively known as BASE, as opposed to ACID. ACID means full consistency is required; BASE means that availability is more important than consistency: basically available, soft state, eventual consistency. So, that was to answer Professor Gopal's question. Let me open the floor for a few questions before I move to the last part of my talk, where I will talk about how you could do detection and resolution of inconsistencies; but before that, time for questions. Question: Is there any representation mechanism to cut down the size of the database, to represent it in some other manner? Everything is making it larger and larger; can we think of a representation mechanism which is slightly different and can compact it? Answer: Right, in fact, that is a very important question. I have not covered it at all in today's talk, but it is a very important area of research. One of the key things for that is compression, and what people have realized is that a very efficient way of getting compression is to store data not as rows, but as columns. So, there has been a lot of work on columnar databases in the last few years. One of the very important reasons for the efficiency of columnar databases is that it is very easy to compress the data. This kind of compression turns out to be a very good idea for systems which store large volumes of data and on which you run decision-support queries. So, columnar databases have proved to be a very good match for decision-support systems; but when it comes to transaction processing, columnar databases have very significant overheads, so they have to date not been successful in the OLTP scenario, which is the focus of my talk today.
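As a small illustration of why storing data by column compresses so well, here is a toy run-length encoding of a single column; values of one attribute sit next to each other, so runs of equal values are long. This is my own example, not the storage format of any particular system.

```python
# Toy illustration: run-length encoding one column of a table.
# Column stores keep all values of one attribute together, so runs of equal
# values are long and compress very well; this is not any real system's format.

def rle_encode(column):
    runs, prev, count = [], None, 0
    for v in column:
        if v == prev:
            count += 1
        else:
            if prev is not None:
                runs.append((prev, count))
            prev, count = v, 1
    if prev is not None:
        runs.append((prev, count))
    return runs

city_column = ["Mumbai"] * 4 + ["Chennai"] * 3 + ["Mumbai"] * 2
print(rle_encode(city_column))   # [('Mumbai', 4), ('Chennai', 3), ('Mumbai', 2)]
```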
So, depending on your needs, you would go either with a row store, which is the traditional approach these systems also use, or with a columnar store. But actually, life is a little more complicated: even traditional row stores already support columnar storage for very large fields. If you have a blob or a clob type, the system will probably store those columns separately, not as part of the main row. So, columnar storage already exists in a very limited, primitive form, and some of these systems, Bigtable for example, allow you to partition a table by column and store some columns together and other columns separately. So, it is potentially possible to do compression even in this setting; although I am not sure whether Bigtable supports compression, if you break up the columns like this, compression is more effective with this kind of columnar partitioning. Does that answer your question? I understand. There are many things that are still open, like what I said about Codd's rules; they are still far-sighted even today, after three decades. Yes, in fact, it is very interesting. If you look at these systems I have talked about, they do not support joins so far. So, how do you deal with that? The traditional way, which all these systems have been using in recent years, is denormalization, and people are realizing that denormalization hurts: if you do not keep your copies up to date, things go out of whack. So, people are now in some sense suffering from a lot of the problems that relational databases were plagued with in the early days, before Codd and others popularized normalization. In some sense, the whole thing is repeating itself. We will now have to go from here back to normalized representations, but with a different way of handling denormalization: denormalization should not be the programmer's responsibility, it should be the system's responsibility to keep denormalized data up to date. I did not talk about it in my slides, but the PowerPoint version of my talk actually has some hidden slides about something called asynchronous view maintenance in PNUTS. What that supports is essentially this: it allows you to keep your data normalized, but also have denormalized views, and PNUTS will do the view maintenance; you can then read from these denormalized views to get very efficient access to certain pieces of information. That is actually a very good solution which was introduced in PNUTS; if you are interested, you can get a copy of the slides and look at it in more detail. Almost all the participants are taking down running notes, noting down the major points; nicely done. So, for the last part of my talk, I am going to be talking about these BASE properties and how they are supported and dealt with. As an example, I will use Amazon Dynamo, which was one of the early key-value stores. Amazon has pushed a lot on this notion of eventual consistency, through Dynamo and through other means. If you have used Flipkart, or Amazon, which is now also available in India, you know that there is a notion of a shopping cart. The idea at Flipkart, Amazon and so on is that at any point on the screen there should be a button, add to cart. When you see an item, you should be able to click add to cart, and that should always succeed. It should never fail saying: I am sorry, your current cart is on a different partition, I cannot accept your order. That is a no-no for Amazon; they do not want to lose orders like that. But now, people keep clicking add to cart, and at some point they say: I am done collecting a bunch of things, I will now place my order. When they place the order, you go to a different system, which provides the ACID properties.
But the shopping cart itself is something for which you are willing to sacrifice consistency for availability. So, think of it this way. I added items A and B to my shopping cart, say a shoe and a pencil, and that shopping cart was stored somewhere. Now there is a partition and I cannot access the main shopping cart, the majority of its replicas; I have one replica, or even none, so I can create a new shopping cart. And now I add something new, say a box of mangoes, to my shopping cart, and I have created a new shopping cart with only the box of mangoes. So, in effect, there are two inconsistent shopping carts: one which has my first two items and one which has my box of mangoes. As of now, I cannot see the other one. If I now try to proceed and check out, the system will show me only my mangoes, and maybe at this point I will say: wait a minute, I remember adding a pencil and something else, part of my shopping cart is missing. But hopefully this will not happen too often, and if I created a shopping cart earlier and have added an item now, by the time I proceed to check out, things will have come back into consistency. For that to happen, the system has to detect that there are two versions of my shopping cart, and then it has to merge the two shopping carts. Let me show how that is done. Dynamo is this key-value store. DynamoDB is actually now a web service for Dynamo, meaning you can use your credit card, order an instance of DynamoDB, and use it to store and retrieve data without doing any setup. That is one of the huge success stories of Amazon: it provides many services for which you pay online and start using, with absolutely no need to buy hardware. Coming back to Dynamo: it is a key-value store and it does replication, and the thing is that it does not guarantee that all copies are updated. The put interface says success as soon as the local copy is updated, and then it proceeds to update the other copies. It reduces latency at the risk of inconsistency. The programmer can actually control this: there are three parameters that can be set, whose details I will skip, to choose whether you want consistency or availability.
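In descriptions of Dynamo-style stores, these three parameters are usually called N, R and W: keep N replicas, acknowledge a write once W of them have it, and consult R of them on a read. Here is a toy sketch of that trade-off; the class and its structure are my own illustration, not Dynamo's implementation.

```python
# Toy sketch of Dynamo-style (N, R, W) tuning; names and structure are illustrative.
# With R + W > N, a read is guaranteed to overlap the latest write (more consistent);
# with small R and W, you answer faster and stay available, but may read stale data.

class Store:
    def __init__(self, n, r, w):
        self.replicas = [dict() for _ in range(n)]
        self.n, self.r, self.w = n, r, w

    def put(self, key, value, version):
        # Acknowledge success after writing W replicas; the rest would be updated lazily.
        for rep in self.replicas[: self.w]:
            rep[key] = (value, version)

    def get(self, key):
        # Consult R replicas and return the highest-versioned value seen among them.
        answers = [rep[key] for rep in self.replicas[: self.r] if key in rep]
        return max(answers, key=lambda vv: vv[1]) if answers else None

s = Store(n=3, r=1, w=1)            # fast and available, but weakly consistent
s.put("cart", "pencil", version=1)
s.put("cart", "pencil+mango", version=2)
print(s.get("cart"))                # the replica consulted here happens to have the latest version

s2 = Store(n=3, r=2, w=2)           # R + W > N: every read overlaps the latest write
```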
Now, given that you may lose consistency, you have to be able to detect inconsistency. The idea is that data items are versioned: each update creates a new version. In the absence of failures, there is a single latest version, but with failures, versions can diverge. I started with one shopping cart and I have created two different versions of it. How do you detect which of the following situations happened? Let us say I have one shopping cart with a pencil in it and another shopping cart with mangoes in it. It may be that I created a shopping cart with the pencil, then deleted the pencil and added the mango; in that case there is only one real latest version of the shopping cart, the one with the mango, and it just so happened that one replica still had an old copy with the pencil while the newer replica had the mango but not the pencil. In that case, it is possible to compare the versions and detect that the newer version was arrived at from the older version which had the pencil; it is simply a newer version, so there is no inconsistency, and that is fine. But it is also possible that I added the pencil in one version and the mango in the other, in which case they are inconsistent with each other, and I have to detect this. How do you detect it? There is a notion of what is called a vector clock, whose details I will skip for lack of time; but again, if you get the PowerPoint slides, you will see the details of the vector clock scheme, whose goal is exactly this: to differentiate whether what looks like the same state was arrived at in a consistent way or in an inconsistent way. So, let us say that check has been run and I have detected an inconsistency. What do I do? Should I throw away the pencil and retain the mango? Throw away the mango and retain the pencil? Or should I merge both of them? For the shopping cart, it is in Amazon's interest to merge both of them. Even if it is not sure whether what really happened was a deletion followed by an insertion, what is the worst that can happen? The customer deleted the pencil, and after some time they find it again in the shopping cart and say: fine, I will delete it again. Big deal. So, this merging is taking a union of the items in the two shopping carts; that is one way of resolving the inconsistency. But suppose it was a bank account. I had 1000 rupees; one update changed it to 1100, and another update also changed it to 1100. If I merge by replacing the balance with 1100 plus 1100, that is garbage: I do not have 2200 in my bank. The correct amount should have been 1200, because I deposited 100 in two places. How do I resolve that? If I track the operations, that is, if I track that I had a value and added 100 rupees here, and in another place added another 100 rupees, as two separate transactions, then I can detect that the account balance has been updated inconsistently by these two transactions, and the way to merge is to apply the first deposit and then the second. In either order; it does not matter, because addition is commutative. Whether I apply them first-then-second or second-then-first is irrelevant: the system will come to a consistent state. So, the idea is that if I keep track of the operations, I can resolve the inconsistency and come to a consistent state. What I land up with, if I detect and resolve the inconsistency, is an eventually consistent state. Now, when is the inconsistency detected? If somebody reads the two versions, they can detect it; or you may have a background process which periodically checks data items, finds inconsistencies, and hands them to the application to help resolve them. The resolution itself has to be done by the application, not by DynamoDB or the database by itself.
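Here is a small sketch of the two ideas just described: vector clocks to distinguish a plain newer version from two concurrent, conflicting versions, and a union merge for the shopping-cart case. This is my own toy version of the scheme, not Dynamo's implementation.

```python
# Toy sketch: vector clocks to detect conflicting versions, plus a union merge.
# Structures and names are illustrative, not Dynamo's actual implementation.

def descends(vc_a, vc_b):
    # True if version A has seen everything version B has (A is newer or equal).
    return all(vc_a.get(node, 0) >= n for node, n in vc_b.items())

def conflict(vc_a, vc_b):
    # Neither descends from the other: the versions were updated independently.
    return not descends(vc_a, vc_b) and not descends(vc_b, vc_a)

def merge_carts(cart_a, cart_b, vc_a, vc_b):
    if descends(vc_a, vc_b):
        return cart_a                # A already includes B's history
    if descends(vc_b, vc_a):
        return cart_b
    return cart_a | cart_b           # concurrent updates: take the union of the items

# Consistent case: pencil deleted and mango added, all through node X.
old = ({"pencil"}, {"X": 1})
new = ({"mango"},  {"X": 3})
print(conflict(old[1], new[1]))                  # False: new simply supersedes old

# Inconsistent case: pencil added via node X, mango added via node Y, independently.
a = ({"pencil", "shoe"}, {"X": 2})
b = ({"mango"},          {"Y": 1})
print(conflict(a[1], b[1]))                      # True: concurrent, conflicting versions
print(merge_carts(a[0], b[0], a[1], b[1]))       # {'pencil', 'shoe', 'mango'}
```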
So, that was a quick overview of allowing inconsistency and then resolving it. Any questions? Thank you very much. Professor Sudarshan is presently guiding about six research scholars on his campus. His research directions are wide-ranging; over the last four or five years, we have been fortunate to work with him very closely, including through the COMAD conferences and CSI SIG Data, which is one of the co-hosts. And as we have been discussing, sir, we have slightly altered the methodology: we are not handing over PowerPoint slides, but some of the original papers for the participants to read through, precisely to shake off too much dependence on pre-structured material. And for that, we do not have to go any farther than the original ideas, the legacy systems as they are called; they had a lot of ideas, the underpinnings of the same concepts. Some of the problems we are giving the participants are also oriented towards how the data comes, how to structure it, how to unstructure it, and when to unstructure it; these are the core problems they are looking at. Mrs. Kavita Devi and Dr. Shalini have come up with a handout of about 300 pages containing all those papers, which were sent to your kind self on Google Drive, and the participants have gone through them. There are plenty of research directions; he is being very modest when he says there are only six research scholars being guided by him, there are plenty of ideas. Please interact with him; you will get more opportunities to explore. He is heading the Department of Computer Science at IIT Bombay, so do interact with him. Thank you, sir. I actually have one or two slides left, conclusion slides; let me just wrap those up. And I do not have six research scholars; I have three. But on the other hand, there are a lot of master's students who also do research; all our master's students, in fact, are required to do research. So yes, I do have a lot of people working with me currently on research. So, to come to the last couple of slides, the conclusions. What has happened is that there is a new generation of massively parallel, distributed data storage systems whose goal is to address issues of scale, availability and latency, motivated primarily by very large-scale web applications. Many, many such applications have come up, and there are a lot of companies in this space. Many of those companies are now using things like HBase or Cassandra to store their data, and the list of people using these systems is growing. So, this is something people should be aware of. Earlier on, I would have said that these highly scalable systems are only of concern to Google and Facebook, not to the average person. But today there are many people building websites, and anybody building a website should probably use one of these scalable storage systems, just on the off chance that their website may be very successful and they may need to benefit from the scalability. Now, about the research ideas, since I have not talked about research per se: it turns out that many of the key underlying research ideas which go into these systems were all worked out long back, in the 70s and 80s; a few things happened in the 90s. In terms of pure academic research, most of the ideas are old ideas. However, what has happened in all these systems, in particular the path-breaking systems like GFS, Bigtable, Spanner and so on, is engineering at scale. All of these have been about beautiful engineering: how to make use of existing infrastructure, leverage it to the maximum, and build a system quickly which is very reliable and scalable. You have probably never seen Google's systems crash; that is a testament to how smartly their systems are built, to deal with all kinds of failures, completely mask them from you, and keep running no matter what happens. So, there has been some wonderful engineering. This discipline is now very ripe and is a great area for engineering contributions, even if core research contributions are a little harder to tackle here. That does not mean there is no research to do here; it just means that it is a little bit harder to get hold of it. There are things which we are working on in this space, but they are more limited.
Now, when you engineer a system, when you use a system like this, the developer needs to be aware of the trade-offs between consistency, availability and latency, and make the right choice for different parts of the application. They also have to know how to enforce that choice on the storage platform, to tell the platform what to do in which context. So, this is a lot more work for the developer. If you want scale and you want performance, well, you also have to do this extra work. The old days of just writing SQL queries, with the database handling the ACID properties and giving good performance for your needs, are gone for this category of highly scalable applications, to a large extent. However, engineers are busy at work; for example, Spanner plus F1 is slowly bringing back the days when you can almost just write SQL queries and the system worries about how to run them efficiently on a scalable system. That is the holy grail of engineering development in this area, and a lot of people are working towards it. But for now, developers need to be aware of these issues; and even if the systems do all this, developers will eventually still need to be aware of these trade-offs, because the system cannot ever manage them completely on its own. So, again looking at the future: there is a new wave. Programmers have to understand these details; they cannot work purely at the SQL level. The programming model has to deal with consistency levels, and you need better support for logical operations to resolve conflicting updates. What do I mean by this? I gave you the example of a shopping cart to which you can add items, so a shopping cart is like a set into which you can add things. This is a nice model: if you build your application on data structures like this, on which you perform operations, the operations may be done out of order at different sites, but if you have what is called logical monotonicity, everything will in the end come to the same consistent point. Then you can build systems like this with a guarantee that they will reach a consistent state; a small sketch of this idea follows. If you cannot do this, you may land up in the soup, where you have an inconsistent system and you do not know how to resolve it, which is very bad. So, there are two approaches. One is that you take the semantics of the operations into account, reason about them, and build a system which will guarantee some consistent state; that is one way to do it. If you cannot do that, and you do not care about consistency, fine, live life on the edge. Otherwise, use F1, or rather, since F1 is not public, future clones of F1, to get what you want.
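Here is a toy sketch of that add-only set idea: because merging by union is commutative, associative and idempotent, the replicas converge to the same state regardless of the order in which additions and merges happen. It illustrates the general principle of logical monotonicity and is my own example, not any particular system's API.

```python
# Toy sketch of a grow-only replicated set: additions can arrive at replicas
# in any order, and merging by union always converges to the same state.
# This illustrates the logical-monotonicity idea; names are mine, not a real API.

class GrowOnlySet:
    def __init__(self):
        self.items = set()

    def add(self, item):
        self.items.add(item)            # the only operation: monotonic growth

    def merge(self, other):
        self.items |= other.items       # union: commutative, associative, idempotent

# Two replicas of the same shopping cart, updated independently during a partition.
replica1, replica2 = GrowOnlySet(), GrowOnlySet()
replica1.add("pencil"); replica1.add("shoe")
replica2.add("mangoes")

# Merge in either order, any number of times; the result is always the same.
replica1.merge(replica2)
replica2.merge(replica1)
print(replica1.items == replica2.items)   # True: both converge to {'pencil', 'shoe', 'mangoes'}
```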
There are many more challenges: tools for testing and verification, and tools for choosing the right level of consistency. There is a whole bunch of research possible here on how to build efficient systems which run at scale. So, I will stop there with respect to future work. There are several references to papers on this. If you are interested and want to look at these issues in more detail, you can go to my web pages: Google for my name together with the course code cs632. Let me write it here: my name plus cs632. What is cs632? It is a course which I run here, based on research papers, and the offerings of the last few years have had a very major focus on big data. In this course, we cover a lot of research papers on data management in general. So, Google for this plus my name and go to that page; it is public. You can see what I covered in the spring 2014 offering and in earlier years. A lot of interesting research directions come up there, and I have the papers and PowerPoint slides, prepared either by me or by my students in the course, so it is a nice, useful resource for looking at some of the cutting-edge research. It is biased, I should say; biased in the sense that we tend to cover areas we are currently working on. There are many, many more research areas, but if you are interested, this is one of the places to look. There will be other sites like this, similar courses across the world; you can look for those also, and if you are interested in research in this area, this is a good point to start. So, with that, I conclude my talk, and thank you. Is there any last question? We can take it before wrapping up. Sir, on behalf of this workshop, I thank you very much, sir. We have had a wonderful session on massively parallel data storage. You have explained each and every term with a simple example and a real-world example, and you have also explained the future work that can be taken up on massively parallel data stores. So, I thank you once again, sir. I also thank Dr. T.V. Gopal, sir, for organizing such a wonderful session with Professor Sudarshan, sir, and I thank the D.C., sir, for providing this environment. Thank you, sir. Thank you, and best of luck for the rest of your workshop. All the best.