Welcome back after the break. Let us start this post-break session with questions from all of you. After taking questions, I am going to move on to another topic: distributed data stores, or NoSQL databases as some people call them. I will basically be exploring how to store data at really large volumes. We saw earlier how to analyze data at really large volumes; now the question is how to store it. We saw very briefly how to store data in a file system, but that is just one way of storing data. You want a data storage system that gives you more flexibility and lets you do other kinds of things. That will be the focus of the first part of the session. After that, time permitting, I will talk a little bit about information retrieval.

Now, the original schedule for this day had several other topics, which included data mining, warehousing, XML, object-oriented and object-relational databases and so on. The reason we switched is that we got feedback from many people that they wanted new material. The material I had originally planned to cover, but in the end did not, is probably not going to be in a basic database course; it is material which many people cover in an advanced database course. Of course, compressing all of that into three hours is neither here nor there; it really needs a completely different course. So I decided not to squeeze it in here. Information retrieval, though, is another area, like parallel processing, where some very interesting things are happening, and if time permits I would like to say a little about what is happening in that sphere. The other reason I would like to do it is that, as far as I know, it is not in the syllabus of most places as of now. So it is something which may be new to you and may be interesting.

So, let us start with questions. If you have questions, please indicate it on A-VIEW. Right now, I see that Samrat Ashok Technological Institute, Vidisha has their flag up. Samrat Ashok, please go ahead.

Yeah, thanks, sir. My question is: is there a chance of generating spurious tuples when you perform repartitioning for parallelism?

That is a good question. When you do repartitioning, is it possible for your data to get corrupted, so that you generate some spurious tuples or other spurious data? This is actually a very interesting issue. If you look at the specifications of disk drives, they will tell you the expected rate of uncorrected bit errors, meaning bit errors which happen when you transfer data from disk drives. Checksums will catch most of those errors, but a checksum itself is only a few bits. If it is 32 bits, then there is roughly a 1 in 2^32 chance that some unintended modification is not actually caught by the checksum. Now, in an era when a disk was a few hundred megabytes, first of all, most of the time there were no errors, and when there were errors, a 1 in 2^32 chance of missing one seemed perfectly fine.

Similarly, when you have data in memory, most of the time memory does not get corrupted; the probability of a piece of memory getting corrupted is again very, very small. But when you talk of memories of the scale of many gigabytes, it turns out that if you have a few gigabytes on your machine, then due to random alpha particles hitting your memory and so on, there are most probably a few corrupted bits on the machine sitting in front of you. You do not notice it most of the time.
Most of the data sitting there is not critical, so if a bit gets flipped, nothing dramatic happens. It is not frequent, but it can happen. These two trends put together, increasing disk and memory sizes and data processing which loads so much data, have now increased the chance that when you load data from a file, you get an error which you do not catch.

In fact, the new generation of file systems today handles error detection in a new way: they do it end to end, from the storage device all the way to the actual in-memory buffer contents. What do I mean by this? In the current setup, the disk sector has a checksum which catches errors, but your file system does not really do anything beyond that. If the disk says this block is okay, the file system says fine, I will take it and pass it on. But in the process of transferring it over the network, during repartitioning for example, something may get flipped. When you send it over the network there are again TCP/IP packet checksums, so most errors are caught, but when you transfer enormous amounts of data, some errors are going to slip through. So what is done these days is to add another level of checksum on top of everything going on underneath. There are file systems like ZFS, developed by Sun, now part of Oracle, which keep extra checksums at the file system level. So when you read data from whatever device, even if the device messes up, the file system will verify the checksum after reading the data into memory. If the memory has a problem, it will be detected at the point where the checksum is computed and verified.

It turns out this is a very important issue with large-scale systems. Recently my colleague Soumen Chakrabarti has been working on a large cluster provided by Yahoo, processing a web crawl. As web crawls go it is not very big: half a billion pages, in comparison with Google, which probably has 6 billion pages; but then Google also has thousands of machines, and we have 40 machines which Yahoo donated to us. On those machines he was running various processes, and he found that his program would sometimes crash. Looking deeper, it turned out that a bit flip had happened somewhere in memory, and as a result his program crashed. So now he is actually changing his data structures so that a bit flip will have only a small effect and will not crash the program. You may lose a little bit of data, but hopefully that will not be a disaster, and the data structure as a whole will not get corrupted. So it turns out that dealing with errors is very important today in parallel data systems.

I hope that answered your question. If you have a follow-up, please go ahead. The short answer is yes, it can happen, and any system which works at this scale must have checksums and other mechanisms to detect it. Back to you.

Good morning, sir. Good morning. Sir, I want to ask how the intelligent client locates data when we have a large number of machines.

So, when you have a very large number of machines, how do you find the location of something? In the context of a distributed file system, first of all, there is a master which records, for each file and for each chunk of that file (the chunks are fairly big; 128 megabytes is typical), which machine is the master for that chunk and which are the replicas for that chunk.
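Conceptually, that metadata is just a map from each chunk to its primary copy and its replicas. The following is a minimal, purely illustrative sketch of that idea in Java; it is not the actual data structure or API of HDFS or GFS, and the host names, file names and class names are made up.

```java
import java.util.*;

// Purely illustrative sketch of the kind of metadata a DFS master keeps:
// for every chunk of every file, which machine holds the primary copy
// and which machines hold replicas. Not the actual HDFS/GFS code.
class ChunkLocation {
    final String primaryHost;          // machine that accepts updates for this chunk
    final List<String> replicaHosts;   // machines holding copies

    ChunkLocation(String primaryHost, List<String> replicaHosts) {
        this.primaryHost = primaryHost;
        this.replicaHosts = replicaHosts;
    }
}

class DfsMaster {
    // file name -> ordered list of its chunks (each chunk is, say, 128 MB of the file)
    private final Map<String, List<ChunkLocation>> files = new HashMap<>();

    void addChunk(String file, ChunkLocation loc) {
        files.computeIfAbsent(file, f -> new ArrayList<>()).add(loc);
    }

    // A client asks: where do I find chunk number 'chunkIndex' of this file?
    ChunkLocation locate(String file, int chunkIndex) {
        return files.get(file).get(chunkIndex);
    }

    public static void main(String[] args) {
        DfsMaster master = new DfsMaster();
        master.addChunk("crawl/part-0001",
            new ChunkLocation("node17", Arrays.asList("node04", "node29")));
        ChunkLocation loc = master.locate("crawl/part-0001", 0);
        System.out.println("primary = " + loc.primaryHost
            + ", replicas = " + loc.replicaHosts);
    }
}
```

This map is what the client consults before it talks to the machines that actually hold the data.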
All updates go to that chunk master, and the replicas then get the copy from the master. Whenever the master copy of a chunk is updated, the update is also propagated to the replicas. So what do the client machines do? First of all, they get this information from the master. Now, in HDFS, the master is a single machine. In the Google file system, the file system directory information is itself part of the file system, which means that this metadata is itself distributed across many machines, and the client does two or three steps. First it goes to a master to find the root of the directory structure; then it fetches the next level from whichever machine has the part it needs; and from that it finds out where the actual data is for whatever it needs. I hope that answered your question. If you have a follow-up, go ahead. Yeah, thanks a lot, sir. Over to you, sir.

Anna University, Chennai, over to you. Good morning, sir. Let us assume the problem is that we are going to automatically index a large number of audio and video files, say lecture files in IIT, like what you are doing over the full semester. Regularly, more files will be added to the corpus. So if I am going to index automatically, what sort of functionality is provided in Hadoop or any other technology so that my automatic indexing is easier and faster?

Okay. So the question is: if you are indexing audio or video files, what sort of functionality is provided by systems like Hadoop when you do it in parallel? Hadoop does not provide any specific functionality for any specific task. It is a library which lets you parallelize whatever you are doing. So if you have a way to analyze audio or video files, that code can run within the map function. The audio and video files could be on the distributed file system; let us say there are many such files. Each file is assigned to one of the workers, and each worker will get multiple audio and video files. The map function will be called on each of those files, and you write the map function to do whatever you need to do. If that map function requires you to do speech-to-text, voice recognition and so forth, and then do the indexing, you do it in the map function. So what Hadoop has enabled in this case is that you can have all those thousand machines working in parallel on different files, doing whatever indexing function you have written. Hadoop itself does not have anything built in for any domain; it is just a parallelization framework. I hope that answers your question.

My next question is: if I want to do some kind of incremental indexing, does Hadoop have a facility for that? Why should I have to provide the entire corpus and redo the indexing from scratch?

Okay. So the question was about incremental indexing: you have already got an index, and now you get a few more records. How do you do it incrementally; does Hadoop provide any support? Like I said, Hadoop does not know anything about indexing; it has no clue what indexing is. It is all your code. So the question really is how you would write code to do indexing incrementally. You already have an index, and there are some new records; how do you index them? Well, the first part, extracting the keywords from the few new audio or video files for indexing, is straightforward.
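As a rough illustration of what such a map function might look like, here is a sketch using Hadoop's Mapper API. Everything domain-specific is your own code: the transcribe() helper is just a placeholder for a speech-to-text engine, and the assumption that each input record is a (file name, file path) pair is mine, not something Hadoop dictates.

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: Hadoop calls map() once per input record; here we assume each
// record is (fileName, filePath) for one audio/video file on the DFS.
// transcribe() stands in for whatever speech-to-text / analysis code you have;
// Hadoop itself knows nothing about audio, video, or indexing.
public class AudioIndexMapper extends Mapper<Text, Text, Text, Text> {

    @Override
    protected void map(Text fileName, Text filePath, Context context)
            throws IOException, InterruptedException {
        String transcript = transcribe(filePath.toString());  // your own code
        for (String word : transcript.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                // emit (keyword, file) pairs; the reduce side groups by keyword
                // and builds the posting list for each keyword
                context.write(new Text(word), fileName);
            }
        }
    }

    private String transcribe(String path) {
        // placeholder for a real speech-to-text engine
        return "";
    }
}
```

The point is simply that your analysis code sits inside map(), and Hadoop's job is only to run it in parallel over many files on many machines.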
So, if you put all your new files in some directory and then run Hadoop MapReduce on it, it can be told to access only the new files and to apply the map function only to those new files. The next issue is the reduce function, which takes these outputs: the reduce function can actually integrate them into the existing indices. How you do that depends on how the keyword indices are built, but think of them as B+-tree indices; they are not actually, but suppose they were and you are loading a large amount of data. Then you can simply scan the leaf nodes of the B+-tree index and the new data in sorted order, do a merge, and rebuild the B+-tree bottom-up on the result. When the increment is big, this is effective. When the increment is small, you just do a series of inserts into the B+-tree and you are done. So, depending on the index structure they use, one of these solutions is adopted. I hope that answered your question.

There are a few questions which came over chat; let me answer them. The first question is: how can we recover the lost admin password of a database? This depends on the specific database. For most databases, what you do is log in to the operating system account which owns the database files; for example, Oracle would be installed under some oracle account, and PostgreSQL under a postgres account. So let me repeat: the question is how to recover the lost admin password of a database, and the answer is that databases usually have an account from which they were created, like the postgres account which is created automatically for PostgreSQL. If you log into that account, you can usually run psql, in the case of PostgreSQL, or the equivalent for Oracle, without specifying a password, and then you can change the password from there. The specifics vary by database, but this is the general approach. This is a frequent occurrence: you create that admin password in the beginning, forget to write it down, and then completely forget it. Well, this is how you recover: go back to the instructions. For PostgreSQL, our instructions said how to set the admin password; if you forgot it, follow the same instructions after logging in as postgres, and it will not ask you for the old password. You can just set a new one.

The next one is: can you please say something about Oracle RAC and DB2 parallel databases? First of all, let me answer for Oracle RAC. Oracle RAC, which stands for Real Application Clusters, is basically what Oracle terminology calls a cluster. Now, what is a cluster? It means different things to different people, but if you are an Oracle person, a cluster means a database which resides on a shared disk. What does that mean? There is a disk subsystem, which could be a physical disk, but typically it is not a single physical disk; it is really a box which has multiple disks, network interfaces and so on. That box does not actually execute any programs; it just acts as a disk. You can read a block from it, write a block to it, and so on. That disk is connected to a number of machines, each of which can read and write data from it. This is called a shared-disk configuration. Now, these machines obviously have to cooperate: if they start writing data to the same block of the disk at the same time, there would be chaos.
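Just to illustrate the kind of coordination that is needed, here is a toy sketch in Java of a per-block lock table within a single process. The real mechanism in Oracle RAC is a distributed lock manager that spans machines and is far more involved; this sketch only shows the basic idea that a writer must hold the lock on a block before touching it.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Toy, single-process analogy of coordinating writes to shared disk blocks.
// Oracle RAC uses a distributed lock manager across machines; this is only
// meant to show why uncoordinated writes to the same block would be chaos.
class BlockLockTable {
    private final ConcurrentHashMap<Long, ReentrantLock> locks = new ConcurrentHashMap<>();

    void writeBlock(long blockNo, byte[] data) {
        ReentrantLock lock = locks.computeIfAbsent(blockNo, b -> new ReentrantLock());
        lock.lock();                 // only one writer touches this block at a time
        try {
            // ... write 'data' to the shared disk block here ...
        } finally {
            lock.unlock();           // release so other writers can proceed
        }
    }
}
```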
So, obviously the software which Oracle provides allows these machines to run in parallel, compute in parallel, and read and write data from the shared disk in parallel, but they have to coordinate on various things such as locking, and logging for recovery, and so on. They have to cooperate, but with that cooperation they can make sure that the shared data is not clobbered.

This is a setting which Oracle introduced for two reasons. One is that you get some parallelism: there are multiple machines which can run in parallel. But the biggest motivation for introducing it was reliability. What people realized is that machines tend to fail, whereas disk subsystems, once you put in RAID and so on, are much less likely to fail. In fact, what happens is that you get these boxes with multiple disks and multiple network cards, and they tend not to fail; they are very reliable. They are not 100 percent reliable, though. Some state in the US recently had a major problem with one of these storage boxes: there was a flaw in the software used in the storage box, and the state website, driver licensing, and a number of other applications of that state went down for several days. Some of them took five or six days to bring back up, because they had all the data on that disk and apparently did not have a proper backup strategy. I do not know the details, but because this box failed, many parts of their administrative systems were essentially shut down for a few days before they recovered. Anyway, that is a rare event; these boxes are fairly reliable. It is far more likely that the machines running in parallel die. So RAC has fault-tolerance features: if one of the machines dies, the other machines in the cluster can detect that it is dead and take over its processing. The data for it is on the shared disk, so the data is still accessible; it is not lost, or rather not inaccessible, when a machine dies.

So that is called shared-disk parallelism; let me write it here: shared disk, example Oracle. In contrast, the highly parallel systems with thousands of machines cannot have a shared disk. Just imagine a single disk subsystem accessed by thousands of machines at a time; it is simply not scalable. So if you want to scale beyond a few CPUs, you really need to move to shared nothing. Shared memory, where you have multiple processors sharing a memory, scales to an even smaller degree. Some years ago, all the CPUs on desktops were single-core, and a shared-memory system meant a server with multiple processors. Today, every new machine in the market has at least two cores; as a result, everybody now uses shared-memory parallel systems. Shared disk is one level up, where you have separate machines connected to a shared disk. With shared nothing, even the disk is not shared; there is just a network interconnecting the machines.

So, Oracle RAC has been used for many years for high reliability. I believe the Indian Railways reservation system, for example, was based on Oracle RAC. In fact, before Oracle had this, Oracle bought the corresponding database system from DEC, and I think the Indian Railways system used that; that was eventually the basis for Oracle RAC. So it has been in use by Indian Railways for many years now. How many years? I think about 25 years. It is a very old technology; it has been around for a long time.
DB2's parallel offering also has shared-disk features similar to Oracle RAC.

The next one is from R.C. Patel, Shirpur: can we run the ETL process separately, in a parallel fashion, to build a data warehouse without using the available tools? Can you suggest any simple way of doing it? I think if you use an existing ETL tool, as you have pointed out, then unless it supports parallelization you cannot do anything. But if you are building your own, it depends on what the steps in the extract process are. If you are getting some data and doing some extraction which is local, it is very easy to use MapReduce for it. Incidentally, extract-transform-load, for those of you who are not familiar with the terminology: data warehouses get data from multiple sources; they have to extract the data from whatever form it is in, transform it in certain ways, for example by doing joins to bring it into the denormalized schema used in the warehouse, and then load it into the warehouse. So ETL referred to those three steps done before putting data in the warehouse. These days people are talking about ELT, which is extract, load, and then do all the transforms inside the warehouse. One of the benefits is that the transforms can now be done using SQL, in parallel: the warehouse is already parallel, so the transforms run in parallel, and only the extracts have to be done separately. I do not know much about support for parallelism in ELT tools, so I cannot say more on that question.

The next question, from NIT Warangal: is the MapReduce model portable? Yes, it is very, very portable. The Hadoop implementation is in Java, and you can run it on Windows, on Linux, on anything. I am not sure whether HDFS itself runs on Windows, but the core Hadoop library is Java, so you can run it anywhere. I think HDFS is also Java-based, and I think you can run it on any system that supports Java, but I am not 100 percent sure about that. The bottom line is that it is very, very portable. In fact, if you think about it, because it is Java, the different machines in the cluster can be different; they do not have to be exactly the same. If you have upgraded a few of those machines to a new version of the C library or whatever, it does not really matter; the programs are still Java, and as long as the Java VMs are the same, they are fine.

Next, from NIT Surathkal: do we need to have the same operating system on all machines in a distributed file system? I think these two questions have come in a nice sequence, and my answer to the previous one is almost an answer to this one as well. The answer is that the machines running the distributed file system do not have to have identical operating systems. They do have to have identical DFS software, though; the distributed file system software has to be the same across all of them. If that software is different, there is no way they can cooperate. But one machine can be running Ubuntu and another can run Debian, and I suppose if the DFS software is general enough, one could even run Windows for all we care. If the software is good enough, the operating system is not an issue.

Now, the next one, from DOEACC Srinagar: is a distributed system an effective way of making a supercomputer? Yes. In fact, there was an era when supercomputers were made with special chips which could run much faster than your regular CPU chips. They used special technology, gallium arsenide; they had cryo-cooling.
You had Cray computers whose chips would run so hot that they needed a special kind of refrigerator, with Freon flowing through tubes into the CPU and whatnot. They were very successful in that era. But how many of you have heard of Cray computers? I would bet most of you have never heard of them, because that company is now essentially dead. It is kind of alive, but it is no longer what it was. What happened? The cost of making the Cray computers was enormous, and the market was very small as a result. In contrast, the market for your high-performance Intel i7 chips or equivalent AMD chips is huge; millions of people across the world use them. With Moore's law, Intel has been able to put in features that were once found only on Cray supercomputers; those features are now in your desktop CPU. What does this mean? It means that you can no longer do anything special, in some sense, to make a CPU go much faster than what your i7 does. You can make it go faster, but then the heat problem is enormous.

Those of you who have followed this trend know that Moore's law predicted that the number of transistors per unit area would double every one and a half to two years, and that has been happening steadily. There was another thing which people misinterpreted as Moore's law, which is that CPU speeds would also increase correspondingly. That actually kept happening for a long time; Moore never said it would, but it did in fact happen until about five years back, when CPUs first hit 3 GHz. Until then, every year we would go from 100 MHz to 200, to 800, to 1 GHz, 1.5, 2, 3. Then suddenly we hit a wall. After 3 GHz, boom: nobody is going faster than 3 GHz. The reason is that the heat generated by the CPU increases with the CPU speed. Once you go beyond 3 GHz the heat is so intense that, people have calculated, going even a little faster would melt steel, and a bit further would be hotter than the sun, which is of course ridiculous; you cannot actually go that fast, because if you melt steel your chip is already molten. But the point is that there are physical limits to speed, and that limit has been hit today. So the only way to go is distributed systems, and today all supercomputers are basically built using Intel or AMD or similar chips, with a few other companies also making chips. The biggest supercomputers today are either the few built by governments, China has one and the US has several in its national labs, funded to do scientific computation, or the only commercial ones which are comparable, maybe even bigger: the data centers at Google, Yahoo and Microsoft. Each of their data centers has tens of thousands of machines cooperating on a task; you can think of each one as a supercomputer.

Last question: suggest an open-source GUI tool which can be used in place of Visual Basic. That is a good question. If your goal is to build client-server GUI programs, there are several; you can use NetBeans with Java Swing, which has GUI-builder features. On the other hand, if your goal is to build web-based applications, Microsoft Visual Studio has had GUI features, drag and drop, for several years now, I think from 2005 at least, maybe 2003. In the Linux world, NetBeans tried: there was a project called Visual Web, I forget in which version. It was included in an earlier NetBeans version, but it was very buggy, so it has actually been dropped from the current NetBeans version. Eclipse does not have it.
I don't know of any open-source GUI tool for building web applications, meaning something like Eclipse, which we have used to build applications, but where you can drag and drop constructs: here is a table for display, here is a user input field, and so on. There is the NetBeans Visual Web module, but it is flaky; that is the only one I know of. Maybe there are other open-source ones. There are a few other proprietary ones for the Microsoft world; there is something called Iron Speed which can do more. There are also tools like Ruby on Rails, which is not a GUI tool, but it lets you generate certain simple user interfaces very easily. These are called CRUD interfaces; let me write that down. What does CRUD mean? Create, read, update, delete. They basically allow you to create tuples, read the tuples, update them, or delete them. But they are very simple interfaces; they do not carry any application semantics, and they are not very useful for real-life applications, although they are better than going in and typing SQL to read or modify data. So there are some non-GUI tools, but nothing which is really as good as the best GUI tools out there. I'll stop there with the questions.
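As a small footnote to the CRUD point above: the four operations that such generated interfaces provide boil down to statements like the ones in the following sketch, written here with plain JDBC against a hypothetical student(id, name) table; the connection URL and credentials are placeholders. A tool like Ruby on Rails essentially generates forms wrapped around statements like these.

```java
import java.sql.*;

// Minimal sketch of the four CRUD operations over a hypothetical student(id, name)
// table, using plain JDBC. Generated CRUD interfaces are forms around statements
// like these; there is no application-specific logic.
public class CrudDemo {
    public static void main(String[] args) throws SQLException {
        // connection URL, user, and password are placeholders
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/testdb", "user", "password")) {

            // Create
            try (PreparedStatement ps =
                     conn.prepareStatement("INSERT INTO student(id, name) VALUES (?, ?)")) {
                ps.setInt(1, 101);
                ps.setString(2, "Asha");
                ps.executeUpdate();
            }

            // Read
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SELECT id, name FROM student")) {
                while (rs.next()) {
                    System.out.println(rs.getInt("id") + " " + rs.getString("name"));
                }
            }

            // Update
            try (PreparedStatement ps =
                     conn.prepareStatement("UPDATE student SET name = ? WHERE id = ?")) {
                ps.setString(1, "Asha R.");
                ps.setInt(2, 101);
                ps.executeUpdate();
            }

            // Delete
            try (PreparedStatement ps =
                     conn.prepareStatement("DELETE FROM student WHERE id = ?")) {
                ps.setInt(1, 101);
                ps.executeUpdate();
            }
        }
    }
}
```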