So, I will proceed with my talk. First of all, let me talk about the conventional methodologies that we adopt in general. If you take any course, the way the course is delivered by an instructor is through lectures. To support the lectures, we often have a set of tutorials for courses such as databases and data mining. We often have a laboratory component, and there are lab exercises which students do. In the end, of course, we have an evaluation mechanism. In most universities, we have some theory papers. We also have assignments. We might have some quizzes, and we may have a lab exam. Why am I writing things which are well known to all of us? The point that I wanted to make is that this is exactly the pattern of delivery of a course and evaluation of students, independent of where the course is conducted.

Consider, for example, that you use Professor Sudarshan's book on databases. I have a copy here with me; I hope you have seen it. So, if Professor Sudarshan is teaching this course here, he will follow his textbook and he will base his lectures on the slides that he would prepare using the examples and the theoretical foundations from the book. Professor Vihare of J.S.I.T.S. Indore once observed to me that since Professor Sudarshan teaches a course from his own book, and has written that book, he is completely familiar with its nuances, including possible mistakes. But suppose I teach a course somewhere else, having only studied, let us say, a course under Professor Sudarshan; my teaching perhaps may not be able to emphasize the finer points to the same level that Professor Sudarshan would personally do. Consider another person at a still smaller place who has never attended Professor Sudarshan's lectures but is teaching from Professor Sudarshan's book. He has only studied databases in a course that he did, perhaps, during his undergraduate days, and he has probably followed an earlier edition of Sudarshan's book. Consequently, when he teaches, his teaching will perhaps have a lesser emphasis on the nuances of the subject, and that is natural. This is the reason why, while the teaching methodology is exactly the same at every place, the exact delivery, the quality of delivery, the absorption by students and the level to which students rise could differ significantly. That is one of the reasons why we say that to have good teachers is a boon.

What I would additionally say is that every teacher can become a good teacher, a better teacher and the best teacher by following certain cardinal principles. These principles are not merely about absorbing the knowledge of a particular field, such as databases in this case, but also about understanding the importance of setting the right problems in tutorials and lab exercises, and that is the reason why I have written this here: tutorials and labs. When we set tutorials and labs, the kind of problems we challenge our students with determines to what extent the students will master the subject. Only yesterday, in a selection committee meeting in Delhi, I was interviewing some candidates who claimed to have taught databases, and I asked this question to at least three candidates: how many SQL statements have you yourself written? And they said, oh, many. So, what was the longest SQL statement that you have ever written? And they confessed that the longest SQL query they had ever written was something like 4 to 5 lines.
I pointedly asked them: if your longest SQL query is 4 to 5 lines, then how would you be able to demonstrate to your students the variety of interesting aspects of querying? For example, how many tables did you join in a query? I will take one example of where this depth of emphasis is important. Consider outer joins. First of all, many teachers do cover joins, but mostly emphasizing equi-joins. When the outer join is covered, very often just a standard example is given out, saying this is the way you construct an outer join. But unless you back up such an explanation of the outer join with why you need it and in what context you need to use it, perhaps the students will not very clearly understand. And this is what I would like to suggest; let me just continue with this outer join.

Suppose I have two tables, and let us say one table contains my budget heads. So, I have 123, 127, 129, and these may be, let us say, miscellaneous expenses, travel expenses, equipment expenses and whatever. So, these are budget heads. I have another table which is a transaction table, in which I would have written down a whole lot of expenditure that is incurred. For example, I spent 5000 rupees on purchasing some equipment, so I will have 129 as the head. I would have spent 750 rupees on some travel, so I will mark it against 127. Of course, there will also be the date, time, etcetera. Suppose I spend another 500 rupees on travel, I will again enter 127, and so on. Over a month the transaction table would have many entries. Often I need to find out at the end of the month the total monthly expenditure that I have incurred on the various budget heads. The crucial point is that if I do a join between these two tables, I expect all the budget heads, and using a GROUP BY, I expect against each budget head the total sum of the expenses covered under it. However, if under a particular budget head no expenditure was ever incurred during that month, then the equi-join will miss it completely. I will therefore not know which are the budget heads on which there was no expenditure this month. This is a very good example to show them how an outer join will be able to list all such budget heads on which there has been no transaction in the table; a small sketch of such a query appears below. I am sorry to take a very simple example, but such are the examples which actually teach our students to look at the meaning and the implication of using different constructs in SQL.

The same thing can be said about every aspect of querying and every aspect of database design. We talk of functional dependencies, but the examples that we give and the questions that I see in the exams will always say p determines q, q determines f, thus p determines f. So, we always talk in terms of attributes a, b, c, d, p, q, r, s, and we try to define dependencies, and we can of course examine the theoretical understanding of the dependencies and whether people have designed the appropriate normal form or not. But during tutorials, during lab exercises, I think it is important to give people examples from real life. This is where the lab exercises become extremely important. I will submit that, taking the example of database teaching only, the same book that is followed, the same syllabus that is followed and similar contents which are expected to be covered still leave students at very different levels of maturity. And the fundamental reason is that teachers fail to challenge students' minds by giving them hard problems.
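To make that budget example concrete, here is a minimal sketch of such a query. The table and column names, budget_heads(head_code, head_name) and transactions(head_code, txn_date, amount), as well as the month boundaries, are my own assumptions, purely for illustration. A left outer join keeps every budget head, and COALESCE turns the heads with no expenditure into an explicit zero:

SELECT b.head_code,
       b.head_name,
       COALESCE(SUM(t.amount), 0) AS monthly_total   -- heads with no transactions show 0
FROM budget_heads b
LEFT OUTER JOIN transactions t
       ON t.head_code = b.head_code
      AND t.txn_date >= DATE '2024-04-01'            -- month boundaries are illustrative
      AND t.txn_date <  DATE '2024-05-01'
GROUP BY b.head_code, b.head_name
ORDER BY b.head_code;

With a plain inner join, the heads that had no expenditure simply disappear from the result, which is exactly the point the example is meant to drive home; note also that the month filter sits in the ON clause rather than the WHERE clause, so the unmatched heads are not filtered away afterwards.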
In every exam, in every tutorial, in every lab, there has to be at least one problem with which people struggle very hard and which very few people are able to get. This will require teachers to spend a lot more time in setting problems and in solving those problems themselves. That is what preparation time is. Unfortunately, given the teaching load that our teachers have, the amount of preparation time available is reducing day by day, and that is the reason why many teachers are not able to prepare high quality problems. Question papers, for example, should never ever have a question repeated from previous exams, but that means a lot of work needs to be done. This is one of the reasons why we believe that the modern teaching methodology must permit all teachers to use some kind of a knowledge repository for a subject, such that they will be able to select problems suitable to their students and also use challenging problems without having to spend considerable time on their own.

It is in this context that I would now like to talk about the teaching methodology that we follow. Some of you might have attended the workshops that we conduct under the national mission. So, let me talk about it: the National Mission on Education, started by the Ministry of Human Resource Development, attempts to enhance the quality of our teaching environment on a very large scale, and for this purpose it proposes to use information and communication technologies. Many of you may be familiar with one project that we are doing under this, so forgive me for repeating it: we assemble about 1000 teachers at a time. They assemble in what we call remote centers. Take, for example, the present assembly. All of you teachers have assembled at Indore, and I do not know the number; there could be 30, 35, 50, maybe 100 teachers. But can you imagine 1000 teachers participating from a single place? Logistically, that is a nightmare, and that is the reason why we select a large number of remote centers. In the city of Indore itself, we currently have three remote centers, and we propose to set up something like 500 remote centers across the country, covering all major cities and towns. These multiple remote centers then collect the participating teachers at the appointed time. We typically conduct two-week workshops for teachers.

How does this help? Suppose a two-week workshop has to be conducted for 35 teachers at a place; then to cover 1000 teachers, you would have to repeat that exercise about 30 times. Plus, suppose you want Professor Sudarshan himself to conduct a workshop for empowering teachers. It is impossible for Professor Sudarshan to conduct it that many times, even at IIT Bombay, and most certainly not at various places. But this methodology, where we use the same audio-visual technique, permits Professor Sudarshan to conduct a two-week workshop on databases sitting right here in Mumbai, assisted by a set of workshop coordinators. Each workshop coordinator supervises the conduct of the workshop at a remote center. Consider, for example, that in Indore, Professor Tanwani himself could be a workshop coordinator. Please understand that while Professor Tanwani himself would be able to teach databases excellently, the idea here is to use Professor Sudarshan directly. So, Professor Tanwani now acts as his teaching associate. This, by the way, is a method which has been practiced at IIT routinely for the last 40 years that I have been teaching here.
For example, when I teach a programming course, earlier I used to have six or seven of my own colleagues as my teaching assistants. I would teach the course and give the lectures, but the tutorials would be conducted by these teaching assistants, and the IIT system, fortunately, is not rank-based. It is not uncommon, and it happened to me many times, that two or three of my senior colleagues were also working as my teaching assistants. The point is that a teaching assistant or a TA, who in our case is the workshop coordinator, has to be almost as good as the teacher himself or herself. Further, these workshop coordinators assemble physically at IIT for some time, and they discuss all the lab exercises, all the problems that are to be solved, in short the entire syllabus of the two-week programme in detail. That is the reason why, after the lectures are given by Professor Sudarshan in the morning, the local workshop coordinator is able to conduct the afternoon tutorial sessions and lab sessions as effectively as they would have been conducted physically at IIT. And this is the reason why we are able to ensure that all participants are exposed to the same quality of discussion on the main topics.

Now, this method we are currently using for teachers. We are, by the way, conducting a two-day workshop on the coming Saturday and Sunday, I think the 18th and 19th. This is on writing technical papers for conferences. Many teachers are unable to put forward a well-written technical paper; as a result, many submissions are rejected because of poor quality. This two-day workshop we are again conducting from about 35 remote centers. The reason I mention this is that there will be a larger follow-up workshop of five days, which we propose to conduct in the month of June, on research methodologies. Now, this workshop, we believe, will be useful to all teachers who are doing research, whether for their PhD or even for their ME/M.Tech dissertation, which means a large number of teachers may find it beneficial. Consequently, we have decided to open 200 remote centers, and we are planning to address 10,000 teachers in one go. You will naturally ask me what all of this has to do with teaching methodologies in databases and data mining. The point I am making is: imagine if Professor Sudarshan, or any one of you who is a good teacher, can reach out to thousands of participants interactively and meaningfully; how much of an impact would that make on the quality of education in the country. That is the reason why I mention these details.

I would now like to spend a few minutes on the specific nuances of the two subjects which are being considered under this event. Typically, when we talk about teaching databases, we often have in mind a subject that is taught in the university. This subject would have a list of references and textbooks. I mentioned Professor Sudarshan's book. Actually, the full author list is Silberschatz, Korth and Sudarshan. The reason I keep referring to the book as Sudarshan's book is that in the sixth edition, the maximum number of modifications and updates have been done by Professor Sudarshan. But if this is the textbook followed, what we forget while teaching is that this textbook, or any such textbook, whether it is by Elmasri and Navathe, or any other book by Ullman, whichever it is, these books take a long time to evolve.
Consequently, what I am teaching or what is there in the syllabus may often reflect the knowledge that existed about 5 to 20 years ago. Now, the basic principles which came up 20 or 30 years ago, like in this case the basics of the relational model which Professor Codd invented, that basic model and the fundamentals remain the same even today. But various techniques change. SQL itself has undergone several changes. Since data mining is mentioned as one of the topics, let me say that a variety of analytic queries can now be done using standard SQL, because the SQL extensions permit aggregates, pivoting and a variety of things which were not there earlier; a small sketch of one such query appears below. Now, imagine if I studied this subject 4 years ago and had not studied some of these points; I will not be able to cover those topics when I teach my students. What I mean is that when we teach today, we must remember that while there is a syllabus, we must make that syllabus an evolving syllabus. Whenever I teach a course, if I have not added something this year which is different from last year, then I would feel I have not done my duty as a teacher. As a result, every teacher must ensure that whenever you take up a particular subject such as databases, you try to cover as much as is possible from the latest edition of any textbook that you use, plus something more. So, this should have two components: the latest edition of any textbook, plus at least a few research papers. I will submit that even if you are teaching an undergraduate course, it is worthwhile to at least mention the gist of a few important research papers, just to give the students a flavor that, while they study a textbook and while they prepare for the exam using the textbook, there is much more to the knowledge contained in that textbook than what is seen only in the printed pages. The research papers usually are the bulwark or the foundation on which such textbooks are written. So, an evolving syllabus is the first thing that I would suggest you use.

The second thing, as I already mentioned, is harder problems. One aspect of the hard problems I will mention is that when we teach databases, we do talk of design, we do talk of SQL queries and, if the syllabus permits, we do talk of query optimization, for example, but very rarely do we talk of large data and very rarely do we talk of performance. Please note that if we teach our students to handle a database which has, say, 100 sample rows in each table, where the joins hardly result in two or three rows being taken out and our queries run within a few seconds even on a PC, then we have not given our students the right flavor of databases. If you really want to show them how to handle large data, there are many tools available which will permit you to automatically generate large data. Has anybody seen the Transaction Processing Council site, the TPC benchmarks? I will specifically mention the TPC-H benchmark, which relates both to databases and to business analytics; not exactly data mining, but to some extent business analytics. How many of you know that TPC-H is available for download: the entire benchmark suite, all the queries, the database, the DDL statements and also a data generator? It is possible to generate data for experiments by your students to the tune of not just 1 GB, not just 10 GB, not just 100 GB, but even 1 terabyte and 10 terabytes.
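Coming back to the analytic extensions I mentioned, here is the small sketch I promised; it is my own illustration, reusing the hypothetical transactions(head_code, txn_date, amount) table from the budget example. A window function gives a running total of expenditure within each budget head without collapsing the rows the way a GROUP BY would:

SELECT head_code,
       txn_date,
       amount,
       SUM(amount) OVER (PARTITION BY head_code
                         ORDER BY txn_date) AS running_total   -- cumulative spend per head
FROM transactions
ORDER BY head_code, txn_date;

A query like this is standard SQL today, yet a teacher who last studied SQL before window functions were added might never think to show it to students.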
I do not know whether you have ever attempted generating data at that scale, but my request will be that somewhere, at least 100 GB of data should be generated with some standard schema, and people should be asked to write queries and measure the performance. How long does a query take? This will teach them much more about both handling large data and handling performance. I am sorry to again give a very simple idea, but I believe that unless we challenge our students with the newer things that are happening and with the harder problems that they will have to face in real life, we will not be doing justice to our teaching. I will close my talk at this juncture. If there are any queries, I would request somebody to type a query in the chat window and I will answer it.

The query is that I should elaborate further on the random generation of data. Let me put it this way: the data generator is not completely random, because randomly generated data will not give a realistic feel. So, let me briefly explain; you can actually read much more about it on the TPC site itself. The TPC-H benchmark is now being superseded by other benchmarks, which incidentally include an end-to-end multi-tier architecture benchmark and so on. But concentrating on the generation of data, what does the suite come with? A whole lot of scripts. First of all, you use the DDL statements to create the tables. Please understand that many of these tables will have something like a master-slave relationship. So, for example, if there is an item master, then for each item there are some transactions: some orders placed, some material delivered. These will all be tables which have foreign keys. Now, if I generate data randomly, it may so happen that I have so many item codes in the master, but none of them appears in any transaction. That is why, while randomness is important, some systematic generation is equally important. What is done is that first the set of master tables is generated, such that all the primary keys are well known, and then the random generation proceeds by using the keys which are there in the master tables. And you can actually control even the statistical distribution of how many transactions there should be for each group, and so on. So, I would not call this random data generation; I would call it artificial data generation. And in most cases, this artificial generation of data helps you to predict the kind of performance that you would ultimately get when you really implement the system.

By the way, when we run benchmarks for large banks, for example, to find out how the analytic queries will work, we often collect about a terabyte of data from the operational store that the banks have, where the data has actually been collected over the past two years or so. And then this data is exploded or expanded to 30 terabytes, to reflect the situation which will arise at the end of a data warehousing project or a database project. So, this involves not artificial data creation but expansion of the real data which already exists. There again, complete randomness is not good, because you would like queries to return correct answers. So, what is done is that you take rows from the existing master tables, concatenate primary keys and generate larger data; for example, by adding a transaction for a future date, retaining all the details of the same transaction but with a different date.
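To make that procedure concrete, here is a minimal sketch in SQL of the same idea: masters first, then transactions generated against known keys, then expansion by shifting dates. I am assuming PostgreSQL (for generate_series and random) and purely hypothetical table names and sizes, item_master and item_transactions; the real TPC-H generator is a separate, far more carefully tuned tool, so treat this only as an illustration for a lab exercise.

-- Step 1: create the master first, so every primary key is known.
CREATE TABLE item_master (
    item_id   INTEGER PRIMARY KEY,
    item_name TEXT NOT NULL
);

CREATE TABLE item_transactions (
    txn_id    SERIAL  PRIMARY KEY,
    item_id   INTEGER NOT NULL REFERENCES item_master(item_id),
    txn_date  DATE    NOT NULL,
    quantity  INTEGER NOT NULL
);

-- Step 2: populate the master deterministically.
INSERT INTO item_master (item_id, item_name)
SELECT g, 'item_' || g::text
FROM generate_series(1, 10000) AS g;

-- Step 3: generate transactions "randomly", but only against existing keys,
-- so every foreign key is valid and joins behave realistically.
INSERT INTO item_transactions (item_id, txn_date, quantity)
SELECT (1 + floor(random() * 10000))::int,            -- pick an existing item
       DATE '2023-01-01' + (random() * 364)::int,     -- spread over one year
       (1 + floor(random() * 100))::int
FROM generate_series(1, 1000000);

-- Step 4: the "expansion" idea -- copy the existing transactions one year into
-- the future, keeping all details but shifting the date.
INSERT INTO item_transactions (item_id, txn_date, quantity)
SELECT item_id, txn_date + 365, quantity
FROM item_transactions;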
This way, you can expand the data year by year for several years into the future. I think I will stop here, but for those who are interested, I would request you to visit the Transaction Processing Council site; you can actually download a whole lot of scripts, including the data generator, the table DDL and the queries. It is an interesting exercise.

The next query: emerging trends in databases and data mining. Oh, this is about emerging trends in databases and data mining. One could spend hours talking about it, but let me just mention a few things very quickly. So far, we have always looked at structured data, and that is why we have a well-defined schema, some OLTP operations, online transaction processing operations, and some analytical queries. However, if you talk about future trends, increasingly people realize that unstructured data will form the major source of information around which people will take their decisions and which they will use to conduct their own business and operations. This unstructured data, unfortunately, is not amenable to the conventional definition of a schema. Consider, for example, a whole lot of documents. Consider, for example, the whole lot of emails exchanged between people. There is a huge amount of useful information in those documents, but analyzing that information is not very easy. And that is why data mining and analytics will increasingly focus on unstructured data. Documents are one example; emails are another. Images, photographs, video clips: these are the sources of information whose usage will proliferate far more than that of structured data. There are many, many problems associated with even storing and organizing this data, and far more critical problems in mining such data. But these are the challenges on which we will need to focus our attention. When teaching databases and data mining for the future, for example, I would submit that some treatment of handling unstructured data ought to be made a part of every syllabus, and ought to be covered at least to give people a glimpse of what to handle and how.

Let me give you a couple of practical instances of why this is important. I was involved in setting up the surveillance system for our Securities and Exchange Board of India, SEBI. You will note that SEBI is actually the regulatory authority which oversees the stock exchanges and stock trades. Now, there are many problems where people, due to their greed, try various ways to artificially rig prices. One of these is called insider trading. So, I have some information about how a company is going to do over the next few days, and I use it either to buy or sell shares and make a profit. Now, how does SEBI determine whether whatever trading I did was because I had inside information or not? The standard way is to actually analyze and mine only the transactions which I have done, and if I am doing too many transactions, a suspicion might arise. But it is possible that I am not on the board of that company, but my friend's cousin brother is sitting on that board. That cousin brother tells my friend some important information, and then that friend sends me an email giving me that information. What is then required? What is required for SEBI in the future is to look at as many email exchanges as possible, and at as many phone call recordings as possible.
On a simpler note, when companies submit their annual reports, there are some spreadsheets giving financial results and so on. But a whole lot of information on who the directors are and who the close relatives of the directors are is hidden in the documents. All of this is submitted, by the way, to the Ministry of Corporate Affairs, and the pile-up of information is huge. Fortunately, most of this information is required to be submitted in digital form. Unfortunately, there is no known mechanism to analyze all these documents and to set up a network of linkages saying such and such broker is related to such and such person, or such and such person is a client of this broker, and this person is related to this director, and that director had participated in taking a decision for that company which was known to this client, etcetera, etcetera. Imagine the amount of complex mining that will have to be done on this information. One can give any number of such examples.

I will give just one more example of how large data will have to be handled, and that is in the image domain. Not ordinary images; I am talking about images which observatories collect of the entire universe and the movement of stars. You would all like stargazing. You would be familiar with the fact that ordinary telescopes as well as radio telescopes keep capturing data about the positions of stars, their intensity, their movement and so on. Do you know how much data is getting generated practically every day from hundreds of observatories across the world? The data is of the order of petabytes, and the observatories retain this data. Now you have these petabytes of data, but to access this data, to analyze this data and to visualize this data is a challenge. There is an interesting project which has been happening in India over the last several years, called the Virtual Observatory. This particular project is undertaken by IUCAA; the famous Dr. Jayant Narlikar had set that institute up, and Dr. Kembhavi was the director till recently. Now somebody else has taken over, but they have been working on this project. I know this project because it is funded by the Department of Information Technology. So what they do is access these petabytes of data, and they have built a visualization layer. This visualization layer permits researchers across the world, but mostly in India, sitting at various places, who cannot hold all these petabytes of data themselves, to go and ask to look at a slice of that data, or different slices put together, and this results in a much smaller, mined set of data which the researchers can then see. So the end user sits at the far end of this chain.

Take another example, bioinformatics. The amount of information, and I am not just talking of gene sequences, I am talking of a variety of useful research on drugs, drug discovery and so on. The formats in which the data exists are innumerable, and it is not easy to identify and integrate all these various data sources. So although much of that data is structured, there are so many different structures that you will get mired in 200 different schemas which you will have to consolidate at some point. How do you do any mining or any analytics on such data? Realizing this, people have now started working on what we call an evolving schema. As a result, when you are dealing with unstructured data, you have to do exactly the opposite of what you do in a conventional database.
In conventional databases, you first study the requirement and set up your tables; that is, you do the database design first. That means you fix the schema first. Then you populate those tables with data, run your queries and so on. In unstructured data, there is no schema. You have to actually define some kind of a schema, and then you suddenly find that another type of unstructured data is coming in, or the same unstructured data now contains attributes which you did not look at earlier, and that is the reason why we call it an evolving schema. And there may not be one schema; you will have to work with multiple schemas for the different types of activities that you do. The field of processing and mining data is becoming more and more exciting as the days pass. I would only submit that it is impossible to completely cover the emerging trends in this context. But over the next 10 years, I will say only this: please tell our students that just as structured data is important, and it still continues to remain important for operational systems and for online transaction processing, unstructured data is assuming a far greater significance in everyday business and everyday life, and sooner rather than later our own syllabi and the subjects that we teach must reflect this as well. I think I will conclude here. I have taken much more than the allotted time. I am sorry for getting carried away, but the questions were interesting. So, I think I will close now. Thank you very much. Over and out.