Okay, we'll move on to the very last topic; we have just half an hour to skim it. I'm not sure it's very useful to go through this in such a short time, but let me do it anyway, to point out some of the terms that are out there and to get you started on further reading. You may not understand everything in such a short overview.

First, I have been using this term decision support, decision support systems, regularly. What are decision support systems? They're not meant for processing transactions; they're not meant for selling tickets or updating your bank account and so forth. They're used to make business decisions based on the large amounts of data that are out there: typically retail sales data, phone call data, plus other data about a customer. Plus, these days, maybe the search terms that people are using on the web; Google will happily sell search terms to people who want to find out what users are querying about their products. What kind of business decisions? All kinds of things: what items to manufacture, what items to stock, how much premium to charge for insurance, and so on.

So decision support has various aspects. There's a data analysis aspect: once the data is there, how do you do analysis? A typical way is to compute different aggregates, which analysts eyeball to decide what to do. They will look at one aggregate, then at other aggregates, to try to understand why the first one changed the way it did. Sales fell this quarter; why did that happen? What would explain it? So they explore the data to find out what happened. There are also statistical packages such as S-Plus, and many other packages that can be used on top of this; we are not going to get into those.

Then there is data mining, where the goal is that you don't have an analyst exploring the data; instead, you tell a system to look for patterns and report occurrences of certain kinds of patterns. There are many kinds of patterns: clustering, association rules, and so forth. And then there are other techniques used to make decisions, such as: given some information about a customer, and based on information about earlier customers, how much premium do I charge this customer for insurance? That's another place where data mining is used.

Then there is the term data warehouse. What is a warehouse? It's basically a system which archives information gathered potentially from many sources and stores it in a unified way at a single site. A large company may have many operational databases: one for its division X, one for its division Y, one for geography A, one for geography B. The data warehouse collects data from all these sources and puts it together in a single schema. There are also external sources of data; as I said, you can buy it from market survey organizations and many others. The warehouse also stores historical data, not just today's data but the last one, two, three years of data. So a data warehouse is a repository of such information.

Now here is roughly how a data warehouse operates. There are many data sources, and the warehouse has data loaders which take data from these sources. The sources may be in different formats, so the loaders bring the data to a common format. Sometimes there are errors in some sources; the loaders try to clean these up and remove things which are clearly wrong. If somebody's age is given as 900, it's clearly wrong; it's going to be removed, along with other such cleanup.
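To make this concrete, here is a minimal sketch, in Python, of the kind of sanity checks a loader might apply. The record format, field names, and the 0 to 120 age range are all assumptions made up for illustration, not anything from a real warehouse loader.

```python
# A minimal sketch of loader-side cleanup; field names and the age
# range are invented for illustration.
def clean(records):
    cleaned = []
    for r in records:
        # Drop records whose age is clearly impossible (like the 900 example).
        if not (0 <= r.get("age", -1) <= 120):
            continue
        # Normalize differing source formats to one convention.
        r["name"] = r["name"].strip().title()
        cleaned.append(r)
    return cleaned

print(clean([{"name": " john ", "age": 900}, {"name": "mary", "age": 34}]))
# -> [{'name': 'Mary', 'age': 34}]
```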
Then the loader puts the data into a database under a unified schema, and query and analysis tools run off that.

Now, what kind of schemas are used here? It turns out that the normalized schemas we have focused on, with ER design and so on, are not actually appropriate for data warehouses. Here no updates are happening, except for the data you're pulling in from other places, so there's no major issue with inconsistency; what you want is efficiency. Typically you have a large fact table with many dimensions; there's a picture of it here. A company which is selling stuff wants to analyze its sales, so it has a sales table, which is usually a very large table. It records something like: this item was sold in this store to this customer on this date, this many units, at this price. This is actually simplified; there may be a few more fields. But note that many of the fields here are identifiers. The item ID, for instance, is actually a reference into some other table which holds the items. Usually the item table is smaller than the sales table, though it may still be pretty big; Amazon stores millions of items, Flipkart probably hundreds of thousands. But the sales table is much larger than the list of items. Along with the item ID, you have things like name, color, size, category, and so on, and your aggregate query might want to see sales by category, sales by color, and so on, grouping by various attributes. Similarly, you have the date, which has fields like month, quarter, and year associated with it. Note that all of these are redundant: from the date you can compute all of them, but they are still stored, for efficiency. Similarly, you have the store ID with city, state, country, and the customer ID with various information. This is the typical pattern: a lot of foreign keys from the central fact table to a number of surrounding tables, which are called dimension tables. This kind of schema is called a star schema. In the simplest cases there is just one large fact table, but there are domains where you have a few fact tables; usually not too many. For analysis, these are the schemas which are most relevant. Many other things that you don't care about in analysis get thrown out, and many other schemas in the OLTP system might get folded into one table here; for customers, for example, many relationships may get folded in somehow.

So that is a super quick look at what a data warehouse is; it is not a deep look at all, it is a super shallow look. There are many issues in data warehousing: how to load data, what schema to use, how to optimize queries running on it, how to pre-compute aggregates so that queries will run faster, and so forth. A typical aggregate query on a star schema is sketched below.
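Here is a small sketch of such a star-schema query, using Python's built-in sqlite3 module. The table names, columns, and data are invented to mirror the simplified sales example above; they are not taken from the lecture's slide.

```python
import sqlite3

# A toy star schema: one fact table (sales) referencing one dimension
# table (item). Names and data are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item  (item_id INTEGER PRIMARY KEY, name TEXT,
                        color TEXT, category TEXT);
    CREATE TABLE sales (item_id INTEGER REFERENCES item, store_id INTEGER,
                        sale_date TEXT, units INTEGER, price REAL);
""")
conn.executemany("INSERT INTO item VALUES (?,?,?,?)",
                 [(1, "shirt", "blue", "apparel"),
                  (2, "phone", "black", "electronics")])
conn.executemany("INSERT INTO sales VALUES (?,?,?,?,?)",
                 [(1, 10, "2013-01-05", 3, 20.0),
                  (2, 10, "2013-01-06", 1, 200.0),
                  (1, 11, "2013-02-01", 2, 20.0)])

# Typical analysis query: join the fact table to a dimension table
# and aggregate, here "sales by category".
for row in conn.execute("""
        SELECT i.category, SUM(s.units * s.price) AS revenue
        FROM sales s JOIN item i ON s.item_id = i.item_id
        GROUP BY i.category"""):
    print(row)   # ('apparel', 100.0) and ('electronics', 200.0)
```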
The next topic, which we will brush upon very quickly, is data mining. Roughly speaking, data mining is the process of semi-automatically analyzing large databases to find useful patterns. There are several kinds of tasks which can be done as part of data mining. One is prediction, which is one of the very important tasks. If somebody has applied for a job, a loan, or admission, how do you decide whether to admit them, or give them the job, and so forth? For some of these, human intuition plays a role, but that can be fickle. Instead, if you had a lot of prior data about people, you could predict whether this person is likely to pay back the loan on time; or is likely to pay it back, but not on time, paying large interest and late fees, yet still paying back eventually; or is likely to default, run away, and leave us with an unpaid loan. These are the kinds of things you want to predict, and a credit card company loves the second type of customer. With the first type they are not losing money; with the second type they are making tons of money, because that person pays interest on credit card debt. The third type, who runs away entirely, they don't want, because then they lose money. So how do you predict this, based on various attributes: income, job, age, gender, where they live, and so many other factors? That's one kind of task.

Another use of data mining is classification: I want to group items in some way and, given a new item, predict which class it belongs to; we will see some of this later. Then there is a variant of it, regression formulae, where I want to compute a number through some method. For example, the prediction earlier could be predicting whether this person is in bin A, bin B, or bin C, three categories; or it could be a number, say a credit score between zero and one. In that case it's a regression formula.

And there are other types too. There are associations. If you have gone to a shopping site such as Amazon (and Amazon is not just books, it's everything), they look for associations. If you search for a particular book, they will say people who bought this book also bought these other books, or people who searched for these books ended up purchasing this other thing, so maybe you want to take a look at that too. These are associations. There are other kinds of associations which say that people who bought item X also bought item Y. In the case of the online shop, the goal was to suggest other things you might find interesting. But in other cases, like retail shops, if they find that people who buy item X are also likely to buy item Y, they may do something about it: they may put up a banner in one section of the store saying hey, we have a sale on this other thing in this other section, go check it out. There are many other things they can do to drive sales; a lot of this is driven by how to increase sales.

An association can also be the first step in detecting causation. Say there is an association between living in a particular part of the city and having tuberculosis. Is living there a cause of tuberculosis? Maybe, maybe not. But that's a pattern, and you may want to look deeper into it; maybe the pollution there is very high, and that is causing the tuberculosis. A related technique is to look for clusters. Instead of an association involving a location attribute, clustering is when you have a map and you find that in one region of the map there are lots of cases of typhoid; that may be because of contaminated water there, and you might respond by going and fixing the water mains. So all of these are different types of data mining.

Now, coming back to classification: you could create rules. A rule could be: for all persons P, if the degree is masters and income is greater than 75,000, then credit is excellent. This is an example of a classification rule.
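As a sketch, the rule above can be written as ordinary conditionals. Real classifiers learn such rules from past data rather than having a human hard-code them, and the 50,000 fallback threshold below is an invented example, not from the lecture.

```python
# Hand-coded classification rules, as a sketch; in practice these
# would be inferred automatically from historical data.
def credit_class(degree, income):
    # Rule from the lecture: masters degree and income > 75000 -> excellent.
    if degree == "masters" and income > 75000:
        return "excellent"
    # Further rules would cover the remaining cases; this threshold
    # is purely illustrative.
    if income < 50000:
        return "bad"
    return "good"

print(credit_class("masters", 80000))    # excellent
print(credit_class("bachelors", 40000))  # bad
```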
Then there can be other such rules. Now, who comes up with these rules? It could be a human, but the goal here is to infer them automatically. The rules can be represented as a decision tree. A decision tree starts by testing some attribute, the degree here, and has multiple branches. So: your degree is bachelors, and your income is 50... actually, I think something is messed up on the slide; this should have been "less than 50k". Then maybe your credit rating is bad; if income is greater than or equal to 50k, it is considered good. On the other hand, if you have a doctorate and your income is low, you might still be considered good, because the assumption is that you are a poor professor but an honest person, even though you may be poor, whereas the other poor fellow may not be as honest. Anyway, how do you come up with these rules? Like I said, based on old data; there are algorithms for building decision trees. That is one kind of classifier. There are many other types of classifiers: neural nets, Bayesian classifiers, support vector machines. Some of these are covered in the book. Association rules I already mentioned; I am going to skip the details. Clustering also, since we are kind of out of time, I am going to skip; these are standard things which many of you already know about.

But let me mention a few other things which are more recent. Text mining is an area which has taken off in the last 10 years or so. What is text mining? It is basically applying data mining techniques to textual documents. For example, you have a number of web pages and you want to cluster them to find pages which fall into certain categories. This could be useful for many purposes. Maybe you want to cluster the pages a user has visited, to organize their visit history. I know I looked at some page related to PostgreSQL some time back, but I have looked at many other sites since; if the system can cluster my history, I can say, find me things in the cluster where PostgreSQL appears, and then see only that part of my history. I may also want to classify web pages into a directory. There are many more applications here. I should mention that this area, like I said, took off in the last 10, 15 years, and one of the leading people in it, who wrote probably the first textbook on this topic, in particular on text mining and its connection to the web, that is, textual data on the web, is Professor Soumen Chakrabarti, who is my colleague in this department. His textbook on this area has been used widely across the world.

That brings me to information retrieval, which is actually Soumen's core area: web information retrieval. All of you have used web search engines; everyone, down to children, is familiar with keyword search on the web now. But if you go back to 1995, hardly anybody knew about it; the first web search engines were debuting around that time, 94, 95. It turns out that while web search came along at that time, information retrieval is a much older area; it dates back at least to the 1960s, that old. Back then they didn't have the web, but they had text documents, so it was an important area for a very long time. The idea was that you give keywords and get back relevant documents. Note that we are not running SQL queries on these; there is no structure, documents are just text. Without any structure, we still want to find relevant documents. That's what information retrieval has been about.
These days there has been work on merging these two scenarios. On one end you had only keyword queries on textual data with no structure; on the other side you had structured data with only SQL queries. People have been trying to mix the two. For example, there was work done at several places, including IIT Bombay, where we asked: can we issue keyword queries on relational databases and get meaningful answers out? What does it even mean to answer a keyword query on a relational database? This is a paper which Soumen, I, and some students wrote about 11 years ago. That paper was at the right place at the right time; it got a lot of notice across the world, and there have been a lot of citations to it. It's an area which caught people's imagination. So there are many ways in which information retrieval and databases are combining today. There is also the other direction: can you extract structure from unstructured data and then use structured query languages to query information which originally sat in textual documents? That is also a very active area.

Anyway, that is what information retrieval is about. There are differences from databases. IR systems don't deal with transactions at all, and they don't really deal with structured data; it's mostly unstructured. But they do many things which database systems didn't do: approximate searching by keywords, and ranking of retrieved answers by estimated degree of relevance. These are very important. Google succeeded because it did a fantastic job of ranking results; the things you wanted were mostly in the top few. Many of you may not know it, but Google was not the first search engine. There were three or four companies before Google, and they all got wiped out; they don't exist today, though some were bought out by others. They got wiped out simply because Google did a much better job than they did. Of course, some of the later arrivals, like Bing, did improve and caught up with Google.

Now, how do you even decide what is relevant to a particular query? I type some keywords; how do you know what is relevant to them? There are some very basic notions that are used. The first is what is called term frequency, which is a measure of how relevant a word is to a document. At one level you can say it's how many times the keyword occurs in the document, but in a deeper sense, if the word is in the title, it's more important for that document; occurring five times near the end of the document may matter less than occurring in the title. So term frequency is a measure of how relevant a term is to a document. On the home page of Soumen Chakrabarti, the words Soumen and Chakrabarti are very important; but if the page mentions that he went to VLDB in Singapore, that is mentioned yet not so important.

The next notion is inverse document frequency. Given a query with five keywords, suppose one of them is "the", or some other keyword that occurs very frequently. Most search engines would not give much importance to that term, since most documents contain it. But suppose my query is "the" followed by "Soumen Chakrabarti". Now, Soumen is not a very frequent word; it occurs in a relatively small number of web pages. Chakrabarti is also not very frequent, maybe a little more frequent than Soumen. And "the" is the most frequent of all. So weightage is given accordingly: "the" is of little importance to this query; Soumen is viewed as the most important term, because it is the least frequent across all pages; Chakrabarti is a little more frequent, so it's considered slightly less important. That's inverse document frequency: if few documents contain a term, give more importance to that term. (Term and keyword I use synonymously here.)
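Here is a bare-bones sketch of these two notions in Python. The toy documents are invented; term frequency here is just a normalized count (real engines also weight things like title occurrences, as discussed above), and the log formula is one common choice for inverse document frequency, not the only one.

```python
import math
from collections import Counter

# Toy corpus: each document is a list of lowercase words (an assumption
# made for illustration).
docs = [["soumen", "chakrabarti", "home", "page"],
        ["the", "vldb", "conference", "page"],
        ["the", "database", "course", "page"]]

def tf_idf(term, doc):
    tf = Counter(doc)[term] / len(doc)             # relevance of term to this doc
    df = sum(1 for d in docs if term in d)         # how many docs contain the term
    idf = math.log(len(docs) / df) if df else 0.0  # rare terms score higher
    return tf * idf

print(tf_idf("soumen", docs[0]))  # high: the term is rare across the corpus
print(tf_idf("page", docs[0]))    # zero: the term appears in every document
```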
The last notion (the first two predate the web; this one came up purely because of the web) is hyperlinks to a document. If there are more links to a particular web page, it's probably more important; that's the intuition. Those of you who are familiar with Google's PageRank know that it is based on this concept. Of course, it adds a lot more, but at the core this is what it's about: important sites tend to have many links pointing into them, so the number of hyperlinks is a measure of the popularity or prestige of a site. For lack of time, I'm not going to cover this slide in detail, but the PageRank algorithm sets up a set of linear equations and essentially solves them, because the definition of PageRank is kind of circular: it says the rank of a page depends on the ranks of the pages that point to it, but this page may also point back to some of those pages, so their ranks are in turn defined in terms of the rank of this page. It's a circular definition, but that's not a big problem, because it yields a set of linear equations which can be solved.

The next part is something I mentioned briefly: information retrieval and structured data. There's a lot of work on information extraction. For example, you have an ad for a house, or a person's resume, and you want to extract the important attributes from that textual document. Extracting data from resumes is a big business, actually. If you want to find the right person, the resume is textual; but if from it you can figure out that this person has expertise in these areas and has worked in these companies, maybe you should look closely at that person. The extracted information could be stored as relations or XML, and then you can do keyword querying on this structured representation, extracted from unstructured data. That's important, but you may also want to allow SQL or similar querying on it.

This slide says XML, but in recent years (I should have updated the slide) XML has kind of vanished from this field, in favor of what is called RDF. RDF stands for Resource Description Framework. Its original purpose was to give metadata about web pages, but that's kind of irrelevant at this point. RDF basically allows you to store information which you can think of as a graph, or as a set of triples, such as (ID 221, name, John), and likewise (ID 221, address, something or the other). Essentially you have some kind of object identifier, then an attribute name, and a value. The value itself could be another object identifier: you can say the parent, or father, or whatever, of ID 221 is ID 101, or something like that. This is a way to represent data when you don't have a fixed schema, because you can keep throwing in new things: new attribute names can keep appearing, and there is no specific relation name. So it is a very flexible way of representing information when you don't have a schema.
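Here is a tiny sketch of this triple view in Python, echoing the lecture's example. The identifiers and attribute names are made up, and a real system would use an RDF store and a query language such as SPARQL rather than list scans.

```python
# RDF-style data as (subject, predicate, object) triples; all values
# below are invented for illustration.
triples = [
    ("id221", "name", "John"),
    ("id221", "address", "12 Main St"),
    ("id221", "father", "id101"),   # the object can itself be an identifier
    ("id101", "name", "Henry"),
]

# Query: what do we know about John's father? First follow the edge,
# then collect all triples about the target node.
father = next(o for s, p, o in triples if s == "id221" and p == "father")
print([(p, o) for s, p, o in triples if s == father])   # [('name', 'Henry')]
```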
RDF has taken off, and there's a lot of RDF data out there. For those of you who are interested in research, this is a good topic to look at: there are a lot of data sources with RDF, and there are interesting things you can do with them, ranging from query processing on RDF, to data mining on RDF graphs, to many, many other things, just because there are so many data sources and the structure is very rich.

With that, let me wrap up with temporal data. What is this temporal data business? I briefly touched on it when I was talking about relational design. The idea is that each fact may be true only for certain periods of time. I am a professor here, but I was not always a professor here, and I will not always be one; I will retire at some point. So the fact that I am a professor here is a fact now, but it has a time duration: a starting point, and an ending point which is currently set to null, meaning it continues until something happens, that something being my retirement, or my leaving, or whatever else. Data which has a time period associated with it is called temporal data, and most data actually has a time period implicitly associated with it. Things do change; not everything, but many things. Sometimes the time period is from now to indefinite, or from the beginning to indefinite, but things do change.

So temporal databases deal with many aspects of time and databases. One part is: how does this affect schema design? The second part is: how does this affect queries? For example, suppose I do a join between, let's say, students and teachers. I have been a teacher here from 95; somebody who was a student here before that didn't overlap with me, and I had no connection with them. So time might play a role in deciding which things are related when I run a query. More directly, a course's title and syllabus have a temporal aspect, because they may change over time for a particular course ID. If I took a course and want the title and syllabus of the course as I took it, you have to take into account the time when I took it. So there are many, many aspects. Then there is efficiently executing queries which involve temporal selections. Then there is temporal rule mining, which I mention because Sunita talked about it earlier: rules which change with time. So there are many aspects related to time in databases and in data mining. That's a very quick overview; there's a lot of material on this, books and plenty of other stuff.
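Before we move to questions, here is a small sketch of the temporal-join idea, using Python's sqlite3. The schema, the year granularity, and the convention that a NULL end year means "still true" are all assumptions made for illustration.

```python
import sqlite3

# Temporal data sketch: each fact carries a validity period.
# NULL end_yr means the fact is still true.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE teaches (person TEXT, start_yr INT, end_yr INT)")
conn.execute("CREATE TABLE studies (person TEXT, start_yr INT, end_yr INT)")
conn.execute("INSERT INTO teaches VALUES ('sudarshan', 1995, NULL)")
conn.executemany("INSERT INTO studies VALUES (?,?,?)",
                 [("anil", 1990, 1994), ("beena", 1996, 2001)])

# A temporal join: only students whose period overlaps the teacher's
# period are considered related.
rows = conn.execute("""
    SELECT s.person FROM teaches t, studies s
    WHERE s.end_yr >= t.start_yr
      AND (t.end_yr IS NULL OR s.start_yr <= t.end_yr)
""").fetchall()
print(rows)   # [('beena',)] -- anil left before 1995, so there is no overlap
```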
Maulana Azad, please go ahead. Sir, one slide escaped yesterday, the JSON slide, so I wanted to ask a question on JSON. I was working with it, but I do not know how to pass dynamic values into the JSON string. That is what I wanted to know.

So, first of all, I didn't explain JSON, so let me do that. What is JSON? JavaScript is a language which has a notion of objects, and it's a very flexible language; it's a scripting language, and scripting languages tend to be very flexible. Among the basic constructs in JavaScript is the object. An object can have attributes, and the set of attributes an object has can vary. There is some kind of a type system, but you can have untyped objects, each with its own set of attributes. When you store an object in JavaScript Object Notation, what you're doing is essentially writing out, within the object, attribute names and values; you can think of it as name-value pairs. Now, JavaScript also has the concept of an array and of a map, both of which are internally the same thing, meaning that an attribute can itself be a map: given an object, an attribute name, and a key, it returns another value. What do I mean by this? Suppose I had an object O1, and one of its fields, let's say marks, is an array of 10. So there are 10 marks, but internally marks is basically what is called a map, which takes a key and returns a value; in this case the expected keys are 0 to 9 or whatever. On the other hand, maps in general can be more flexible: I can even say marks of "John" and set it to 5, marks of "Peter" to 10, and so forth, and I can add anything else I want, marks of "Salman", and so forth. So all of this is a very, very flexible type system.

And there is a string representation of this, which lets you ship a JSON object to another system. There is no direct notion of pointers, but you can create objects which have nested structure, and those nested objects in turn can have further structure. So there is a way of serializing these JavaScript objects and shipping them over the network. This standard serialization format with a very flexible schema proved to be very useful for applications, whether or not they used JavaScript. It became popular because anything which runs in a browser today pretty much uses JavaScript, and JSON is a great way for these scripts to talk to a back-end application. Subsequently, JSON came to be used even for storing data. For example, I talked about the PNUTS system, which is a key-value data store; instead of creating their own schema system, they simply said, we will store JSON objects. The objects can then be interpreted, and you can look for fields in them, because it's a standard representation. So that's what JSON is about.

Now, coming to your question; can you just repeat what it was? I actually wanted to know how to pass dynamic values, because in the strings we have seen, the values are written in quotes, so they are all constants. How can we pass dynamic values at that particular place in the JSON string? I have used JSON with HTTP POST, and it works, but I am not able to supply dynamic values. Okay, so you want to do this from Java; you want to create a JSON string from Java. I have not used the JSON libraries in Java, so I am not sure of the exact interfaces, but I am pretty sure it should not be too hard, though I can't answer your question right away. The easiest thing, of course, is to create the string yourself and pass it, if you already know how to produce the JSON form. But what you want is to create a structure and have it encoded as JSON. There are things which let you convert from Java objects to JSON, from XML to JSON, and so forth, and there must be an API to create a JSON object, add fields to it, and so on, just as the DOM tree lets you modify an HTML document. There should be a similar API for JSON; I have not used it, so I am not 100% sure about it.
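Since the question was about Java libraries, which the answer above doesn't vouch for, here is the general build-then-serialize pattern sketched in Python instead. The object contents are invented; the point is simply that dynamic values go into a native data structure, which is then encoded, rather than being pasted into a string by hand.

```python
import json

# Build a native structure from dynamic values, then serialize it.
# The same build-then-encode idea is what JSON libraries in other
# languages, including Java, provide.
name, score = "John", 42          # dynamic values, not hard-coded in the string
obj = {"name": name,
       "marks": {"John": 5, "Peter": 10},   # a map-valued attribute
       "score": score}
payload = json.dumps(obj)          # the string you would send in an HTTP POST
print(payload)                     # {"name": "John", "marks": {...}, "score": 42}
```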
The use of JSON I have looked at is mostly from within JavaScript, where the language itself supports it directly. But if you are doing it from Java, there are APIs; that's the best answer I can give you, sorry about that.

Sir, I have gone through your papers on parametric query optimization, Picasso, Addis, and progressive query optimization. I was wondering how such changes can be made in an optimizer, because we cannot get access to the optimizers of SQL Server and Oracle. So how do you make changes in an optimizer? Okay, that's a good question, and there are two answers. First, if you want to change the PostgreSQL optimizer, that code is very much available to you; you can put your changes into PostgreSQL. But there are other options. For example, at IIT Bombay we built an optimizer based on the Volcano framework a while ago; it was built around 1998, 1999. That code base is available; it's called the PYRO optimizer. More recently we have been rewriting it: it's now in Java and does a bunch of other things, incremental optimization and so forth. That version is not ready for release yet, but the original PYRO optimizer, written in C++, is available, so you can send me email and I'll be happy to share a link to it. So those are two ways in which you can get at actual source code. Now, the parametric query optimization work you mentioned was actually built partly into this optimizer, but also partly done as a layer on top. Some of the work a Ph.D. student did was on how to do these things without going into the optimizer: doing it on top of an optimizer, or with minimal changes to one. I think that code might also be available if you are interested. I hope that answers your question; send me email and I'll give you more information. Thank you, sir.

Hello, sir, this is Swami Maitreya. Sir, I have a question on cascading rollback. Let us assume a DBMS which implements lock-based concurrency control, and a cascading rollback occurs. Then, later on, how does the system regenerate the transactions in the same order as before? So the answer is: it doesn't. The simple answer is no; the system will just roll back, that's it. After that, it's up to the user who launched the transaction to rerun it. This might sound like a bad thing, but that's life. What happens is that when you try to execute a transaction on a database, it might just be rejected, so sometimes you might see this happening. You don't see it very often, because rollbacks are rarely forced on you. First of all, cascading rollbacks only happen if you read uncommitted data, which no database allows by default; the default is read committed. So unless you ask for it, you will not see uncommitted data, and cascading rollbacks won't happen. Rollbacks can still happen for other reasons, but they are rare, and if they do happen, it's the job of whoever submitted the transaction to resubmit it. The database will not take it upon itself to re-execute anything.

I am Abhijit Aich from Silicon Engineering College, and I have a question: can we customize SQL queries? I have seen select star from a particular table and all; can I customize that query according to my own choice? Is that possible in SQL? Customize the queries how? In select star from table, I want to use my own form instead. Can I do that? No, no. You can't change the language.
The SQL language is defined; you can't go around changing the keywords and so on. That kind of change is not allowed.

Sushila. Hello. Please go ahead. Yeah, it's on. I have a general question, sir. In Oracle 10g and 11g, the g means grid. In the same way, does MySQL support grids, sir? Grids or clusters? That's the first question. Second: in data mining and data warehousing on big databases, is MySQL used nowadays, and are there any real-world examples of MySQL? Thank you, sir.

Okay. So, first of all, this business of grid, g, i, and so on that Oracle uses is just marketing blah-blah; don't take any of it very seriously. What exactly was the grid? I have no idea; these are all marketing terms, so forget that. Now, if your question is whether, like Oracle, MySQL can be run in clusters, a number of database instances working together so that if one of them dies, queries can go to another, then yes: MySQL and other databases have also been used to build various forms of clusters, though maybe not in exactly the same way Oracle does it. I didn't get into database architecture in this course, but there's something Oracle calls clusters which others call shared-disk parallel databases. These are pretty good for providing high availability by keeping the disks outside the database servers: if you have multiple database instances running off the same disks and one of them fails, the others continue to run and take over the load. Oracle has a nice product based on this technology; I don't know if MySQL has an equivalent. So that is good for one form of high availability. But these days many people use a different form of high availability, where the whole database is replicated somewhere else. That kind of replication is supported today by PostgreSQL (I know, because it was used in a project I was involved in), and MySQL, I think, also supports it; and of course Oracle, SQL Server, and DB2 have all supported it for a very long time. That's the form of high availability which is more commonly used these days, although the other one also has its uses and is still used widely.

Is MySQL used in live systems? Absolutely. Any number of people use MySQL, any number use PostgreSQL, and Oracle; they're all widely used today, so you don't have to worry about that. There was a time when PostgreSQL and MySQL lacked high-availability features; today even those are available. Of course, there are still some benefits to using a commercial product like Oracle. The benefits have come down a little over the years, but yes, it has some performance edges, and it has features which the other systems don't have. If you need those features, it's probably the right thing for you; if you don't, one of the free ones may be good enough.

Yeah, I see you have another question. Sir, according to you, which is the best commercial database? You want a beauty contest of databases? You can't rank them as best and so on; each one has its plus points and its minus points. But if you want to know the good things about the commercial and open-source databases, I can tell you a few. SQL Server, for example: one of its plus points is that it has a really awesome optimizer.
They do some really nice optimizations there; it can optimize queries beautifully, even complex queries that others would struggle with. That's one of the nice things. Another nice thing they did is put in a lot of features for simplifying database administration, which used to be a big drawback for Oracle: administering an Oracle system was a very tough job. It gave job security to Oracle DBAs, but that's not necessarily what a company wants; they want the system to be easier to administer. SQL Server did a very good job of that, and Oracle has since also done some work along those lines, but today SQL Server is still easier to set up and administer than the others. As for the open-source databases, their biggest benefit is that they are free, which is a huge benefit if you look at the costs of some of these other systems; but some of them have other advantages too. PostgreSQL, being open source, is something you can add to and play around with, and people have added features to PostgreSQL which you might find useful. I'll stop there if you have any follow-up.

Sir, in MySQL I used the CREATE INDEX SQL clause, but I don't get what the exact use of CREATE INDEX is. How do we use it, and what is it for? It creates an index on your table; that's exactly what it does. And what is the point of an index? To speed up access to the table. If I have a query which says find me the student with ID 21, I need an index, right? So CREATE INDEX is just the SQL clause which lets you create that index. Am I missing something in your question? How to use the index? Ah, now I see what you're getting at: how does the index get used? You don't explicitly say use the index; you just write your SQL query, and the optimizer chooses the best way of executing that query. If the index helps in executing your query, the optimizer will use it. And your assignment of, I think, yesterday or the day before, which was on query plans: that's exactly what we did there. We created an index, and the goal was to run some queries and see how the plan changed. That was exactly the point of that assignment: to see how indices get used and how they affect the running time as well.
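Here is a sketch of this, using Python's sqlite3. The student table and its data are invented, and the exact plan output text is SQLite-specific, but it shows the key point: the query never names the index; the optimizer picks it up on its own.

```python
import sqlite3

# Create a table, populate it, and add an index; data is invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO student VALUES (?,?)",
                 [(i, f"name{i}") for i in range(1000)])

conn.execute("CREATE INDEX student_id_idx ON student(id)")

# The query does not mention the index; the optimizer chooses it.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM student WHERE id = 21").fetchall()
print(plan)   # the plan mentions student_id_idx when the index is chosen
```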
I need data mining techniques for software engineering documents like SRS documents, for classification or clustering. I'm sure people have looked at this; I know software engineering is a good domain for data mining, especially with Indian companies having a lot of such material. But I don't know the specific answers to the questions you are asking; I'm not aware of the work in that area. This is not my research area, so you would have to look elsewhere for those answers. They are good questions, but I don't know the answers.

Next, text mining: you want to know how to find more information on text mining. There is the book by Professor Soumen Chakrabarti on web mining, which has a fair amount of material on text mining. There is also another book, by Manning, Prabhakar Raghavan, and others, which is available free online. Let me just write this down; I'll put it up on the Moodle page later. I don't remember the exact title of the second book, but it's by Manning; I think that's the right spelling. So these are two textbooks which talk about text mining, and you can look up both of them. The Manning textbook is actually available free on the web, so it's pretty easy to access.

Yeah, back to you. So what is the difference between data mining and knowledge discovery? There is no real difference; they are just two terms which different people came up with, and they mean the same thing. If you look at the top two conferences in the area, one is called Knowledge Discovery in Databases, KDD; the other top conference in the same area is the International Conference on Data Mining, ICDM. So yeah, it's the same thing; there's no difference.

Hello, IPS Academy Indore, please go ahead. Hello sir, good afternoon sir. Sir, may I ask a question about how soft computing techniques are used in data mining? Okay, so this business of soft computing is a very fuzzy area, and I don't know much about it. What I do know is that too many people are writing too many papers which use soft computing, fuzzy logic, and so on, and that area is overworked; it's time to move on to other areas. That's my rough conclusion, seeing the number of papers coming out in that area which don't have a direct impact, impact in the sense of how the work affects other areas. Certainly there is work in that area related to data mining, but it's not an area I'm familiar with, and maybe you should focus on other areas which are having more impact in the real world. To the best of my knowledge, this area has not had much impact in the real world except in some narrow domains. That's all I can say.

Which areas, sir? Which other areas should we pick? Well, I did mention several areas in the course of the talks I gave on research, and throughout this workshop, and there was also material on data mining. We have given some links to conferences, and if you go to those conferences and read papers, you will get an idea of what the other areas are. I think this is an important question many people have: what area to work on? People get biased too much by the areas that people they know have worked on, and that's not necessarily the best way to pick an area. What happens is that something becomes a fashion; it may have made sense initially, but when too many people work on it there's not much new left to do, everyone is doing the same thing over and over, and it gets very boring. What you want to do is find new areas, and one way is to look at papers in recent conferences, in particular the conferences regarded as good ones in the area, because work that has impact often appears there. I have given a few links to conferences in databases and data mining, so you can go there, read papers, and find things which appeal to you. Instead of reading the full papers, note that many of these conferences now provide slides from the talks as well; it's easier to read the slides first to get an overview of what a paper is about, since reading a paper takes a lot more effort. So if you can get the slides, read those first, get an overview, then pick an area which looks interesting to you and read more in that area. The first area you pick may not be the right one; you might realize that it assumes a lot of background, let's say probability and statistics, which you don't know enough about.
Then you might say, I don't want to look at this area, I'll go somewhere else. So that is one way to look for areas; there are many other ways, but this is one.

Hello. When we upload images to a social site, what happens while our images are being loaded onto the server, and how do they maintain all this in a database? So the question is where these images are stored. First of all, images are usually stored as files, not inside a database. These companies typically have a very big distributed file system in each data center, and they have many data centers; your image will typically be stored in a few of the data centers, for redundancy, and that's it. It's stored in the file system somewhere, and the database simply stores a pointer, an identifier, for that image, so it can be retrieved later from whichever copy is alive at a point in time.

Sir, my question is related to business intelligence tools; you once mentioned Pentaho. Is there any demonstration of how to use it and work with it available, and can you provide a demo of it? Can I provide a demo? No; I have not used it myself. I know it is used, and I know many people have used it, but I have not used it myself. It is available for download, though, so it should be relatively straightforward to download and set up, and I'm pretty sure the company provides documentation to help you get started. What I would suggest is to just go to the website, download it, read the manuals, and understand how to use it; that should do. I don't think it's very hard to use.

L.R. Tiwari, Meera Road. Sir, just a short while ago you said something about research. My question relates to that: how do we encourage or promote postgraduate students, MCA students, and especially M.Tech students, to participate more and more in research, innovation, and these kinds of activities which have a direct impact on society? Most importantly, since we are in professional studies, how do we encourage students so that what they do reflects on society?

That's a good question, and there's no simple answer; this is something we are all trying to do all the time, so I can't claim to have the final solution. But there are a few things that do work, which different faculty here have found to work in different ways. One thing: if you want students to do more academic research, publish papers, and so on, the best way is to have courses where you study papers. Run a course like the CS632 course which I run; every faculty member at IIT more or less runs a similar course, one which focuses on research in a particular area or sub-area and exposes students to lots of research by making them read lots of papers. When they read a lot of papers, a certain maturity comes in, and then they get ideas about what to do. At the beginning they have no clue about an area, so it's very important to read lots of papers, and the best way is to teach a course where a lot of papers are read. Many times I end up reading a lot of new papers when I teach that course, so it's good for me also. It's not that I need to have read all the papers up front: if I want to learn an area, one of the best ways is to teach a course in that area.
It's a paper-oriented course, so it's joint learning: the students and I read some new papers, and when I teach, I get questions; I also make students present the papers they have read, and as they present, some things become clearer. So teaching a course like this is a good joint-learning exercise to do with students; if you have that flexibility, you should do it. It really seeds an area, and after that, things become a lot clearer.

The other part of your question was social relevance. Now, it's hard to combine these two things; social relevance and publication don't necessarily go hand in hand. If you can combine them, fantastic, but they are somewhat orthogonal. If the goal is social relevance, there are other approaches that other faculty have used to motivate students: they pick projects which they believe will have a social impact. Some colleagues of mine here work on technology for rural areas; others have worked on affordable computing and so forth, where the goal is not publication but building tools that people can use, and that also motivates students. Some students get motivated by one, some by the other. Both are valid things to do, and it's very good to do at least one of them. It could be social impact; it could even be building products. It doesn't have to be research: it could be building a product people might use, which you might even be able to sell; maybe not right away, but it could form the basis of something you would sell. It's always good to have students motivated to do something, rather than just come, take some courses, write some exams, and go; that's not satisfying. There should be something more than that, and we try to do that with our students. It doesn't always work: some students lap it up and do a fantastic job; some don't bother and don't put in the effort. It varies.

There's just one last question pending, so I'll take that very last question, even though there are still a lot of people waiting. Savita College, please go ahead; you'll be the last people to ask questions in this workshop. Good afternoon, sir. First of all, thank you for the excellent presentation, sir. My question is this: I'm doing my research in medical diagnosis, and in that I'm handling multivariate data. Is there any tool for handling multivariate data? I have to do a lot of experiments based on multivariate data analysis, so that's my first question. The second question is: what is the role of big data analysis in medical prediction? This is a prediction system. Those are the two questions, sir.

Two very good questions. Unfortunately, I'm not an expert in this area, so on your first question I have absolutely no idea of the answer; I don't really work in data mining, and I have not done anything with multivariate data, so unfortunately I have no answer for you there. On the second question, the bottom line is that I think medical diagnosis is a fantastic area; this is a wonderful area to work in. The big problem is usually getting data, but as our hospitals get more computerized, you will get better and better access to data, so it's a good field to be in for the future. Your question was about the role of big data in this. There have been some interesting things people have done; for example, Google has used search logs to detect whether a particular disease, say flu, is spreading in an area.
When people start searching for flu, it is possible that it's because there is a flu epidemic going on. It could also be because a news program talked about a flu epidemic, not because there actually is one, so they need to differentiate between these kinds of signals. The problem here is getting access to this kind of data; it's a little harder. Now, there are some sources which many researchers use; for example, Twitter data is easy to get access to, and many people have done analysis on Twitter data. In the Indian context, I don't know how many people use Twitter. Obviously people are using it; we do hear of some ministers who use Twitter. But is it popular enough for data analysis on it to make sense? I don't know. It might be; if not today, then maybe tomorrow. If it takes off, it is definitely big data, because there are a huge number of Twitter posts, so if you can use it to analyze and draw a conclusion, good. But again, I'm not convinced that Twitter is necessarily the right source here. Are people going to tweet that they are down with flu, or have jaundice? I don't think so. Maybe they will, but I don't know. If, however, you had actual medical data from hospitals, sales of medicines, and so on, that could be a wonderful predictor of what is happening, and it's something we need badly in India. It happens informally today: doctors see that a disease is going around, and when they see somebody with a symptom, they tend to believe it is that same disease; the diagnosis in one patient helps the next patient. Can you scale this up across large numbers of people? If you can, it would be great. So I hope you work in this area and do some good work, but I don't know enough about the area to help you much more than that. Sir, beyond the big data analysis, we are also taking the next steps: based on past data, we are predicting the diseases of the future, for the future generation. That is the motivation behind the question. Thank you, sir. Thank you. Thank you very much. With that, we will wrap up the workshop.