Good evening everyone, and welcome to the last session for today, on research in databases. We have changed the session a little because we will have a guest joining us in a few minutes: Professor Sunita Sarawagi, who is one of the leading international experts in the area of data mining. She started her PhD in the area of databases, but after her PhD she moved into data mining; she has been PC chair of various leading conferences and is very well renowned in this area. She will be joining us in about ten minutes.

The goal of this evening session was to have a discussion on research in general. Last time I presented a few points on how to go about doing research and took some questions on those; I am sure many of you have more questions. The other thing is that last time we did not cover any specifics in detail, and some people had been asking me about data mining research. So this time I requested Professor Sunita Sarawagi to join us, and you can ask any questions you have on data mining, which she will be happy to answer. She is here and will join us in a couple of minutes. So let us start the discussion with questions from the various centres; quite a few centres have raised their hands. Amity University, Haryana, if you have a question please go ahead.

Sir, sorry to ask this question, but my question is on normalization. I want to construct a relation that violates all the normal forms. Is there any real-life example of that?

An example that violates all the normal forms. Well, if it violates first normal form it automatically violates everything else, because 1NF is a prerequisite for all the other normal forms, so that case is trivial. But if you want to explicitly show violations of the other normal forms, just throw in a few dependencies: add a few attributes, then add whatever functional dependencies and multi-valued dependencies you please, including one that violates BCNF. It is not too hard; it is a small exercise. Take five attributes, add a few functional and multi-valued dependencies, and you can tailor these to violate each of the normal forms, and you are in business. Does that make you happy?

Is there any real-life example we can give?

A real-life example. You could translate these back to real life. If you come up with a really bad design involving real-life attributes, sure, you can violate all of these. So do it in the abstract and then map the attributes to some meaningful real-life attributes. It is not all that hard; it is a fun exercise. There is nothing deep involved here, but it is something you can play around with and show your students. Any other questions? Thank you, sir.
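To make that exercise concrete, here is one possible construction; the relation and the dependencies are illustrative, not taken from the discussion. Take

    Enrolment(student, course, phone, dept, dept_head)

with candidate key {student, course, phone} and the dependencies

    student ->> phone    (a student has several phones: this MVD violates 4NF)
    student -> dept      (a partial dependency on the key: violates 2NF, hence also 3NF and BCNF)
    dept -> dept_head    (a transitive dependency: violates 3NF)

and if you instead store all of a student's phone numbers in a single set-valued attribute, first normal form is violated as well.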
We have Maulana Azad NIT, Bhopal. If you have a question, please go ahead.

Sir, what is a categorical database, and how can we mine categorical data? There are so many algorithms, but not a single one is perfect. Is there any way to solve this problem?

Okay, so you are leading with a data mining question. I will invite Professor Sunita Sarawagi, who is here, to join us and answer it. So let me introduce her. Professor Sunita Sarawagi, as I mentioned a few minutes back, is one of the leading experts in the world on data mining. She has worked on many sub-areas, including graphical models, the intersection of OLAP and data mining, and doing data mining through SQL, among many other things. And she has been program chair of the leading data mining conferences in the world, which means she is recognized as one of the leading experts. So let me transfer the question to her.

So, actually, I will ask you a question back: what do you mean by mining categorical data? What kind of mining are you trying to do?

Data mining. There are so many algorithms, like k-means and k-modes.

Oh, so you are trying to do clustering of categorical data. For clustering categorical data, and clustering in general, no algorithm is perfect; it is very much tied to the application. So what is the goal of your mining exercise, and what are you hoping to get out of it? What is your objective for this mining?

We want to mine categorical data, meaning data which is ordinal or nominal.

Yeah. So, if the goal of mining is to discover something new and serendipitous which you had not expected, then you can apply clustering or association rule mining. And for clustering there are many algorithms. A long time back there was a paper by Prabhakar Raghavan on clustering categorical data, and there have been many follow-ups. In general I will give a meta-answer: if you know of one good paper on clustering categorical data, put that paper's title into Google Scholar and look for forward citations to it, that is, papers which have cited it. Among those you can find other algorithms which are possibly improvements over the one you have tried and which is perhaps not working for you. I am sure there are lots of such algorithms; actually Professor Sudarshan will show... okay, maybe the network is not working. You can continue, I will bring this up in the background. So you will find other algorithms that way.

And if you have categorical data and you are looking for frequency patterns, then sometimes you are better off just doing the exploration manually, by loading the data into an OLAP tool. Professor Sudarshan must have talked about OLAP. No, I did not. Well, his book covers OLAP among the advanced topics, so you can read up about it there. Sometimes you can find things with greater control using such OLAP tools than by depending on some black-box algorithm.
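Since the questioner mentioned k-means and k-modes, here is a minimal sketch of the k-modes idea for categorical records, assuming fixed-length tuples of categorical values: the distance is the number of mismatched attributes and the cluster "centroid" is the per-attribute mode. Real implementations, which you can find through the forward-citation search described above, handle initialization and ties far more carefully.

    # Minimal k-modes sketch for categorical data (illustration only)
    from collections import Counter
    import random

    def mismatch(a, b):
        # Hamming-style distance: count of attributes that differ
        return sum(x != y for x, y in zip(a, b))

    def mode_of(records):
        # Per-attribute mode becomes the cluster's "centroid"
        return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

    def k_modes(data, k, iters=20, seed=0):
        random.seed(seed)
        centers = random.sample(data, k)
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for rec in data:
                nearest = min(range(k), key=lambda i: mismatch(rec, centers[i]))
                clusters[nearest].append(rec)
            centers = [mode_of(c) if c else centers[i] for i, c in enumerate(clusters)]
        return centers, clusters

    data = [("red", "s", "yes"), ("red", "m", "yes"),
            ("blue", "l", "no"), ("blue", "m", "no")]
    centers, clusters = k_modes(data, 2)
    print(centers)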
Hello ma'am, I am working with a textual dataset, and my problem is how to convert unstructured data into a structured format.

So, there is a whole range of techniques available for converting unstructured data to structured data. Sometimes you might be able to write some very simple rules to do the extraction, and if your rules are slightly more complicated, you can use software like GATE to help you write them. But if your data is really, really unstructured and rules do not work well, then you can use statistical methods, and there are many software packages which you can now download and use for training such statistical extractors. A whiteboard, if you want. Statistical extractors; I am just listing some options. One option is to use Stanford's NLP package. Writing a bit slowly: another is to use MALLET, Andrew McCallum's tool. A third option, if you already have a lot of the infrastructure around extraction and just want the core engine, is a software package for extraction that I have developed; it does not have a lot of the supporting routines which, for example, Stanford's NLP package has. That is the CRF toolkit, which you can get from CRF.sourceforge.net. There must be other statistical extractors that I cannot quite recollect now. There is also the OpenNLP software package, which is quite good; I do not remember whether there are two n's or one in that name, but you can just search for the term in Google and you will find it.

You should use the statistical extractors only if your rule-based extractor is not able to handle the variety of data you see in your source. But unfortunately, even here it is not as if there is one golden software package such that if it does not work, nothing else will. It is a matter of trying out different things, and something will partly solve your problem.

Ma'am, are these tools freely available on websites, or purchased?

These are all freely available, licensed tools. And before these, you can try GATE. Those were the statistical extractors; but if you want to start with something classical, which has been around for a long time, there is the rule-based tool environment called GATE. Just type "GATE NLP tool" and you should be able to find the pointer to it, from a UK university.

Ma'am, what is the capacity to convert unstructured data into a structured format? How fast will the conversion be?

A typical rule-based extractor will be quite fast, but the statistical extractors will be slower, and it depends. If you are looking for numbers, I do not have exact figures at the top of my head, but most of them are slow, much slower than, say, taking an HTML page, converting it into a DOM tree, and indexing that page. If that takes time x, these will take anywhere between 10x and 100x.

And recall and precision? In terms of accuracy of extraction: if you are extracting standard entities, say identifying person names and organization names from something like news articles, which are very nicely formatted English text, you can get pretty decent accuracy, going up to 90 percent. But if you are extracting even the same kinds of entities not from news text but from arbitrary plain-text documents on the web, accuracy might be much lower, maybe even 70 percent. For any other kind of new entity that you define, it is of course very subjective: it depends on the type of the entity and how much training data you give it. Thank you, ma'am.
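As a minimal sketch of the rule-based option just discussed, before reaching for the statistical extractors: the patterns below are illustrative assumptions, not rules from GATE or any of the tools named.

    # Rule-based extraction sketch: regular expressions as "rules"
    import re

    text = "Contact Dr. A. Kumar at kumar@example.org or +91-22-2576-7000."

    rules = {
        "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
        "phone": re.compile(r"\+?\d[\d-]{7,}\d"),
        "person": re.compile(r"(?:Dr|Prof|Mr|Ms)\.\s+(?:[A-Z]\.\s*)*[A-Z][a-z]+"),
    }

    # Each rule turns a span of unstructured text into a labelled field
    structured = {label: rx.findall(text) for label, rx in rules.items()}
    print(structured)
    # {'email': ['kumar@example.org'], 'phone': ['+91-22-2576-7000'],
    #  'person': ['Dr. A. Kumar']}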
We can show it. Yeah. So, actually, I was talking earlier about using Google Scholar to find more algorithms on a particular topic. I mentioned one paper on clustering categorical data by Prabhakar Raghavan, but the reference was not exactly specific in my mind; Professor Sudarshan then did the search on Google Scholar and found the specific paper. Now look at that paper: there is a link called "Cited by 246". If you click on this "Cited by" link, you will see the papers which have cited this paper, and among them you can find articles which themselves have high citation counts and which look like improvements over this algorithm, and try them out. Forward citations are extremely useful for figuring out whether there are better methods than the ones you have been trying. Of course, these days you might find lots of papers on a topic, so a rough but good criterion is to prefer papers which are more recent, because they will hopefully have discussed all the previous work; which have a decent number of citations themselves; and which appear in good venues, that is, top-tier conferences and journals.

We have Mahatma Gandhi Mission, Noida. Please go ahead. How can we create a data cube in Oracle?

I have not played around with the data cube facilities in Oracle specifically. The SQL standard has a CUBE construct: just as you write GROUP BY, you can write GROUP BY CUBE, and that gives you relational output containing all the groupings. For those of you who are not familiar with this, let me use the whiteboard and explain. Suppose we want to analyze sales of items by certain properties. Say we have a relation recording every single sale: a sale ID, and then various attributes which are called dimension attributes, maybe city, date, time, and then an amount, where time is the time of day. Let us keep it simple. For analysis you may want results such as

    select city, date, sum(amount)
    from sales
    group by city, date

that is, the total sales of whatever items were sold in each city on each date. That gives one particular aggregate, but many more aggregates are possible: maybe we want to group by city alone, or by date alone, or by city, date and time, and maybe by a number of other attributes as well, say properties of the customer who bought the item, whether a frequent or occasional customer, maybe the customer's income level. So there are many attributes, and you may want to group by various subsets of them. The SQL standard therefore has the cube construct: instead of group by city, date, time you say group by cube(city, date, time), and essentially it creates a union of a large number of group by queries, one for every subset of those attributes. Of course, if a row comes from the grouping on city alone, what is the value of the date column? There is no unique date, so it becomes null. That is the CUBE operator in SQL, and I am pretty sure Oracle supports it. Beyond this, Oracle does have OLAP tools; I am not familiar with those, I have not used them. Maybe Sunita has something more to say. No, I have not used Oracle's OLAP tools either. I used SQL Server's a long time back, say twelve years ago; at that time it was quite useful and it also had an Excel plugin. Yeah, and I am sure things are much better now.
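To make the semantics of CUBE concrete, here is a minimal sketch of what it computes: the union of GROUP BYs over every subset of the dimension attributes, with None standing in for SQL's NULL. The data is illustrative.

    # Emulating GROUP BY CUBE(city, date) over a tiny sales table
    from itertools import combinations
    from collections import defaultdict

    sales = [  # (city, date, amount)
        ("Mumbai", "2013-01-01", 100),
        ("Mumbai", "2013-01-02", 150),
        ("Delhi",  "2013-01-01", 200),
    ]
    dims = ("city", "date")

    def cube(rows):
        # One GROUP BY per subset of the dimension attributes
        for r in range(len(dims) + 1):
            for subset in combinations(range(len(dims)), r):
                totals = defaultdict(int)
                for row in rows:
                    # None plays the role of SQL's NULL for "all values"
                    key = tuple(row[i] if i in subset else None
                                for i in range(len(dims)))
                    totals[key] += row[-1]
                for key, amt in sorted(totals.items(), key=str):
                    yield key + (amt,)

    for grouping in cube(sales):
        print(grouping)
    # (None, None, 450) is the grand total; ('Mumbai', None, 250) is
    # one city's total across all dates, and so on.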
You are discussing aggregation over transaction-level data, using aggregate functions over it; that is the OLAP view and the data cube. But my question is how we can implement a data cube using Oracle: a star schema, a snowflake schema, a fact constellation. We can implement these in Oracle using primary-key and foreign-key constraints over the fact table and dimension tables. So how do we actually do that implementation?

I think your question is how you do schema design for a warehouse, on top of which you can compute these cubes. Is that the question? Yeah. So, we have been looking at various methods for schema design: ER modeling, normalization and so on. For data cubes, the way you design the schema is quite different. I do not have a whole lot of practical experience in this, but the general idea is that in most applications where you need these kinds of things, the schema kind of presents itself: you have a sales relation or some other such thing, one or a few fact tables with a very large number of records. That is the core around which everything revolves in OLAP. Around this core you build various dimension tables. In my quick-and-dirty example I had date and time, but it is not just by date that you want to group: you might want to analyze by quarter, by financial year, or dig deeper and do a monthly analysis. So you build hierarchies on date, and those dimension tables and their attributes are driven by the business needs. Some of them are obvious: once you have a date, and most OLAP is for business analysis, the month, the quarter, the financial year are all obvious candidates. Day of the week is another obvious one if you care about weekend versus weekday demand; for a business actually selling products, demand varies widely between those categories. So there are various hierarchies you can build; they all become part of the dimension tables, and then you can define cubes not only on the underlying date, for example, but also on the levels of the hierarchy, week and so forth.

So you come up with a schema based on the data you have collected, typically fairly simple OLTP data in large volume, plus a whole set of dimension tables around it. And to the dimension tables themselves you usually should not apply normalization. For example, suppose I have a date dimension table with one entry per date. Given the date, I could always use a lookup table to find the day of the week, the quarter, the financial year and so on. Or, a better example: if a row has a city ID, the city ID automatically determines the state, there is a functional dependency from city to state, and from state to country, assuming states have unique names across countries. If you normalized the dimension tables you would split them up, but in the OLAP context people specifically avoid normalization, for efficiency. The goal here is different: you are not updating the OLAP data directly, the updates happen somewhere else and you just copy the data in, so redundancy is a good thing here. If it speeds up your query processing, redundancy is perfectly fine; there is nothing wrong with it. For the dimension tables no attempt at all is made to avoid such redundancy, and the focus is on efficiency. So the design of an OLAP schema is quite different from the design of a traditional relational schema. I hope that partly answers your question; if you have a follow-up, please go ahead.
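A minimal sketch of the star-schema design just described: one fact table and deliberately denormalized dimension tables. SQLite is used only so the sketch runs anywhere; the table and column names are illustrative, and a real warehouse would be Oracle or similar.

    # Star schema sketch: fact table plus denormalized dimensions
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE date_dim (
        date_id      INTEGER PRIMARY KEY,
        full_date    TEXT,
        day_of_week  TEXT,     -- derivable from full_date: stored anyway
        month        INTEGER,
        quarter      INTEGER,  -- also derivable, also deliberately stored
        fiscal_year  INTEGER
    );
    CREATE TABLE city_dim (
        city_id  INTEGER PRIMARY KEY,
        city     TEXT,
        state    TEXT,         -- city -> state -> country: unnormalized on purpose
        country  TEXT
    );
    CREATE TABLE sales_fact (
        sale_id  INTEGER PRIMARY KEY,
        date_id  INTEGER REFERENCES date_dim,
        city_id  INTEGER REFERENCES city_dim,
        amount   REAL
    );
    """)

    # A typical OLAP query joins the fact table to the dimensions and
    # groups at whatever level of the hierarchy the analyst wants:
    print(conn.execute("""
        SELECT d.quarter, c.state, SUM(f.amount)
        FROM sales_fact f JOIN date_dim d ON f.date_id = d.date_id
                          JOIN city_dim c ON f.city_id = c.city_id
        GROUP BY d.quarter, c.state
    """).fetchall())  # empty list here, since no rows were loaded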
Sir, can I ask one more question? Yes. My next question is related to graph mining. When we do graph mining, how do we store the graph in a database, using Oracle?

So, there are a lot of tools for storing very large graphs which have come about in recent years, which fall under big data, and there the goal is to partition the graph across machines; that is the research work I am familiar with. Now, just storing a graph in a database is, at the core, very simple: the schema for it is straightforward. The issue is how this gels with data mining. In my experience, if you store a graph in the database and then want to query it, yes, you can issue SQL queries to fetch data, but it is not necessarily efficient. For example, in our work on keyword search on graphs, if you had to issue SQL queries each time to access a node and the edges out of it, you were dead; it was far too slow. So the only realistic approach was this: you store the graph in the database for persistence, but when your service starts up, it loads the graph from the database into memory, builds an in-memory graph representation, and works off that for keyword queries. I am pretty sure the same holds for any graph mining algorithm where the graph is small enough to fit in one machine's memory and to be analyzed efficiently using the few CPUs connected to that memory.
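A minimal sketch of the pattern just described: keep the graph in the database for persistence, but load it into an in-memory adjacency list at startup and run traversals against memory, not one SQL query per edge. The schema and data are illustrative.

    # Persist edges in a database, traverse an in-memory adjacency list
    import sqlite3
    from collections import defaultdict, deque

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edges (src INTEGER, dst INTEGER)")
    conn.executemany("INSERT INTO edges VALUES (?, ?)",
                     [(1, 2), (2, 3), (1, 3), (3, 4)])

    # One scan at startup, instead of one SQL query per node expansion
    adj = defaultdict(list)
    for src, dst in conn.execute("SELECT src, dst FROM edges"):
        adj[src].append(dst)

    def bfs(start):
        seen, order, q = {start}, [], deque([start])
        while q:
            u = q.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    q.append(v)
        return order

    print(bfs(1))  # [1, 2, 3, 4]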
But today, graphs, social network graphs and so forth, have gone way beyond this, so you have to partition the graph somehow. There has been some work on this, including work we did a while ago in a slightly different context: keyword search on graphs larger than memory. Our goal was not to partition across machines, although that was a side effect, but to keep the graph in the database in some special form and then load parts of it on demand, loading just the minimal parts needed to answer keyword queries. That is where I have some familiarity with this area, but there has been a lot of subsequent work on graph clustering and on search algorithms over graphs.

One of the relevant things here, let me write this down, is a system called Pregel, from Google, which is based on a very old paradigm called bulk synchronous processing, BSP. The basic idea is old, but Pregel is a recent implementation of the paradigm which lets you run a number of computations on graphs that are split across many machines. If you look up the Pregel paper you can find more information; just search for "Pregel" on Google and that should get you to it. As before, scholar.google.com is your friend; it gives you a lot of information about papers. Actually, for graph mining specifically, GraphLab is another software platform, which is supposed to do a better job with irregular graphs. GraphLab was developed in a university; I do not know whether it is as good a platform as Pregel, but it is definitely worth exploring. GraphLab is from CMU.

Anyway, you cannot get Pregel itself; Pregel is not open source, but there are clones of Pregel which people are working on. I think there is an Apache project cloning Pregel, I do not remember the name, but it should be out soon if not already. That provides a platform; it does not implement any graph mining algorithm on its own, so all of that has to be built on top. I do not know whether it provides any support for graph mining as such. I do not know about that either, but my student has been experimenting with GraphLab and found it better; I do not know whether he actually tried Pregel-style software or just did a comparison on paper. So GraphLab is also worthwhile trying; they definitely implement belief propagation, which is basically an inference algorithm on a graphical model, built on top of a large graph. And I have not used Mahout specifically, but there is the Apache Mahout project: a set of programs targeted at mining big data, which sits on top of Hadoop. We will be covering Hadoop tomorrow for those of you not familiar with it. Hadoop is not designed specifically for graph mining; the paradigm is MapReduce, Hadoop is an implementation of that paradigm, and Apache Mahout is a set of data mining tools on top of it, so they can work on very large data. I suspect some parts of Mahout target some kinds of graph mining, though I am not familiar enough with it to say for sure.

My question is related to data mining, specifically association rule mining. As we know, when we apply association rule mining we get a lot of redundancy. Can you give an idea of how to apply a genetic rule of thumb for removing this redundancy, especially which fitness function would be the better one for removing the redundant rules generated by any association rule mining algorithm?

So, I do not know much about genetic algorithms, but many ideas have been proposed since the original association rule mining paper for removing redundant rules. For one set of papers you will have to use the same trick I described for categorical clustering, searching through Google Scholar. There is a line of techniques developed by Heikki Mannila, I do not remember the other author, on using the well-defined statistical notion of the surprise of an item set, given the observed frequencies of its subsets, to filter out high-dimensional patterns which are not surprising given the lower-dimensional patterns. That is one set of techniques. Then in 1998, with colleagues, I did work along a different dimension on getting rid of redundant rules, by exploiting time: any rule which holds steady for a long time is assumed to be already known to the analyst and is therefore not interesting. So you modify the association rule mining algorithm, assuming each transaction also carries a timestamp, to output those rules which show interesting variation along time rather than staying steady. That is another piece of work.

In general, though, if you start from that Mannila paper, and from another paper by Rajeev Motwani and Jeffrey Ullman which started this line of work on finding interesting rules, and look at forward citations to those papers, you will find lots of other work on filtering away the many redundant, not-so-interesting rules that association rule mining produces. There is actually a very large literature on the topic you have asked about, so you will have to do quite some reading. But I cannot comment on the use of genetic algorithms; I am not familiar with them, and I am also not a big fan of genetic algorithms.
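As a minimal sketch of the statistical flavor of this line of work: one simple filter keeps a rule A -> B only if its lift deviates enough from 1, that is, only if A and B are actually correlated rather than just independently frequent. Lift is used here as a standard stand-in, not the exact measure from the papers cited, and the data and thresholds are illustrative.

    # Filter association rules whose items look statistically independent
    from itertools import combinations

    transactions = [
        {"milk", "bread"}, {"milk", "bread", "eggs"},
        {"bread"}, {"milk", "bread"}, {"eggs"},
    ]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions containing the whole itemset
        return sum(itemset <= t for t in transactions) / n

    def interesting_rules(items, min_support=0.2, min_deviation=0.25):
        for a, b in combinations(items, 2):
            s_ab = support({a, b})
            if s_ab < min_support:
                continue
            lift = s_ab / (support({a}) * support({b}))
            if abs(lift - 1.0) >= min_deviation:  # far from independence
                yield (a, b, round(lift, 2))

    print(list(interesting_rules({"milk", "bread", "eggs"})))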
We can take questions from another questioner. MPSTME, Shirpur.

Within the same application, can I use a spatial query as well as a normal relational query? How can I do that?

In fact, that is not at all uncommon. For example, I may have shops which sell certain things, so I have relational data about which shop sells what. I may filter shops based on some relational predicates, and at the same time I want spatial information, say things which are close by. So as long as your spatial data is in the relational database, and the database supports these queries, yes, you can certainly write queries which have both regular relational predicates and spatial predicates. The Oracle Spatial extensions certainly support these combinations, as far as I know. Postgres also has PostGIS; as far as I know it is not very efficient, but if you just want to try out such queries it is a perfectly good platform, and it supports spatial predicates integrated with SQL. So it is pretty straightforward to write queries that combine spatial and regular relational predicates.

The next question: can I use more than one type of database in the same application? For example, Oracle, DB2 and Postgres in the same application. Is it possible or not?

It is possible, but not highly recommended. There may be historical reasons for it: for example, at IIT Bombay we have financial records in an Oracle database, because that is what TCS built, and our own academic records, for which we built the system ourselves, are in Postgres. The moment you do this, there is always the issue of keeping the two in sync. If you add a student, you had better add the student in both databases; if you add a faculty member, likewise; if you record a scholarship in the academic database, it had also better be recorded in the Oracle database. So there is more work, but if it is needed for other reasons it is perfectly feasible. You just have to decide which data will reside in both and set up mechanisms to ensure the two are kept in sync.

Sir, one last question. Suppose I have already created a database backup in SQL Server, with the extension .bck, and now I want to use that database in Postgres, where the extension is .backup. How do I do that?

You want a backup created on one database system to be restored on another. I do not think that will work. You can do an SQL dump, where the system outputs a bunch of SQL, create table and insert commands and so forth; those are a little more portable. Otherwise, there are commercial, and maybe open source, tools to export data from one database, convert the format, and load it into another. I have not used them and am not familiar enough with them, but if you do a Google search you will find these tools. Thank you, sir. Thank you very much.
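A minimal sketch of a query mixing relational and spatial predicates, as discussed above, written in the style of PostGIS; the table, columns and coordinates are illustrative assumptions, and the query is shown rather than executed here.

    # A relational filter and a spatial filter in the same SQL query
    query = """
    SELECT s.name
    FROM shops s
    WHERE s.category = 'grocery'                   -- ordinary relational predicate
      AND ST_DWithin(s.location,                   -- spatial predicate
                     ST_MakePoint(72.87, 19.07)::geography,
                     2000)                         -- within 2 km
    """
    print(query)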
We have Sri Jayachamarajendra College, Mysore. Please go ahead. We use Hadoop and MapReduce. Can we use them for Aadhaar, sir? Your comments?

Can you use Hadoop and MapReduce for Aadhaar? It depends on what you want to do. If your goal is to use Aadhaar to validate a person who has come to a ration shop, then Hadoop and MapReduce play absolutely no role. But if your goal is to do some analysis on the Aadhaar data, maybe even to check for duplicates, somebody who has registered twice and so forth, then maybe these tools would be useful. So it depends on what you want to do. These tools are good for decision support and certain other kinds of bulk querying, but not for small transactions which do just a little bit of work; for those they are not the right tools. There are other tools for that if you want a massively parallel database. Aadhaar is large, and I am sure it has to be split across multiple databases, but does it qualify as really big data from the viewpoint of validation? Maybe not; it is fairly easy to partition it horizontally. The hard part of Aadhaar is to see whether people are applying twice; that is easily the hardest part. Maybe Sunita has something to say about it.

Yeah, that is actually a really hard problem. In general, given a database of size n, if you want to find all possible duplicates that might exist in it, theoretically you would have to do a quadratic-time computation, and of course you cannot afford that. Even short of that it is a very expensive computation: you might have to compare each record with at least 10,000 other records, and in the case of the whole Aadhaar database maybe more like a million. So it is a very expensive operation. There are many algorithms available for duplicate detection in large databases. Some assume a pre-existing index. Others create multiple signatures and compare only record pairs within a group sharing the same signature. And then there are blocking techniques where you create signatures whose ordering is meaningful, sort your data on that order-sensitive signature, and do a sliding-window comparison. So there are many algorithms, which are like fuzzy-join algorithms, which you would have to implement. And in such cases, yes: if you use the signature-based methods, you could use MapReduce.
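A minimal sketch of the signature-based blocking idea just described: records are compared only within blocks sharing a cheap signature, avoiding the all-pairs quadratic comparison. The signature, first letter of the name plus birth year, and the similarity threshold are illustrative choices.

    # Blocking for duplicate detection: compare only within signature groups
    from collections import defaultdict
    from difflib import SequenceMatcher

    people = [
        ("Ramesh Kumar", "1975"), ("Ramesh Kumaar", "1975"),
        ("Sita Devi", "1982"), ("Seeta Devi", "1982"), ("Ramesh Kumar", "1960"),
    ]

    # Cheap signature: first letter of name plus birth year
    blocks = defaultdict(list)
    for name, year in people:
        blocks[(name[0].lower(), year)].append((name, year))

    def similar(a, b, threshold=0.85):
        # Expensive pairwise comparison, done only within a block
        return SequenceMatcher(None, a[0].lower(), b[0].lower()).ratio() >= threshold

    candidates = [
        (a, b)
        for block in blocks.values()
        for i, a in enumerate(block)
        for b in block[i + 1:]
        if similar(a, b)
    ]
    print(candidates)  # the two Ramesh Kumar 1975 variants pair up; 1960 does not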
We have eBit group in Tamil Nadu. Does there exist any widely accepted algorithmic approach for performing outlier analysis on large datasets? I will defer that to Professor Sunita.

So, outlier analysis is like clustering; it is basically the complement of clustering. There is no single algorithm which will work for all kinds of data. There are many algorithms around: for example, if you are looking for outliers in OLAP-style data you would use a very different algorithm than if you are looking for outliers in, say, categorical transactional data, like the kind mined by association rule mining. So no, I would not say there is one widely accepted algorithm; in fact even less so than for clustering, where maybe k-means is that one widely accepted algorithm. In general, the meta-rule followed for outlier detection is: you create a model, which could be a clustering model, and then an outlier is anything which does not fit that model well. With that rule, different instantiations of the model, chosen based on the kind of data you have, give you different outlier analysis algorithms. I have not done much work on outlier analysis for general data, but I did work on outlier analysis for multidimensional OLAP-style data. There we built a multidimensional model, an ANOVA-style model with time-series modeling, looked for outliers in the OLAP data, and the model was a very nice fit: we found interesting outliers. But I would not use that model for some other kind of data; it was specifically chosen for OLAP data, assuming the number of dimensions is not very large. So if you have some specific kind of data in mind, maybe I can elaborate more; but in general, no, I do not know of one widely accepted approach.

One more question, pertaining to SQL queries. It is assumed that SQL queries involving Cartesian products always impose a lot of overhead during execution, and it is suggested that it is better to write, or rewrite, queries so that they do not involve Cartesian products of tables. How far is this true, sir? Is it recommended to write queries involving Cartesian products, or should queries be free of them?

I cannot think of pretty much any meaningful domain where you actually want a full Cartesian product. There are a few rare cases: for example, if your goal is to loop over several combinations of things and you want to code that in SQL, then there are cases where Cartesian products arise, so we cannot rule them out as totally useless. For example, if I have a year table, with each row being one year, 2000, 2001 and so on, and another table of months, January, February, March and so forth, then the Cartesian product of the two gives me every month-year combination, which I might use to drive something else later in the query. So occasionally a Cartesian product like this can be useful, but it is rare. Most of the time, a Cartesian product, unless you wrote it intentionally, is probably an error: you forgot to put in a join condition. I don't know if that answers your question. Okay, thank you.
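A minimal sketch of the one legitimate use just described, generating every (year, month) combination, here in plain Python; itertools.product computes exactly what a CROSS JOIN would.

    # The year x month Cartesian product, used to drive a later step
    from itertools import product

    years = [2000, 2001, 2002]
    months = ["Jan", "Feb", "Mar"]

    calendar = list(product(years, months))  # what CROSS JOIN produces
    print(len(calendar), calendar[:4])
    # 9 [(2000, 'Jan'), (2000, 'Feb'), (2000, 'Mar'), (2001, 'Jan')]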
GRIAT, Kukatpalli, Andhra Pradesh, please go ahead. What exactly is the difference between federated databases and a data warehouse?

Okay. Federated databases is a term generally used when you have a number of databases within your organization, and sometimes this happens because two organizations merge. Air India and Indian Airlines merged, which was a big mess of course, but the fact is that each would have had its own computer system, and if you want to merge the systems, one of them has to dump its system and move all its data to the other, which can cause operational problems. So maybe they continue operating their own computer systems, just as they continue operating Airbus planes and Boeing planes. But now the issue is that you do need data from both databases; they need to see each other's data. So in a federated database, each of them continues to operate as a separate database, but you provide support such that queries run on one of them are allowed to access data from the other, maybe even do joins with data from the other, and so forth. That is a federated database, and you might even allow transactions which span the systems. So the databases remain separate, but you provide some kind of view on top which lets you run queries and updates across the two.

Does the existing functionality stay the same in both? The two continue to run exactly as they did for their own functionality. What you add is a layer on top which lets them interoperate: if a new query on one needs to access data on the other, it can do that; if an update transaction running on one also needs to update some data on the other, it can do that too. So that is what federated databases are: the systems keep their independence, but certain operations are allowed to span two or more databases.

Last question, sir. Can you suggest any sites, the same question I asked previously, regarding spatial data mining? Can you suggest any sites for getting datasets for spatial data mining?

Yeah, I think I said I would provide a link to sites; I have not had time to do that, so let me just write this down so you can search yourselves. There is US data that you can find easily: there is the TIGER dataset, which many people have used. That is for spatial query processing, but I am sure there is a mining aspect. The other thing is that you can get in touch with Professor N. L. Sarda, a senior colleague here at IIT Bombay, who heads the GISE lab; I think that is Geographical Information Systems Engineering or something like that. If you search for him, I think they also have a web page for the lab. They have been collecting datasets from various places; I do not know whether they can share them, but you can certainly contact him and get more information about the availability of datasets from Indian sources, because he has been working on that. Other than that, I know there are many more datasets, of road networks and other such things, which are publicly available; you can search for them. I think there are some European road network datasets, and many others. I cannot name specific ones, but it is not too hard to find datasets; whether they match what you want to do is another matter. If you want to look at a particular data mining problem, does a road network dataset help you at all? Maybe not. I have not worked in this area, so I cannot directly point you to things, but I can help you search for them later. Thanks a lot.
We have Rajalakshmi Engineering College, Chennai. Please go ahead. Hello, good evening sir. This question is regarding association rule mining. While applying association rule mining, we get a lot of overlapping rules and redundant rules. Is there any specific tool or algorithm to deal with that?

So, this question was asked just a while back and I gave an answer; let me write things down this time, so you can then search on Google Scholar. Basically, the first paper on this topic was written by, there were other authors, but I remember two of them, Rajeev Motwani and Jeffrey Ullman. I do not remember the title, but it definitely had association rule mining in it, and it was written roughly in 1997, I think. So that is one pointer. Then there is another paper, by Heikki Mannila, which also talks about interestingness; I am not giving exact titles, just keywords with which you can search, association rules; this was, I think, in 1999. And then I have done some work on this with colleagues when we were at IBM: a paper with Chakrabarti and Dom on temporal association rule mining, which talks about the things I described.

And now, about tools; there was also a question I saw which was just asked online. In general, if you go to www.kdnuggets.com you can find pointers to many data mining software tools, and I think most of those which do association rule mining would by now have some support for managing the very large number of rules that a typical association rule mining algorithm produces. Many years have passed since some of these papers were written, so I expect these ideas have also made it into the tools you can find through that website.

Can we use association rules for mining XML documents? We have XML documents.

Association rule mining takes as input a set of sets, so it does not matter whether your document is in XML format or any other text format: you have to convert your data into this set-of-sets form. And then you also have to think about whether it makes sense to run association rule mining on the set of sets you have created. It really does not matter in what format the data is stored; think about the abstract input required by the association rule mining operation, and check whether your XML data actually contains sets of records where each record, in turn, can be thought of as, or converted through some simple transformation into, a set.
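A minimal sketch of the set-of-sets conversion just described: the miner does not care that the source was XML, only that each record becomes a set of items. The XML layout here is an illustrative assumption.

    # Turning XML records into the set-of-sets input a rule miner expects
    import xml.etree.ElementTree as ET

    doc = """
    <purchases>
      <basket><item>milk</item><item>bread</item></basket>
      <basket><item>bread</item><item>eggs</item><item>milk</item></basket>
    </purchases>
    """

    root = ET.fromstring(doc)
    transactions = [
        {item.text for item in basket.findall("item")}
        for basket in root.findall("basket")
    ]
    print(transactions)
    # two transactions: {'milk', 'bread'} and {'milk', 'bread', 'eggs'}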
Narayana Engineering College, Nellore, please go ahead. Sir, I have one doubt related to human behavior analysis. Can you suggest the best algorithm suitable for analyzing human-specific behavior?

I do not know anything about that area. Sunita, would you know?

Human-specific behavior: if you look at the WWW conference, it depends on what kind of user modeling you are trying to do, but if you are talking about modeling user behavior for web browsing, say, so that you can find out what a user is trying to do within a session, then over the years some track or other at that conference has had papers on such modeling. I am not familiar with this area myself; I just remember seeing such track titles at the WWW conference. So you can look at the set of papers published each year, search for "human" or "user", and look at the papers which address some aspect of human modeling. Thank you, madam, thank you very much.

We have Bansal Institute, please go ahead. My question is: what verification methods for research findings in technical research, other than mathematical proof, are acceptable as research efforts?

This is a deep question: what is a research effort and what is not? That is very hard to answer in general. To me, research is anything where you have some general finding; but depending on the venue where you submit your research, the expectations of what is novel vary, so I do not know if there is a generic definition across all areas. In computer science, in the old days, research was primarily about coming up with new algorithms for doing things, but other kinds of work may also be acceptable. For example, there are a lot of algorithms, and finding out which algorithm is the best fit for a particular domain might be considered acceptable research in many venues. As another example, there are many tools for teaching now; if you do a study of which tools work well for teaching students, that may not count as research at a computer science conference, but at an education conference it probably would. So what counts as research partly depends on where you are targeting it. There are certainly conferences which look at these things, and it is important: when you decide to use a particular tool, you might do it by gut feeling, but your gut feeling may be wrong. We put all these fancy tools in place, and in the end the students' learning may not be helped at all; so was it worth it? These are the kinds of things people do study, and it is a valid topic for research. I do not know if that answers your question; feel free to ask follow-ups.

Sir, I have a few more things to ask. In fact, my prime concern is this: I am talking about technical research. Suppose, to take an example, we are working on some database design, and we propose some model or design. What initial proof do we have to put forward to get it accepted by the research community? Take data warehousing: suppose we claim that the involvement of the user is very important for successful data warehouse development. What background work do we have to show? It is not always a mathematical proof, because in industry they are working on practical things, while in academia we are doing things theoretically. So how do we bridge that gap? That is my prime concern; I would be grateful if you could give some idea on that, sir.
So, in some of these domains, say you ask what the importance of user input is in data warehousing. If you are just doing a study with one particular implementation, it is very difficult to say anything much; it is hard to draw conclusions. But in the software engineering domain these kinds of studies are common. People propose a particular methodology, and you can propose many different methodologies for doing things, but how do you compare them? One way is to study projects and see how successful they were, and to get feedback from the people involved about what they felt were the contributing factors. This kind of human feedback might help establish that taking user input in a particular way is important for the success of such projects. In software engineering I have seen papers like this; it is not an area I am very closely familiar with, but I know there is work which looks at such issues, and it is considered valid research in that area. It is not mathematical modeling: it takes certain ways of doing things and studies how successful they were across a variety of projects. In general, for anything like this, you want to establish that it has value beyond one project, so you are more or less required to do the study across many cases, so that you are talking about something with general applicability in a wide variety of scenarios; that would be considered publication-worthy. If you did something very specific to one project, and it is not clear how to use it in any other project, that may not necessarily count as research. I do not know if that answers your question better.

One last question, regarding the significance of the white papers available on the net: how authentic should they be considered? I have seen this kind of material carry lots of important content, lots of technical coverage. Are they considered an authentic reference in the case of research?

That is a good question. White papers are typically put up by companies when they have a product and want to help you understand what the product is about, and sometimes they also compare it with alternatives. So these are papers written by companies. They can help you understand what is going on, but if a company claims in its white paper that its methodology or its tool is better than somebody else's, you of course have to question who wrote it. This is unlikely, I have not seen too many white papers doing such things, but suppose you want to cite a white paper for the claim that product X is better than product Y: that may be questionable. If, however, you are referring to the white paper to explain how a particular tool does something, that is perfectly fine. So they are a useful source of information, and they can be cited, that is not a problem at all; but they are not the same as a research paper written by an independent party. Then again, most research papers are not written by independent parties either: most research papers propose some technique and then proceed to compare the authors' own technique with others. So every one of them has to be taken with a certain pinch of salt. Obviously their goal is to show that what they are doing is better; obviously they will look at scenarios where their technique beats the others. Does that mean the technique will beat the others in all scenarios? Maybe not.
So any paper at all has to be read with a certain pinch of salt, and that is an important part of research, actually: if you just accept everything at face value, you are not questioning things enough. But of course, your questioning should be done in a way that leads to something new and interesting which you can publish; perhaps that would be your goal as a researcher. I do not know if that helps. Again, thank you very much, sir.

Shanmugha College, Tamil Nadu. What database administration tools are available on the research side, and which is most used in research?

Database administration tools: from the research context, the tools developed by researchers are things which automate administration. The goal is to not have humans doing the administration; let the system configure itself to the largest extent possible. Many of the tools you see today, the ones we talked about for index tuning, materialized view tuning and other related things, all came out of research. Researchers said: we do not want highly paid administrators doing this; we want the system to do it more or less by itself, with minimal human intervention. How do you do this? A lot of interesting research was required to make those tools work properly and efficiently. As for tools which researchers themselves use in order to do research, I am not sure there is anything specific in that sense; the tools that come packaged with the databases were developed by researchers initially, and perhaps by other developers later. Does that answer your question, or is there some other aspect to what you asked? Thank you.

We have Sarvajanik College, Surat. Please go ahead. Sir, has there been any research effort in the direction of in-memory indexing or hashing for the intermediate results generated during join queries, in the context of query optimization?

In the context of... what was the last thing you said? Query optimization. Okay. I do not know that there is much of a research angle left in the basic case: the in-memory indices used for intermediate results have primarily been hash indices, as long as the joins are simple equi-joins. However, if you look at more complex join conditions, it is different. In fact, it is interesting that you ask this, because right now I have an M.Tech student and a Ph.D. student looking at exactly this kind of problem.
How do you build indices where the join predicate is of a particular form? The tuples have patterns: an attribute can hold an exact value, or a value which says all possible values are acceptable, a star. Now, when you do join processing, you have a number of join attributes and a number of relations, and the tuples in the individual relations can have a mixture: one tuple may say (A, B), another may say (A, *), another (*, C). How do you do joins in this context? That is something we are currently working on; there may be other work in this area, we have not yet looked at it in depth. But yes, if you are looking beyond simple equi-joins, there are surely interesting issues here.

The other aspect is in-memory indexing not for intermediate join results but in the context of in-memory databases: what indexing technique works best for data which is already in memory, which you do not have to load from disk? B-trees were designed for disk-resident data. There has been a lot of work on this for many, many years, and there is still ongoing work; some of it keeps changing because the properties of memory have changed, the way cache lines are loaded, caching in general, there have been a lot of hardware changes along the way. Also, main-memory databases have now become not only feasible but common in the real world: SAP has released a database called HANA; Microsoft, I believe, is also releasing a version of SQL Server tailored for main memory; and there are many others, I believe some 20 or 40 different main-memory database implementations are available. There is an interesting talk by a friend of mine, Rajesh Manjramani, who showed me a copy of a talk he gave somewhere, covering this. So there is a lot of work on indexing in main memory; it is still an active area, though it is a little hard to do brand-new things there, because so much has already been done.
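A minimal sketch of the standard in-memory treatment of equi-join intermediate results mentioned above: build a hash table on one input once, then probe it with the other. The data is illustrative.

    # In-memory hash join over two intermediate results
    from collections import defaultdict

    # Two intermediate results, as (join key, payload) tuples
    left = [(1, "a"), (2, "b"), (2, "c")]
    right = [(2, "x"), (3, "y"), (2, "z")]

    # Build phase: hash one side (ideally the smaller) on the join key
    table = defaultdict(list)
    for key, payload in left:
        table[key].append(payload)

    # Probe phase: stream the other side through the hash table
    joined = [(key, l, r) for key, r in right for l in table.get(key, [])]
    print(joined)  # [(2, 'b', 'x'), (2, 'c', 'x'), (2, 'b', 'z'), (2, 'c', 'z')]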
Okay, the last two hands. Finally we have the very last question, from, what is this, Alluri Institute, Warangal. Yes sir: which is the most suitable dataset for stream mining, sir?

That is a hard question. I do not know exactly what datasets are available in the context of streaming. There are a lot of applications of streaming data; there is the sensor data, for example. I know that at Berkeley, Mike Franklin and others, if you just look at their group, have been working on sensor data, and at least five years back there used to be sensor datasets available which you could use for streaming applications. But again, I think the Google Scholar trick works. If I had to look for streaming data, I would search for Mike Franklin's name and look at his research papers, many of which are on various kinds of stream query processing models and other things to do with stream data. Then look for forward pointers to those papers, look at some of the recent papers on stream data mining, and see what datasets they used. Very often researchers use datasets which are publicly available, and those might then be very useful for you too.

Is stock market data suitable for stream mining, madam?

Stock market data is a particular kind of data; I think it has some specific properties which may not hold for other stream mining applications. You can certainly use stock market data, but it would be interesting to also explore some non-stock-market data, because stock market analysis is a very, very stylized kind of application.

Sir, one more question: is there any other open-source tool for data mining, apart from Weka? That question will go to madam, not sir.

Apart from Weka, there is a statistical tool which is very popular in universities, called R. (System R was a database system; this R is different.) It is a command-line tool; I have not used it myself, so it is not totally at the top of my head, but it is a statistical analysis package. And, as Professor Sudarshan mentioned earlier, there is Mahout, the Apache data mining toolkit, which is also supposed to be scalable and is freely available. So those are options; but in general, kdnuggets.com is a very useful website to visit, and if you click on their software tab you will find pointers to lots of other freely available software.

Are there any research issues in object-oriented or object-relational databases?

Research issues in object-oriented and object-relational databases were hot a long time ago. Again, I am not sure there are brand-new research angles which have come up recently: for any area that was examined in great detail a while ago, you need some new angle to come up with new research. The object-relational mapping systems, on the other hand, have taken off a lot recently, and they did not receive all that much attention in the research community, so there might be some research angles there. In fact, one of the things we are doing is a project that attempts to optimize database access from applications. The idea is this: traditional database optimization optimizes SQL queries, but if applications do bad things and issue millions of queries, there is nothing the database can do; it can run each query fast, but if there are millions of queries, things are going to be slow. On the other hand, you can rewrite the application to turn those million queries into a far smaller number of queries, which will then run much, much faster. How do you automate this rewriting of applications? That is an area we have been looking at: we take any Java program, look at its JDBC calls, and rewrite it to optimize its database access. It has been a very interesting project; my last PhD student, who graduated some years ago, did a fantastic PhD on it, I have another student working on it, and several master's students; that area has worked out very well. Relatedly, people are also using object-relational mapping tools like Hibernate, so we also did some work on taking a Hibernate program and optimizing its access to the database; we did a little bit of work there, and there is probably more to be done. And there are certainly issues now that these have become real systems: people worry, for example, about how to ensure that Hibernate's concurrency control mechanism does not cause a mess somewhere. There are a lot of practical problems people have with these systems, efficiency being one of the major factors, and these could in turn lead to research problems in this sub-area.
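A minimal sketch of the application-rewriting idea just described: the same result computed first as the one-query-per-row anti-pattern, then as a single set-oriented query. The schema and data are illustrative.

    # Rewriting many small queries into one set-oriented query
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE marks (student_id INTEGER, score INTEGER);
    INSERT INTO students VALUES (1, 'Asha'), (2, 'Ravi');
    INSERT INTO marks VALUES (1, 80), (1, 90), (2, 75);
    """)

    # Anti-pattern: one query per student (N+1 round trips to the database)
    slow = {}
    for sid, name in conn.execute("SELECT id, name FROM students"):
        total = conn.execute(
            "SELECT SUM(score) FROM marks WHERE student_id = ?", (sid,)
        ).fetchone()[0]
        slow[name] = total

    # Rewritten form: one join-and-aggregate query does all the work
    fast = dict(conn.execute("""
        SELECT s.name, SUM(m.score)
        FROM students s JOIN marks m ON m.student_id = s.id
        GROUP BY s.name
    """))

    print(slow == fast, fast)  # True, with Asha -> 170 and Ravi -> 75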
Thank you, sir. Thank you very much. Okay, thank you very much. I think we will call it a day. Thank you for staying back so long, beyond the official end, which was 6 o'clock. We will see you tomorrow then.