Good evening everyone, and welcome to the first of two sessions on research and databases. As I said earlier, my goal is to have an interactive session, but I thought I would start with a few slides that give my view of how to go about doing research. There are of course many other views; each researcher has their own. So you should talk to other researchers, find out which view you agree with, and perhaps follow that. I will take the first five to ten minutes to put up a few slides, and then we will have a lot more discussion.

But before I start, maybe I can take one or two questions about research, just to get a feel for what people are looking for in this session, what you want out of it. So if anyone is ready for interaction, I will pick up a few.

We are with the eBit group, Tamil Nadu. Sir, if we are doing research, we may have to extract a lot of data from the database. Which kind of language is efficient for extracting data?

So the question is: supposing you want to extract a lot of data from a database, what language is efficient? Now, that is not necessarily what research is about. Research is about doing something innovative; the amount of data you need for it could be very small or very large. Depending on what kinds of things you want to do, plain Java and JDBC with SQL is good enough for many, many things, and that has been the focus of most of what I have done. For certain things we have used C++, and these days, to deal with big data which spans many machines, there are tools we use; in particular, Hadoop is something we are going to cover later in this course. So you have to use the right tools for whatever job you are doing. Luckily, there are lots of very good quality tools in the open source community, used by industry as well. You just have to pick the right tools once you know what specific area you want to work in.
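To make the JDBC route concrete, here is a minimal sketch of the kind of extraction code I have in mind. The connection URL, table, and columns are placeholders for illustration only, not a specific system we use:

    import java.sql.*;

    public class ExtractData {
        public static void main(String[] args) throws SQLException {
            // Placeholder connection details; any JDBC-compliant database
            // (PostgreSQL, MySQL, ...) is accessed the same way.
            String url = "jdbc:postgresql://localhost:5432/researchdb";
            try (Connection conn = DriverManager.getConnection(url, "user", "password");
                 // A parameterized prepared statement, rather than string
                 // concatenation, avoids SQL injection problems.
                 PreparedStatement stmt = conn.prepareStatement(
                         "SELECT id, value FROM measurements WHERE value > ?")) {
                stmt.setDouble(1, 100.0);
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getInt("id") + "\t" + rs.getDouble("value"));
                    }
                }
            }
        }
    }

For most research prototypes, this pattern of issuing SQL and iterating over a ResultSet is all you need; the heavy lifting stays inside the database.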
We are with Pyagraja College, Madurai. Sir, normally in authentication we use a username and password. Is it possible to increase the security by adding location as well? Is there any possibility there?

Yes. There are many tools out there to improve the security of authentication. Username and password is actually very weak, although it is the primary mode used today. There are so many ways in which it can be attacked: people can guess passwords, and people can install keystroke loggers on computers, which record the passwords you have typed in. But people have also worked on solutions to some of these problems, including sending a one-time password by SMS, which is widely used in India now for credit card transactions.

There are also ways of checking where you have logged in from. Consider the IP address from which you have logged in: if the bank knows that you normally log in from a certain range of IP addresses based in a certain location, and you suddenly log in from somewhere else, that is probably an alert. This is where location comes in. So if the State Bank of India gets a request to transfer money from China, it should probably take some extra precautions to make sure the request is not fraudulent. What can it do? There are various things an application could do, such as totally blocking requests from a certain area, or allowing them but subject to limitations on what kinds of things you can do. So the location can be inferred from an IP address in such cases.

There are many more mechanisms, such as hardware tokens which generate a new number each time depending on the time: if you do not have that physical device, you do not have the correct password to be used at that point in time. So overall it is clearly a major problem. Is it a research area? Yes, although it is not a database research area; people have looked at it in other communities. For example, people have talked about biometrics. Biometrics is very good when you go to a place in person, say a bank or a ration shop, and they want to identify who you are. But if you are doing it over the net, biometrics does not work, because it is vulnerable to recording and replay: if somebody sends a biometric over the net, there is no way to verify whether it is a genuine person or somebody replaying a recording. So there are many issues here, and there is a whole community looking at this; I am not very familiar with the details.
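Since hardware tokens came up, the idea behind them is simple enough to sketch. Time-based tokens, in the style of the TOTP standard, derive a short code from a shared secret and the current 30-second time window, so the server can recompute the same code and check it. Here is a minimal sketch with a made-up secret; a production implementation would follow the standard exactly:

    import javax.crypto.Mac;
    import javax.crypto.spec.SecretKeySpec;
    import java.nio.ByteBuffer;

    public class TimeBasedToken {
        // A 6-digit code from a shared secret and the 30-second time window.
        public static int code(byte[] secret, long unixSeconds) throws Exception {
            long window = unixSeconds / 30;              // changes every 30 seconds
            byte[] msg = ByteBuffer.allocate(8).putLong(window).array();
            Mac mac = Mac.getInstance("HmacSHA1");
            mac.init(new SecretKeySpec(secret, "HmacSHA1"));
            byte[] h = mac.doFinal(msg);
            int off = h[h.length - 1] & 0x0f;            // dynamic truncation
            int bits = ((h[off] & 0x7f) << 24) | ((h[off + 1] & 0xff) << 16)
                     | ((h[off + 2] & 0xff) << 8) | (h[off + 3] & 0xff);
            return bits % 1_000_000;
        }

        public static void main(String[] args) throws Exception {
            byte[] secret = "demo-shared-secret".getBytes(); // hypothetical secret
            System.out.printf("%06d%n", code(secret, System.currentTimeMillis() / 1000));
        }
    }

Without the physical device holding the secret, an attacker who has seen one code learns nothing useful about the code for the next time window.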
Any other research-related questions you want to ask?

Sir, I am working on temporal databases. Is there any website where I can find time series data sets?

Let me generalize that question: whatever area you are working in, it is very important today to get data sets in that area, and that is something people are looking out for all the time. So where can you get good data sets for such research? That is not an easy question to answer in general. In the data mining community there are many data sets which have been put up by various people. There is a site called kdnuggets.com which provides sample data sets for many applications; in fact, they do not provide the data themselves, they provide links to other sites that do. Recently I heard about a startup called Kaggle. It is a company which gives you a data set and a problem, and you come up with solutions for predicting something, so it is specifically about data mining and prediction. They give you training data, and you can then test whether your predictions are correct on separate test data. So that could be a source of data sets; I have not tried it myself. Beyond these, I cannot give you a generic answer, and specifically for temporal data I do not know. But the good thing today is that Google is the great equalizer: you can search on Google just as well as I can, and of course it is not just Google, any search engine. If you spend enough time trying different queries, you will generally hit upon a few interesting data sets. It may not always work, but it often does. So I do not have a specific answer, sorry.

Sir, what are the implications of applying data analytics to a database? Is there any tool for doing data analytics?

Data analytics is a general area: you have data, and you want to analyze it to come up with interesting conclusions, which you can then use to drive your business. It is a very big area, both in terms of commercial prospects and in terms of the research that has gone into it, and it has many sub-areas. The earliest form was online analytical processing, which lets analysts view various aggregates on the data and look at it in different ways: you can drill down, roll up, and so on; there is a whole set of operations which these tools provide to let you look at aggregates in different ways, and eventually figure out from that what is going on and find interesting things which could help you in making business decisions. Later this moved to data mining, and people have been applying data mining algorithms to find patterns which could be useful in making business decisions. So this is a very rich area, and there is still a lot of scope. There has been a lot of work on data mining; by no means is the area new, it has been around since the mid-90s, and in fact it has an older history in the machine learning community in AI, which is much older. But as of today it is still a rich area, in the sense that there are many problems which have not been studied in great depth, so there is still potential for interesting work to be done.
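To give a feel for what roll-up means concretely, here is a small sketch using the standard SQL ROLLUP construct through JDBC. The sales table and its columns are invented for illustration:

    import java.sql.*;

    public class RollupDemo {
        public static void main(String[] args) throws SQLException {
            String url = "jdbc:postgresql://localhost:5432/salesdb"; // placeholder
            try (Connection conn = DriverManager.getConnection(url, "user", "password");
                 Statement stmt = conn.createStatement();
                 // ROLLUP produces subtotals at every level: per (region, year),
                 // per region across all years, and a grand total -- the kinds
                 // of aggregate views an OLAP tool lets analysts drill into.
                 ResultSet rs = stmt.executeQuery(
                         "SELECT region, year, SUM(amount) AS total " +
                         "FROM sales GROUP BY ROLLUP(region, year)")) {
                while (rs.next()) {
                    System.out.println(rs.getString("region") + "\t"
                            + rs.getString("year") + "\t" + rs.getLong("total"));
                }
            }
        }
    }

In the output, a NULL in the year column marks a per-region subtotal, and NULL in both grouping columns marks the grand total.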
I think what I will do now is go through the slides I have, and then we will have a discussion again; maybe this will help in some way.

So, what is research? The Oxford English Dictionary definition is "the systematic investigation into and study of materials and sources in order to establish facts and reach new conclusions". That is a long sentence, but there are a few key things to note. The first is that you want to reach new conclusions: you want to do something new. The second is that this is a very generic definition across many areas of research, ranging from the humanities to the sciences. So what parts of it are relevant to computer science? For example, it says "study of materials and sources". Materials are not so important for us, and sources are important only in the sense of finding ideas, whereas in the humanities the sources are, in many cases, what the research is about.

So, from the perspective of computer science at least, what is research? It could be many things, but here are a couple of attempts to say what it could be. One thing research could be is to solve a problem in an innovative way: here is the problem, here are some solutions which work, but we are not happy with them; let us find a new way of solving it. That is research, but of course a new way does not necessarily mean a better way. So along with coming up with an idea of how to solve a problem, you also have to compare it with things that have been proposed earlier for the same problem. Sometimes the comparison can be qualitative, but more often it needs to be quantitative, to show that what you have done works.

Another way of doing research is to read up on what is there and find an existing published solution to some problem. If you read most papers carefully, you will realize that in order to solve the problem, they make a lot of assumptions. Most research papers make lots and lots of assumptions to simplify life. So a rich source of research is to look at these papers and ask: what simplifying assumptions have they made which we could perhaps generalize? If you remove some assumption, how can we extend the solution to handle that case? So this is what I would call improving on existing solutions to extend them. This can also be research. There are many more definitions of research, but let us see how to go about the process of research.

I would say there are three key steps in this process. The first is to find a research area or areas. That is the very first step anybody starting a PhD needs to take: what area to work on. Now, many times there is a shortcut, which is to find a guide and then work on whatever problem the guide suggests. Of course, then it is the guide's problem to find a research area. But regardless of whether you have a guide to point you to a current area or not, you should explore; part of doing a PhD is to explore. And by explore I mean find out what people are doing, what is going on in the world of research.

So how do you do this exploration? The typical way is to find out the areas of interest in recent years. It does not have to be this year or last year; maybe the last few years. Why do I say this? Because anybody who reads a textbook on operating systems gets thrilled by some of the ideas there, and does not realize that those ideas are 30, 40 years old. A lot of research happened in that era, and it is very hard to come up with brand new ideas in that area. Databases is a somewhat younger area, but even in databases there are now many things which were looked at in great detail many years ago, and it is a lot of work to figure out what you can do that is new there. So a shortcut is to find out what areas people are looking at currently, in the hope that some of those have not been explored in as much depth as an area that is 30 years old.

That said, what often happens is that an area was explored 30 years ago, but not in depth, and new issues have come up which make it worth revisiting 30-year-old areas, things you study about in basic textbooks. You might want to relook at some of these in the light of something new. Take query processing techniques, or an even simpler case, the storage and indexing techniques we are going to look at tomorrow. They were all designed for disk-based systems, and a few for main-memory databases. These days there is flash memory, which is now very widely used. So a lot of people looked at this new technology and asked: how will it affect all the solutions people have come up with over the years, and can we change what we do? More recently, there have been proposals for other kinds of memory, called storage class memory, which are persistent but give an interface much like random access memory, and people have been looking at how to build storage structures for these kinds of memory. Flash has certain properties; these have certain other properties. So revisiting old areas in the light of new technologies is definitely a good way to find a research area. If you read papers on these topics, you will find that people have looked at some of these areas, and if some area excites you, it is probably worth looking at in more detail. So, where do you look for these papers?
Where do you go and start searching? A good heuristic is to look at the top conferences in an area. If you focus on databases, some of the top conferences are the ACM SIGMOD conference; the Very Large Databases, or VLDB, conference; the IEEE International Conference on Data Engineering, ICDE; and then EDBT and ICDT, which are European conferences; and there are many, many more. These, I would say, are tier one, the hardest to get papers into, and you do not necessarily want to try to get a paper into them right away, until you have got a footing in the area. But there are also tier two and tier three conferences, the next levels down, which are not as hard to get papers into but do have interesting papers, good papers worth reading to see if you can get ideas to follow up on. These include DASFAA, and the COMAD conference, which we hold in India each year. COMAD is an international conference with participation from all over the world, but it is held in India, which makes it easy for everyone here to attend, and the costs are kept very low. I am involved in it, and we make sure to keep the registration fees very low, so everybody should be able to attend; the cost is more or less the cost of travel, plus a hotel in the case of faculty; for students we even arrange hostels to make it cheaper still, or free.

So it is good to attend these events to find out where the buzz is: what are the top people in the area saying about what they think are interesting problems to look at? Somebody asked me before the talk what the interesting research areas are. There are many, many areas, and what you should do is look at some of the keynote talks at these conferences, where people talk about what they think are interesting areas to look at. There are also online talks, some from these conferences and some from other places, where people have talked about a research area, and it is worth looking at those.

Another way is to more or less ignore the published literature, go find real-world problems, especially ones which are not solved satisfactorily, and then try to abstract them so that there is some hope of solving them. How do we do this? This is actually a little harder, because the line between development and research is not very clear. There are many real-world problems which do not lead to new research; they lead to new tools which you can build to solve them. But is that tool a research contribution, or is it just a tool you built? It depends. There are tools which are sufficiently innovative that you would count them as research contributions, and there are tools which are straightforward. You build a web app; that is probably not research. But having built it, if you see how to stress it, how to scale it, how to solve new problems that seem very hard to solve efficiently, then usually any real-world problem you pick can be stretched to come up with research. The trick is to build a system but keep your eyes open as you build it. You use existing tools, but you keep your eyes open and say: hey, this did not work very well, or this is not good enough for some scenarios I anticipate; maybe it is good enough for the current problem I am trying to solve, but in other cases it may not work. Or: hey, I just hacked up some solution here.
I came up with some solution quickly so that I could build the tool, but maybe there are other, better solutions; maybe I should study alternative solutions, compare them, and see which works. That is research. The solutions do not have to be completely brand new; they could be extensions of existing solutions. But applying them in a certain problem domain, seeing how they perform, and comparing them could be a useful piece of research. It may not make it into a tier one conference, but it would make it into a tier two conference.

So finding a research area is step one. Step two is learning a lot about the area. In fact, steps one and two are iterative, and I will tell you why. The key to doing research is to read voraciously: read lots and lots of papers, everything you can lay your hands on in the area, and see how other people have thought about the problem. There is a saying that when you do a bachelor's, you know nothing about everything: you are very broad. When you do a PhD, you know everything about nothing: your area is very narrow. What happens is that you start with a broad area, find a sub-area which looks interesting, and then read more and more in that sub-area till you have read all the papers there are in a very narrow sub-area, and now you are an expert in it. In this process you might find interesting things to do in that sub-area, and all the effort you put in has paid off. But it may not always work out immediately; you may make a few iterations. You may go back one step up and say: okay, let me look at this other sub-area and dig deeper into that.

The next question is: how do you learn more about an area? I said you can go read conference papers. That is certainly one way, but sometimes it is confusing; there are lots of papers, and you do not even know what a paper is about based on its title. So how do you find the papers relevant to the area you are looking at? One way is to read a few interesting papers, and if you see papers they refer to which are in the same area, read those: go backwards. These days there is also an option to go forwards: if you go to scholar.google.com, it will tell you which other papers cite a given paper. So if you have found a paper you are interested in, which is say three years old, and you want to know whether there has been more work in the area, you see what papers have been citing it, and by browsing the titles you may quickly find a few new papers in the area which are relevant. Going back and forth like this, reading cited papers and reading papers that cite the papers you started with, you can find a number of papers in an area and get a feel for the area.

Now, when you read a paper, you should read it in two passes. The first pass is to read it quickly, to get a feel for what the paper is about; you may not understand all the ideas in it, but you will get an idea of what it is trying to do. Then you can read it in a little more detail. And if you decide it is a paper in an area you want to work on, you should read it in great detail and understand all the nitty-gritty issues the paper addresses. This is important, because it will help you, when you start doing research, to understand the issues you need to work with.
Another way is to find courses which have collected together interesting papers in an area, so that somebody has already done this work for you. There are many sources: many universities have courses and put up reading lists online. For example, I teach a course called Advanced DBMS here, which is based entirely on reading research papers. I cover a number of different areas to give students a feel for many areas, but I tend to focus more on areas I am interested in. The latest version of 632 is on the web; the links are there from Moodle, in the last section of the Moodle page. In the most recent version, I have collected together a number of papers on query optimization and big data. So we will see Hadoop coming up in this course, but if you want to go beyond Hadoop, there are research papers covered there.

As you read those papers, you should also be thinking about innovating. There are people who love to read papers and understand them really thoroughly, but then they get stuck; they do not know what to do next. They say: look, I have read all these papers and I have understood them, but I do not know what to do. A PhD requires publication, and publication requires you to do something new. So how do you innovate? The key is that as you read papers, you should always have an inquisitive mind. Do not ever accept things as they are stated in the paper. People always try to act as if they have solved a problem, but most of the time, most papers have not fully solved any problem. There are always gaps; there are always assumptions which people make to get the paper out. In the real world, any problem takes a lot of effort to solve, and it is an incremental process: somebody solves some part of it and writes a paper; somebody else reads that, gets ideas, and solves another part; and slowly it builds up to the point where there is enough knowledge to solve a real-world problem.

So what you need to do as you read a paper is find out its limitations. What are the flaws? They claim the technique works satisfactorily, but maybe it does not in some cases. If you read what I consider a really excellent paper from Google, on Bigtable, it has a whole bunch of wonderful ideas. At the end of the paper they say: we do not support secondary indices, because we do not think anybody needs them (but maybe a few people do); we do not support transactions, because we have not seen enough demand for them at this point. That is a clue right there that those are interesting areas to follow up on. And of course Google itself followed up: they have subsequent work which looks at exactly those things the first paper claimed were not all that important. They were important, and Google did go ahead and solve them in subsequent papers. So when you read a paper, look at its beauty, but also look at its flaws; see if you can pick holes, and see what you can do to address those holes.

Now, when you come up with ideas for addressing these things, sometimes it is a brilliant new algorithm which you can publish. But most of the time that is not what happens. Most of the time you come up with a few ideas which could be useful, and you have to show that they actually work, in some cases at least. This is very important.
If you read most papers, there are a few key ideas in the paper, and the rest of the paper, which may be 10, 12, even 20 pages, spends a lot of space building on those core ideas: laying the groundwork, presenting the ideas, seeing how to apply them in specific cases, and then doing a performance study which shows how those ideas compare with ideas proposed earlier. This is a key thing in the systems area, and databases is very much a systems area. For most papers, you have to implement something and show that what you are doing works better than what others have done earlier. This is key.

Now, all of you are faculty, which is in a sense good, because you have access to a resource: students who can help you with some of these implementation issues. If you get good students and get them excited about something, they will do a very nice job of implementing it. In fact, some of the good students will also ask good questions, which will help your research by pointing out what can be done. So by all means involve your students in your research. Give them credit where it is due: if they did work, make them co-authors or acknowledge them in the paper. That will motivate them, and future batches of students, to work in these areas. So do use this wonderful resource which all of you have to build up your own research.

Having said what you should do, I should also mention a few things you should not do; avoid these temptations. The first is plagiarism. Many people are tempted to take ideas they have read somewhere and write them up as if they came up with those ideas. It is always a bad idea; it is immoral. But while people could get away with it in an earlier era, these days there is no getting away with it. There are tools on the web which look for duplicates, and people do use those tools. Equally important, there are tools to find papers in an area. I subscribe to some keywords on Google Scholar, and whenever it finds papers it thinks are relevant to my work, it sends me links to them. I do follow up and see what is there, and sometimes I say: hey, wait a minute, I have seen exactly this published earlier, because it is an area I look at and I know the prior work. Somebody has written a new paper which claims to be new, but it is exactly what was done earlier. Sometimes I have to read it in detail to realize this, but more often than not, on a very initial reading, you skim the paper and realize it is identical to something you have seen; you Google a few key phrases from the paper, and lo and behold, the original paper pops up, and you know it is plagiarized.

Today there is not sufficient punishment for plagiarism in India, but it will come; you can never hide from it. Recently in Germany there was a minister for higher education or some such post, who was thought to be in line eventually to become prime minister or president in due course. And this is interesting: in Germany, apparently, most of the politicians have done PhDs, including the current chancellor, Merkel, who has a PhD in biochemistry and was also viewed as a good researcher.
But this particular person had copied something, and 20 years after his PhD, somebody found out that he had plagiarized it and he was forced to resign. His career hopes of becoming prime minister or whatever were all dashed after that. So it will come back and bite you, if not today then 20 years from now, because everything is out there: it is indexed, it can be searched, and it will come out. You do not even need to file RTI queries; it is already indexed by various search engines. So it is not a good idea to plagiarize today, because you will get caught.

The other temptation people have is to publish articles in journals which will publish anything, absolutely anything. They do not care what you have submitted. They will call themselves the International Journal on XYZ, and what is in it for them? You have to pay money, and then your paper gets published in that journal. All they care about is getting money from you for publishing: they will say "open access", and charge so many thousands of rupees, 10,000 or 20,000 rupees, for it. You publish a paper there and put it in your resume, and maybe you get away with it. But as you advance in your career, eventually somebody will find out that you have been publishing in journals which have a very bad reputation, journals that publish anything at all without checking, and that will eventually hurt your reputation. If you want to do well as a researcher, avoid this, because I have seen it happen to some good researchers. We had an interview where a candidate obviously knew what he was talking about; he seemed quite intelligent. But several of the candidate's publications were in some of these fake journals, and that immediately lowered our estimate of the candidate. When we asked the candidate why he published in such journals, he said: well, my advisor told me to. So here was an advisor who had actually spoiled the candidate's chances of getting a good job, because people recognized that some of the publications were in really bad places. So again, avoid this temptation. These things are known informally today, but sooner or later there will be whitelists and blacklists, some of these venues will show up in the blacklists, and people will check against them.

Okay, that is it from me; now let us take questions. We have NPR College; please go ahead.

Sir, we are very happy to hear this topic related to research areas. We have one question: what are the emerging areas of research in DBMS?

So the question is what I think are some emerging areas in database research. There are many interesting areas; I will tell you about some of the areas I am looking at, because I am familiar with them. One is big data. There has been a lot of work on big data, and what has happened in that area is that, like databases did, it started off with file systems and programs, then moved to parallel processing systems based on Hadoop that work on data in file systems, and it has now evolved back: relational abstractions have come back into this area, query languages have come back, query optimization has come back. But I think there are still interesting things to do on how to actually run queries in such a setting. And more important, given that you have a parallel processing infrastructure, how do you solve problems using this infrastructure which you would perhaps have solved in a different way earlier on?
Already, in the context of data mining research, people are taking all the old data mining algorithms and asking: can we make them work efficiently on Hadoop and similar parallel processing infrastructures? The same kind of question can be asked in many other situations. How do you do view maintenance? How do you partition a database to make it scalable, along with all the other goodies you want in databases: recovery, concurrency control, and so on? So it is a current area of interest, and we are working specifically on a query optimization angle to it.

A second area of interest is unstructured or semi-structured data. There is a lot of data in text files, web pages, and so forth. But more recently there has been a lot of work on RDF, which is basically a data representation format that you can think of, in some sense, as a successor to XML. XML was tree-structured; RDF is an arbitrary graph structure. You can think of it as a graph with nodes and edges, or you can think of it as a collection of facts. A fact is a triple: subject, predicate, object. The subject could be a person, identified uniquely in some way; the predicate could be "name", and the object the name of the person; or the predicate could be "institution", and the object an identifier for the institution where the person worked. So think of it as the entity-relationship model: you can think of the world as having a number of entities and relationships between them. People have generated a lot of RDF data from many sources: some from relational data sources, some by extracting information from text. There is a huge amount of data out there, billions of RDF tuples. How do you run queries on it? How do you process queries efficiently? The issue is that if you think of it in relational form, where each predicate is a relation, then there are millions of relations. Normal relational database techniques do not scale to millions of relations; but the data is not unstructured either, it is partially structured. So there has been interesting work in recent years on how to index such data and process queries on it, and I think there is a lot of future here. As queries grow more complex, how can you process them on these kinds of data? So I think that is a rich source of problems.
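To make the representation concrete: one common approach, and only one of several, is to store all triples in a single three-column table and answer graph patterns with self-joins. The table and data here are invented for illustration:

    import java.sql.*;

    public class TripleStoreDemo {
        public static void main(String[] args) throws SQLException {
            String url = "jdbc:postgresql://localhost:5432/rdfdb"; // placeholder
            try (Connection conn = DriverManager.getConnection(url, "user", "password");
                 Statement stmt = conn.createStatement()) {
                // One table holds every (subject, predicate, object) fact.
                stmt.execute("CREATE TABLE IF NOT EXISTS triples (" +
                             "subject TEXT, predicate TEXT, object TEXT)");
                // Names of people whose institution is 'IIT Bombay': a
                // two-edge graph pattern becomes a self-join on subject.
                try (ResultSet rs = stmt.executeQuery(
                        "SELECT t1.object AS name " +
                        "FROM triples t1 JOIN triples t2 ON t1.subject = t2.subject " +
                        "WHERE t1.predicate = 'name' " +
                        "  AND t2.predicate = 'institution' " +
                        "  AND t2.object = 'IIT Bombay'")) {
                    while (rs.next()) {
                        System.out.println(rs.getString("name"));
                    }
                }
            }
        }
    }

Every additional edge in the query pattern adds another self-join on this one huge table, which is exactly why indexing and query optimization for billions of triples is a challenge.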
Then there are keyword queries on graph-structured data. That is a project we have had for many years, and there has been a lot of work in the last 10, 15 years in this area, but I think there is still some work left when you deal with web-scale processing. Then data mining is a very big area with many sub-areas; I will not talk about it, because it is not my research area, but I am sure there are many people here working in data mining related areas. It is very closely related to databases, and I know there are plenty of areas waiting to be explored there. Within databases, what other areas are active? There are many. If you go to the proceedings of any recent conference and look at the session titles, you will find what is going on: how to do transaction processing in a highly parallel way; how to do transaction processing efficiently across multiple machines, in particular with main-memory resident data.

And then there are some other areas which are emerging. For example, testing. People build database systems: how do you trust them? How do you test that they are actually working and doing what they are supposed to do? People have built database applications: how do you trust that they are working correctly? What if they are giving you wrong results, and you make wrong business decisions based on those wrong results? How do you know the application is doing the right thing? How do you test these applications? How do you generate test data for testing them? That is a rich area, and also an area where we have done some initial work; I think I mentioned it earlier: how do you check whether a student's query is correct by generating multiple data sets? That is part of this broader area of generating test data, and of approaches to testing applications and database systems. There are many more areas; I will not try to list them. Go read them up by browsing conferences. Does that answer your question?

Thank you, sir. Another question, sir: what are the different methodologies adopted for transactions in real-time databases?

So, about real-time databases. You can think of real-time in different ways. One way is that you want results reasonably quickly; that is what many people think real-time means: you ask a question, you get the answer back interactively. The real-time database community, or real-time systems community, does not consider this real-time. What they consider real-time is when you can give guarantees about how quickly the results will come, and the motivating applications are often embedded systems. A self-driving car had better decide whether to go straight or turn right before it actually comes to the intersection in the road, and if it sees something ahead of it, it had better decide to stop, or to slow down, before it is too late and it crashes. All of these are real-time decisions which have to be made within certain deadlines. So there is a lot of old work on real-time databases. In fact, the foremost expert in that area internationally is a professor called Krithi Ramamritham, who is my colleague here; if you want to know more about work in that area, maybe you should get in touch with him. What is the current active research in real-time databases? I am not very sure. I do not think Krithi himself is working actively on real-time databases per se, but he is doing a lot of work on real-time systems overall, including power grid monitoring and reacting to things which happen on power networks. There is a lot of real-time work there, and there is a huge data angle too, because many of these things generate enormous volumes of data, and you may need to make decisions taking lots of data into account. But what exactly are the research issues there? I am not sure; you can get in touch with him.

JNTU, Hyderabad. The first question is on research, sir. Since I am a beginner in research, I would like to start publishing some papers, but before that I would like to do some literature surveys. For that, I want to find out the impact factor of a journal, to judge how good a paper is. So I would like to know the details about impact factors: which journals are good journals to choose papers from, and what minimum value of impact factor should I take into consideration?
And also, if you can suggest some good international journals where I can search for papers and later publish my own, that would be great, sir. And also something about the ISO certification of journals. That is the first question, on research; I will come back to you later regarding the teaching question.

Thank you; this is a good question: which journals should you be reading papers from, and which should you target for publishing? In the computer science community, traditionally, a lot of the focus has been on conferences, more than on journals, and that is still true today. If you want to find the latest interesting work, your first bet is to look at the leading conferences. What most people do is first publish ideas in a conference, and then refine the ideas, dot the i's, cross the t's, extend the ideas a little, and publish that in a journal if they wish to. But many ideas are never published in journals; they are only published in conferences. That is the reason that when I said look at these tier one, two, three venues, I said conferences, not journals: not because there are no good journals, but because the hottest ideas tend to be published first in conferences and appear only some time later in journals.

That said, there are many journals; how do you pick? There has been a lot of analysis of impact factors and so on, and some of these impact factor metrics are, to some extent, questionable. Take a perfectly good journal, the VLDB Journal. One year its impact factor was so high that it ranked among the top journals in computer science; the next year its impact factor went down sharply. Why? Nothing had changed; it is an excellent journal. It turns out that in one particular year there were two or three very highly cited papers, and those papers boosted the impact factor of the journal tremendously. But the impact factor metric only takes into account papers from the last few years; once that window passed, those few very highly cited papers vanished from the window, and the impact factor came down sharply. That does not make it any worse or better a journal. So I would, to some extent, question how the impact factor is defined; it leads to problems. That said, of course the impact factor does mean something, if not the exact value.

One way to do better is to see how people have ranked venues. There have been a few efforts in this direction. In Australia there was an effort to rank both conferences and journals into different categories: A plus, A, B, C, and so forth. These are based on people's perceptions of the venues, not just on raw numbers like impact factor. Of course there is some bias in how people do this: some venues that should be included are missing or ranked lower than they should be, and vice versa. But if you are looking for places which tend to publish good papers, it is worth looking at. It is called the Australian CORE ranking, C-O-R-E; if you search for that on the net, you will find their site. The thing is that they rank all areas of computer science, not just databases. For databases I have put up a few, and I will put up some more on the journal side of things, but you can go to the CORE ranking and dig deeper.
Similarly for other sub-areas such as data mining; they have actually ranked venues by area as well. So it is a good source for finding the top tier and next tier conferences and journals.

Now, if you want to publish in these venues: first of all, do not worry about where to publish. Your first step is to really understand an area and absorb the techniques people have proposed, while at the same time questioning what they have done and seeing how you can improve it. That should be your first goal. Once you have some ideas on how to do something differently or better, you start thinking about them; write the ideas down tentatively. Then you start thinking about how to show that your ideas work: what kind of system to put them into, what kind of database, queries, or other benchmark you should use to compare your ideas with others. So you put together this system, build it, compare with others, write it up, and you have a paper.

Around this point you should be worrying about where to send the paper, and that is not an easy question to answer. If you feel that what you have is a really new idea, something you are really excited about and which you think others will also be excited about, then target a top-notch conference. If you think what you have done is interesting, something people may be interested in, but you are not sure it is a really top-notch idea, it is sometimes okay to send it to a conference and see what feedback you get. People may say the idea is trivial; people may say it is not bad, but not good enough for this conference; or, to your surprise, people may say: hey, this is a good idea, and accept it. Any of these could happen. It is okay to send papers like this to get feedback; you should not overdo it, but it is okay to sometimes send a paper to a good conference. Beyond that, you should have a feel for where the paper has a chance, and target conferences or journals at the appropriate level. You do not want to sink to the level of journals which will accept any paper. There is no point publishing in such journals, except if you desperately need a PhD and somebody is counting the number of publications you have without caring about the venue; then fine, but that is not about research, that is about getting a degree. If you want to do research and have your work recognized by others, try to get it into as good a venue as it can get into. That is how others will notice your work; and when others notice your work, citations come in, and then there are various metrics which can be used to judge the quality of your work. Did that answer your question?

Yes sir, and we have a couple of other questions. Among them, the first would be from a teaching point of view, sir. From the feedback I got from students, they feel that the topics of relational algebra, tuple relational calculus, and domain relational calculus are kind of boring, and most students tend not to listen to those topics. How do I make them interesting and teach them, and how do I differentiate between the tuple relational calculus and the domain relational calculus? If you could throw some light on that, it would be great.

Okay. If you have control over the syllabus, as we are lucky enough to have at IIT Bombay, you need not cover the calculus in any detail; I just spend one lecture introducing it, and that is it as far as the calculus goes.
So I have that flexibility, and I do realize that people get bored: why should we do this, what is so interesting about it? There will be a few very theoretically inclined students who actually find it fascinating, but for most people, yes, what you are saying is true. Relational algebra, however, is different: it is actually very important for the underlying implementation, and if you link relational algebra to SQL, it actually helps people understand what an SQL query does. That linkage will help motivate them. The way we teach it now is to first introduce people to the concepts of the various operations, join and so on, just the way I taught it here: I very quickly covered all the basic relational algebra operators to expose you to them. I did not make you write complex queries in relational algebra; that is optional, because that is not how people do things in practice, but it is still a good exercise to help people understand how to write queries. In fact, in my course at IIT Bombay, though I did not do it here, I do make students write some non-trivial queries in relational algebra to help them get a feel for it, and they find it interesting; it is not that they find it boring. Later on, the connection with the implementation of the query processing system becomes clear, and that also helps keep them motivated.
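As a small example of the kind of linkage I mean, with an invented schema: the algebra expression

    Π_name(σ_dept = 'CS'(student))

corresponds to the SQL query

    SELECT name FROM student WHERE dept = 'CS';

(up to duplicate elimination, which SELECT DISTINCT would match exactly), and a join such as

    Π_name(student ⋈_(student.id = takes.id) takes)

corresponds to

    SELECT name FROM student JOIN takes ON student.id = takes.id;

Once students see this correspondence, the algebra stops being abstract notation and becomes the skeleton of how their SQL queries are actually evaluated.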
Yes sir, one more question regarding teaching. Why don't we cover the development of database management system software? I understand it is a very complex task, but even if we could do some part of it, maybe with some simulators, it would be advantageous. Is there any research scope in that, and if we have to do it, how would we start?

Teaching students how to develop a new DBMS: that is a good question, and there are several possible responses. The way this was done in Wisconsin, where I did my PhD, was to have a course where they would provide an API for the various layers of a database system, and people had to build the system bottom-up. You had to write code implementing the lowest-level API; they provided test programs to test that level, and after you cleared those tests, you built the next layer up. You would start from the storage manager, then build indexing, then a very simple SQL engine; I do not remember whether they did concurrency control, maybe they stopped at the SQL engine. So students built a few layers of a real database system. It was a toy; you cannot actually build everything in a real system in one single course. But it was a very implementation-intensive database course, a course by itself, and it does take effort to run. When we started teaching this at IIT Bombay, another colleague of mine, who also did his PhD with me in Wisconsin and had joined IIT Bombay earlier (he left later on), did a similar thing here. But after a few years we found that asking students to build the same thing over and over again did not work: with the same system and the same API, people would just get code from their seniors and submit it as their own. They would not take it seriously; they were not learning, because they were taking shortcuts, and it was difficult to check this. So we took a slightly different approach. We have a database internals course where we no longer use a toy system; we use PostgreSQL.

We actually have students do projects which modify parts of PostgreSQL, to add some functionality or change how it does something. That is actually very nice, because it is a real system: they are exposed to an actual system and its intricacies. The flip side is that it is a complex system and takes time to understand, and if you have a single course which covers everything from SQL all the way to internals, there is no time for all of this. So I do not get into it in my first course; I only cover it in a separate internals course. PostgreSQL is a wonderful resource: you can essentially dig into it and modify, say, query processing, indexing, or storage. There are so many things you can work on, actually implement, and show working. We have a list of projects which we suggest to our students, and I will be happy to share it if anybody wants to run such a course.

This is Vaishnav Institute. Sir, my question is regarding the quality of journals. Is there any certifying authority which can measure whether a journal is fake or not? And if it exists, what procedure does it follow?

Okay, that is a good question. Unfortunately, there is no such certifying authority that I am aware of, but the closest is the Australian CORE rankings, which say that these are good places. What we do not have is something which says these are bad places. This is something I and some colleagues have been tossing around: should we get into the exercise of listing journals which are bad and urging people not to submit papers there? It is something we have been thinking about, and if we get the time, we may do it at some point. But if you find a venue listed in the CORE rankings, it is probably a good place. So at least you have a positive list; we do not have a negative list as of now.

Next we have, Kanmuga, I think. Yes. Sir, is it possible to store graph data in databases? Can you give us some ideas?

Graph data: is that your question? Yes, sir. This is actually a very hot topic currently. There are extremely large graphs available today. If you look at Facebook, there are, I think, a billion people on it, and each of them has lots of friends; so it is a graph with a billion nodes and an edge for every friend relationship. Facebook has an enormous graph, and that is just one example; there are many other examples of graph data which are really large, very much in the big data realm. So there has been a lot of work on how to process queries on such graph data. There was a paper from Google on a system called Pregel, then another system has come out, and there are some open-source systems, all trying to figure out how to deal with such big graphs. There has also been work on how to partition very large graphs to break them up. So it is a good area with a lot of current interest, because big data techniques are available now, very big graphs are now real, and how to do interesting things on such large graphs is an active area.

Now I will take a couple of topics from the chat. Somebody asked to what extent plagiarism is accepted, and I would say zero. What do I mean by zero plagiarism? When you write a paper, there are definitions which you have read and want to use in your own work. You can use the definition, but you have to say that we use the following definition from this paper.
So what you are doing there is not copying without acknowledgement: you are setting up the stage for your work, saying this is past work, this is how people have defined it. You may even describe a prior algorithm and say this is how that algorithm works. In all such cases you are citing it: you are saying this is the paper which described the idea, and this is how they did it. When describing definitions, and maybe algorithms, sometimes you have to take them as-is; but anything else you should put in your own words, not take verbatim. You cannot take paragraphs, or even sentences, from somebody else's work and put them into your paper without citation. So the only things you can copy as-is are things like definitions and algorithms which you cannot change, and they should be properly cited, saying that we took them from there. Then you should make clear what you have done that is new. If you follow this, it is not plagiarism. But if you just take paragraphs from some paper, and it is not clear that it is work taken from somewhere else, then that is plagiarism, and there is essentially zero tolerance for it.

The next question: impact factors confuse a lot of people; what is the highest impact factor, and how do you choose? It is a complete mess. The impact factor is, as I said, a questionable metric in some sense. If you take the raw numbers, impact factors go up and down. Yes, the best journals do have the highest impact factors, and the worst ones may not have good impact factors at all; but in between, it is very difficult to take the impact factor as a raw number and do something with it. As I said, you are better off looking at rankings like the Australian CORE ranking to decide where something fits: is it tier one, tier two, tier three? These are all considered acceptable; tier three is still considered good. Then there is maybe a tier four which is considered not so good, and workshops, which still publish interesting things. After that it is unranked, meaning either they do not know about the venue (that can happen; there are perfectly good venues they do not know about), or it is something not considered really good.

There is also a whole other class of workshops which are a good way to publish initial results. Typically, every conference has a number of workshops associated with it. The workshops themselves may not have a very high ranking; they may not even appear in these rankings. But they are a good place to publish work initially, get feedback, and talk with others working in the area; later on, you can extend the work and publish it in a higher-ranked conference. So you should consider publishing in workshops which are attached to larger, higher-ranked conferences.

Another question: is it okay if we publish literature reviews in leading journals? This is a harder question. Some journals do accept surveys of an area, and they generally do it for an area which is considered new, where they want people to know more about it. But this can easily degenerate. Every MTech student here has to do a seminar, which is basically a literature survey of some area. We could take every single MTech seminar report and publish it as a literature survey of some sub-area; is it worth publishing? That is questionable. A survey has to be substantial enough, and have real value, to be considered a publication worth counting.
You could put it up on your website and make it available to others; that is a good way. Our seminar students' reports are used by other students who want to learn more about an area, so they are used, but I would not consider them true publications.

Okay, I will take the last few live questions. Rajarambapu Institute. Sir, a query regarding development of an application: a database that contains around 25 lakh faculty records. We want to develop an application that covers all the faculty of India in one place. Could you suggest a design for this database? Should we go for a split into several tables, or a single table with many attributes?

So, we have covered ER modeling; you should apply it to your domain. First of all, decide what you want to record; then do an ER model of that, and it will lead you to the correct solution. It is all a function of what you want to record and what you want to do with the data. And what is the use of compiling a list of 25 lakh faculty? Who is going to give you the data? What is going to happen to the data afterwards? There should be something interesting one can do with the data you collect, and you should be thinking about that even before you start any data collection exercise. That will then lead you naturally to the schema you should be using.

Sarvajanik College, Surat. Can you throw some light on research possibilities in the areas of database testing and database refactoring?

Database testing and refactoring. On refactoring I do not have much to say; on testing, definitely. There are many sub-areas in database testing. One kind of testing is: how does the system perform under load? Some of that is not research, but some things here are in the domain of research. For instance, how do you generate data sets of sufficient size which satisfy certain properties? The problem is that sometimes there are large data sets available, say banking data, but nobody is going to give you that data; it is considered secret, and there is no way they will give it to you. But you might be able to get some information about the data: what are the sizes, what are some properties, what are the distributions of transactions, and so forth. The goal, then, is to generate data sets which respect those properties in terms of load distribution, size distribution, and so on. How do you do that? How do you generate really large data sets, at big data scale, in parallel? There has been a lot of interesting research on this topic in recent years. So that is one side: generating data sets which have given properties. Another area, which I mentioned earlier, is how to generate data sets that can help you check the correctness of queries. We have done some initial research in this area, and there have been a few others, but there is a lot more to be done; it is still in its infancy, in particular for complex queries, whole applications, and so forth. So there is definitely work to be done in this area.
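As a tiny illustration of that first problem, generating data which respects given properties: the sketch below fills an invented accounts table with a requested number of rows whose balances follow a given normal distribution. Real generators must handle correlations, skew, and parallel generation at scale:

    import java.sql.*;
    import java.util.Random;

    public class DataGenerator {
        public static void main(String[] args) throws SQLException {
            String url = "jdbc:postgresql://localhost:5432/testdb"; // placeholder
            int rows = 1_000_000;                      // target size property
            double mean = 50_000, stddev = 20_000;     // target distribution property
            Random rng = new Random(42);               // fixed seed: reproducible data
            try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
                conn.setAutoCommit(false);             // batch inserts for speed
                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO accounts(id, balance) VALUES (?, ?)")) {
                    for (int i = 0; i < rows; i++) {
                        ps.setInt(1, i);
                        ps.setDouble(2, mean + stddev * rng.nextGaussian());
                        ps.addBatch();
                        if (i % 10_000 == 9_999) ps.executeBatch(); // flush periodically
                    }
                    ps.executeBatch();                 // flush the remainder
                }
                conn.commit();
            }
        }
    }

The same skeleton extends naturally: sample categorical columns from measured frequency tables, or enforce referential properties by generating the keys first.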
Then there is testing of database systems themselves: how do you know the database implementation is actually working correctly? And another kind of testing is with respect to concurrency. This is actually a big headache, because it is very hard to debug a system and be sure it will work fine when things are done concurrently. Things can and will fail if you do not code your system properly. So how do you test systems where concurrent accesses can cause trouble? Can you build tools which will help you stress your application and expose concurrency problems? That would be a very interesting thing. We have done some initial work here, analyzing the database queries and transactions generated by an application, but that was very preliminary, and there is a lot more to be done. These are problems you actually see, with things going wrong; can you build a tool which tests an application to see whether it is vulnerable to such race conditions? That is an interesting direction. People have also been looking at it from a different angle: can you take an application and prove that it is safe? There has been some work from Microsoft Research in Bangalore in this context. If you cannot prove it is safe, maybe you should change something which will then allow the system to show that it is safe. Or, sometimes, you declaratively specify what you want, and the system generates the required semaphores or other concurrency control mechanisms that ensure safety. So there are many different approaches to this. Databases have inspired some of this work, because the notion of a transaction is very powerful, and people have been trying to use it to build other parallel systems which are not database systems per se. There is a lot of ongoing research in all these areas, and testing of these systems is pretty important.

Any follow-up question? Yes sir, can you throw some light on research areas in data warehouse schema design?

Research in data warehouse schema design: I do not know that to be an active area of research currently. There was a lot of work on warehousing a while ago, and I am not aware of current work in this area. But big data has certainly changed a lot about how we do things: the sources of data are different today from what they were a while ago, the volume of data is different, and the systems used to manage big data are different from the old systems. So how do you do extract-transform-load in parallel, with very large data sources? There are interesting issues which are going to reappear in the big data context; whatever work was done earlier is going to be revisited and looked at again in that context. So there will be work there, but on schema design specifically, I do not know. Thank you.