But there is nothing great about the projects yet; they are still in the development phase. You may have to work really hard to reach a level where we can say there is a real project. So these are the five projects, and the first one is data analytics for educational data.

Now, data analytics did not just start because of big data; we had data analytics before, so that framing is not correct. The reasons for big data analytics are different, and they come down to three things, the three V's. First, volume: not one transaction, not one query, but terabytes of transactions, records and tables. Second, variety: structured, semi-structured and unstructured data, all of it available at once. Is Wikipedia structured, unstructured or semi-structured? Data can come from anywhere, not only from one system; it can come from the cloud, from Google APIs, from anywhere, so you have to find where the data is and how to make that connection. Think of how Facebook works: it has partners from which it collects data, and somehow it relates that to your social media activity and tells you that you also have an account somewhere else that you never disclosed to it. Or Google tells you the same thing, and you wonder how. It is not internal data; it is data from everywhere, plus metadata. Third, velocity: batch, near time, real time.

Batch analytics is something like this: you have a huge number of records and you need to compute something over them. That is the general notion from some 20 years back, when data was mostly kept in historical form: data warehousing systems kept the data in a warehouse, you did an ETL, and then it might take another two days just to run one query and get your results. But batch is becoming different now. You are not necessarily going to work on all the data; you may want a substantial subset. If you have an exabyte of actual data, you may want to work on one terabyte of meaningful data, and that is your batch: you push in that much data at one time. Some schemas are already defined, some schemas are evolving, and unstructured data sometimes gets converted to structured form. Then there is streaming: data keeps flowing onto your system and you want to do analytics on it as it arrives. And people may want real time. Real time means you cannot wait; stock market analysis is like that. You get the data and you must answer immediately. How is that possible?
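To make the batch-versus-streaming distinction concrete, here is a minimal Python sketch; the file format and the data source are hypothetical, and a real system would of course use a proper framework:

```python
import itertools

# Batch mode: read the whole (meaningful subset of the) data at once
# and compute one aggregate over it.
def batch_average(path):
    with open(path) as f:                 # e.g. a 1 TB "meaningful" subset
        values = [float(line) for line in f]
    return sum(values) / len(values)

# Streaming mode: records keep flowing in and the running aggregate
# is updated immediately, which is what near-real-time analytics needs.
def stream_average(records):
    count, total = 0, 0.0
    for value in records:                 # an endless feed, e.g. stock ticks
        count += 1
        total += value
        yield total / count               # the current answer after each record

# Example: running averages from a (finite) fake stream.
print(list(itertools.islice(stream_average(iter([10.0, 12.0, 11.0, 13.0])), 4)))
```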
One answer is in-memory data management systems, such as those built by the Hasso Plattner Institute in Germany; by the way, they have also written a book about this. They talk about analytics being done in memory rather than on the disk, so you do not read from the disk all the time. You keep everything in memory, with the data compressed or encoded in some form so that it fits. Memory is not that expensive any more; you can buy 128 GB, and even 1 TB of memory is possible nowadays. You put everything in memory and nothing on the disk, and that is how real-time analytics, even on streams, sometimes becomes possible. If you have a mix, you are somewhere in between; with little memory you are going back 20 years, with more memory you are moving toward real time.

Before any of this there is an ETL phase. ETL is extract, transform and load, so let me go through it. What is the E phase? First you need to acquire the data: where is the data, which sources? Then you extract from it, based on a certain schema that you need; you may not want to extract everything. You may need to clean the data: where you have null values, what are you going to do? Or some other condition appears and you can see your final result will be affected if you do not clean it. You may need to annotate sometimes: some schema elements are missing, or you want to populate something extra into the data. You are allowed to do that, as long as you record exactly what you did, so that when you do the final interpretation or analysis you can remove the things you introduced that were not originally there. Then there is the transform phase: I have structured data here and unstructured data there, and if I want everything structured, I need to do something, so that is transformation. Aggregation happens here too, if you want it, and integration, because you may want to integrate one set of structured data with a set of semi-structured or unstructured information and put it all in one place. That integrated result is what finally gets loaded. In the load phase you have the data plus the whole metadata of your system, and now you know what to do, because you are going to fire queries, aggregate, and represent the results. These are all jargon terms, so do not worry. In the simplest sense you extract something, you transform it for your own use (wherever a comma is missing, you put a comma), and then you load it somewhere; that is it. If you have actually seen real data, you know it gets a little more complex than that: somewhere a comma is missing, and you have to do something about it.
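In the simplest sense, such an ETL round trip might look like this sketch in Python with pandas; all file names and column names are hypothetical:

```python
import pandas as pd

# Extract: acquire the raw data and keep only the fields the schema needs.
raw = pd.read_csv("transactions_raw.csv")          # hypothetical source
extracted = raw[["student_id", "resource", "marks"]]

# Clean: decide what to do with nulls before they distort the result.
cleaned = extracted.dropna(subset=["marks"])

# Annotate: record what was done, so the analysis phase can account for it.
cleaned = cleaned.assign(etl_note="nulls_in_marks_dropped")

# Transform: aggregate/integrate into the structure the queries expect.
transformed = cleaned.groupby("resource", as_index=False)["marks"].mean()

# Load: write to the store the analytics queries will run against.
transformed.to_csv("warehouse_marks.csv", index=False)
```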
Then, after the data has been loaded, you analyze it with the queries you have already written, the ones that are important for you, and you check your predictions: you hypothesized earlier that this is what should happen in the data, you look, you match, and you say my analysis is correct. So that is analysis and modeling. Modeling is generally about understanding the behavior of the system, which really means the behavior of the data. If you find that something is affected by some other data or some other source, you derive a conclusion: this is dependent on that, this is independent of that. That is the model you finally build. It can be a mathematical model; a model can be a simple equation. y = sin(x) is a model, that simple. There may be other things, statistics involved, in establishing that this is your model. Then you do some interpretation; I will not go into interpretation, because you can interpret the results yourself.
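As a toy illustration of checking a hypothesized model against data, here is a Python sketch for the y = sin(x) example; the synthetic data and the noise level are of course hypothetical:

```python
import numpy as np

# Hypothesize a model, then check it against the data: here the
# hypothesis is y = sin(x), i.e. y depends on x in a known way.
x = np.linspace(0, 2 * np.pi, 100)
observed = np.sin(x) + np.random.normal(0, 0.05, size=x.size)  # noisy data

predicted = np.sin(x)                     # the model: y = sin(x)
residual = np.mean((observed - predicted) ** 2)

# A small residual says the hypothesis matches what the data shows.
print("mean squared error of the sin(x) model:", residual)
```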
Now, tools for big data analytics. One month back I felt Hadoop was the right thing to use, but today I am feeling Hadoop is not good enough. Hadoop is widely used, but it is mostly a disk-based system: every time it does something it writes to the disk, which is not good, because the disk is slow. How do you get real-time analytics that way? If the stock market were running on Hadoop, I do not think it would work. It is okay for Facebook, because you do not care about latency there: somebody sends you a message on Facebook, and you see it only when you get it; how would you even know you should have received it earlier? Though in fact it is not quite correct to say that Facebook simply ran on Hadoop; they did some re-engineering underneath and they have a pretty good system now. Huge clusters do exist, almost 5,000 machines and above, and Facebook collects almost 5 petabytes of data every day, whatever you are doing.

Why do people do that? You may ask, and the answer is that it is a norm: any data that is available or being collected has to be archived somewhere. Many companies who operated on huge data collections, who had their client base and huge databases, did that, and when their CEOs and other high officials were asked what they do with the data, they said: we do not know, it may be useful after 10 years. So remember: no data is to be thrown out. This has been the practice; data is always archived somewhere. Do not think old data can be thrown out; it is historical data, and it is important, like your own history.

Underneath you have a distributed file system, a file system with somewhat enhanced features. Why is it called distributed? Because the objects are replicated across nodes: the storage space allocated to those objects, where the data exists, is replicated to many places. And it is supposed to run on commodity machines, remember, so do not ask for a supercomputer if you want to do a project. Take 5 or 10 small machines together and you should be able to demonstrate.

The success really came from the MapReduce idea: just a map step and a reduce step. I will not go into the details of MapReduce; it is about key-value pairs being mapped onto workers. It is like saying: I have so many things to do here, I take a bunch and tell each person, please do sorting, please do sorting, and so on. Now if I have five people who can sort, I may need only one or two people, not more, to merge. That is map and then reduce: in the reduce phase you do the merging and so on. It depends on what you want; you will not merge every time and you will not search every time.
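The sorting-and-merging analogy can be written down as a minimal MapReduce-style sketch in Python; the worker count and the data are arbitrary, and a real job would run on a cluster rather than local processes:

```python
import heapq
from multiprocessing import Pool

# Map phase: each worker independently sorts its own chunk of the data.
def map_sort(chunk):
    return sorted(chunk)

def mapreduce_sort(data, workers=5):
    size = len(data) // workers + 1
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(workers) as pool:
        sorted_chunks = pool.map(map_sort, chunks)   # many mappers sort
    # Reduce phase: one merger combines the sorted runs into the answer.
    return list(heapq.merge(*sorted_chunks))

if __name__ == "__main__":
    print(mapreduce_sort([5, 3, 9, 1, 7, 2, 8, 4, 6, 0]))
```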
Then you have Hive. Hive is high level: you do not have to write MapReduce programs if you have Hive. Hive is very much like SQL, but not exactly SQL. Remember, Hive is supposed to do much more than what SQL can do, though not in the conventional sense, where SQL can do more than Hive. Much of Hive's syntax is very similar to SQL-92; that is the standard it is compliant with, and those are the queries that are supposed to run. And there are other queries: suppose I have JSON objects or some other objects; I can query those objects directly rather than writing a flat table-like structure. A JSON object is hierarchical, like an XML file: there is a hierarchy, and some fields exist in one place and not in another. Hive does have queries which can interpret such objects, extract information based on the query you write, create views, and so on. So it is both NoSQL-like and SQL-like; both exist, and I have seen both run equally well, not bad.

Then Mahout. I think Pushpa Gurangai talked about Hive and Hadoop somewhere in the morning, but I think he missed Mahout. These are all Apache projects, and Mahout is one of them. Mahout is a library of machine learning models, so that you do not have to keep writing them again: they are already available and you just use them. If you want to do something with the data, say predictive analysis or clustering, many of the things available in data mining are available in Mahout. It is a library for data mining kinds of jobs, like RapidMiner and Weka, which are also used.

I do not know why I fell in love with KNIME; I found it interesting, though there is nothing magical about it. That is just how I feel; I have never installed KNIME, but I feel it could be very useful. It is a front-end system, a front-end interface for your data analytics, so you do not have to write that part yourself; that system is available. I want to see how it looks: is it difficult to use, is it flexible, is it going to work across different platforms? That is also important, because today it is Hadoop and tomorrow it may be something else; we do not know. Presently it is Hadoop, so do not worry. Till date I have only really seen Hive and Mahout at the most.

Then you have SARAS. With SARAS you can create a big data workflow without writing code, and such workflows can be mapped onto the underlying query system; once you couple them, you can shuffle the steps here and there. But remember this is still in the development phase: there is only one case study done on SARAS. So if you are going to use it, be careful, and you may have to work more on understanding what a workflow is; first you have to know what a workflow is at all. We will come to that later, not today.

Now, there are interesting questions. How many students have never viewed learning resource A? You may say, okay, a question. The first thing is not to ask a question which cannot be answered by your data, even though you can imagine something magical. Ask a question much closer to the data, and then build the other ones on top of it. Do not say: I want to know how many students will ever take up a job in TTS. That will be useless to answer, because you do not have that information. Get closer to the system and say: this is the information that I want. It is quite simple in the beginning, not very difficult, but you have to run some queries; if you do not write them, you will not get anything. So that is important. How many students have never viewed learning resource A? Good, very simple. If students do well on activity B, do they also do well on activity C? That will give you some information; it is up to the analytics team to understand why such a thing is important. Generally, programmers are not even allowed to ask these questions; programmers are just told, I want this information, can you give it to me? Here you are actually playing a dual role, but we can sort that matter out later. If students all access E (I will just go faster); what is the average mark on quiz G among students who have viewed resource F; and questions like this. I am not saying that what I have written is interesting; I am saying that questions like these are interesting and you need to write more and more of them.
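Questions like these map directly onto small queries. Here is a sketch in Python with pandas, with hypothetical view and marks tables standing in for the real logs:

```python
import pandas as pd

# Hypothetical tables: who viewed which resource, and marks per quiz.
views = pd.DataFrame({"student": ["s1", "s2", "s2"],
                      "resource": ["A", "B", "F"]})
marks = pd.DataFrame({"student": ["s1", "s2", "s3"],
                      "quiz_g":  [55, 80, 62]})

all_students = set(marks["student"])

# Q1: how many students have never viewed learning resource A?
viewed_a = set(views.loc[views["resource"] == "A", "student"])
print("never viewed A:", len(all_students - viewed_a))

# Q2: average mark on quiz G among students who viewed resource F.
viewed_f = views.loc[views["resource"] == "F", "student"]
print("avg quiz G given F:",
      marks.loc[marks["student"].isin(viewed_f), "quiz_g"].mean())
```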
Next, recommendation systems. You have content-based recommendation systems, based on preferences: this is my preference, and Google recommends me something based on it. That is the first kind. The second is collaborative filtering: somebody bought a book on Flipkart, and based on his preferences, or based on the sequence that he bought this book and then bought something else, Flipkart now asks me to buy something too. So, recommendations based on other people's preferences. And there are hybrid approaches. These are not very difficult; you just have to know them.

Then, adaptive learning systems. Why am I coming to adaptive learning? A recommendation system is one thing, but it is not complete, because it has no knowledge of what happens between two different domain concepts. That is not enough, and if you need to represent this, just recommending will not do: it will be like a graph, and in a plain graph you may not be able to handle so much information, so you would need a structure which maps to a graph. I am not going to define that structure here. Imagine a student who requires a highly personalized tutoring system: after one step he gets some marks, then he gets a particular activity, and after doing this activity he gets something else, but not everybody gets the same thing; everybody's flow is different. That is what this is all about: highly adaptive navigation. You can go in any direction you want, say I want to learn physics, I want to learn mathematics, I want to learn something else, but the path will be chosen by the system and given to you.

Now, how do you choose such a path? There are knowledge dependencies between domain objects: if you know something, then it can be said that you know something about a related concept, based on that. That needs to be codified in the system, and there is a model which allows you to code such a thing: the fuzzy cognitive map. I am not going to get deep into fuzzy cognitive mapping here; we will see it later, and there are many, many papers which you can read. A minimal sketch of the idea follows at the end of this passage.

Now, the questions that need to be addressed. If a student learns concept A, what is her or his knowledge level on the dependent domain concept B? There is a dependency relation here: if knowledge increases on one concept, do you see an increase of knowledge on the other, dependent concept? Take for loops and while loops: if you know for loops, maybe you know while loops, or if you know while loops, maybe you know for loops; that is for you to decide. And if the student's knowledge of concepts A, B and C improves, you can also have multiple dependencies, where something depends on several things together; you can look into this type of question. Is your system going to answer such questions? That is what matters, and your requirements phase will tell you that this is what we want; there will be an SRS and all those things to be done. And in the other direction: if the student has misconceptions on a domain concept A, how is her or his knowledge level on the dependent concepts B, C and D affected? Can you say that there is a misconception somewhere else? It looks very similar, but there is more to it, to why and how you can do it. I have listed around 30 or 40 papers, I do not remember exactly how many; you can just go through them.
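Here is the promised sketch of a fuzzy cognitive map update in Python; the concepts, the weight matrix, and the activation rule details are hypothetical, and real work would calibrate them from the literature or from data:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical domain concepts and dependency weights: W[i][j] is how
# strongly knowledge of concept i influences knowledge of concept j.
concepts = ["for_loop", "while_loop", "recursion"]
W = np.array([[0.0, 0.8, 0.3],    # for_loop   -> while_loop, recursion
              [0.7, 0.0, 0.3],    # while_loop -> for_loop, recursion
              [0.0, 0.0, 0.0]])   # recursion influences nothing here

# Initial knowledge levels in [0, 1]: the student just learned for loops.
state = np.array([0.9, 0.1, 0.0])

# Iterate a standard FCM update until the activations settle: each
# concept's new level is a squashed sum of its own level plus the
# weighted influence flowing in from the concepts it depends on.
for _ in range(20):
    state = sigmoid(state @ W + state)

for name, level in zip(concepts, state):
    print(f"{name}: {level:.2f}")
```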
The third project is load testing and benchmarking for big data. Big data is not a recent phenomenon; it has happened before, and it is happening more now because, as I said, of the three V's. You start from something like OLTP, online transaction processing: a banking system, say, where people are debiting and crediting all the time; that is mostly OLTP. For OLTP we generally had benchmarks, notably TPC-C, and benchmarks for ad hoc querying (if you do not know the meaning of ad hoc, look it up in the dictionary). TPC is the Transaction Processing Performance Council, which formed long back, when industry and academia got together and built a sort of benchmark for systems. They want to say: this machine has this much performance, or this price per performance; they have metrics for it. How do you do that? You do not just load the machine with huge files and say, oh, it can take all of my files; what matters is what happens during processing. So the benchmark sometimes has a mixed workload. TPC-C generally focused on OLTP-type transactions, customers firing requests for debit, credit, transfer, all the time. Then you had OLAP, online analytical processing, where the historical data was put in a warehouse and the analytics done later, and knowing whether a system could do such analytics was important, so a benchmark was introduced there too: TPC-H. These phases keep happening, one after another, and we should be focusing mostly on the newer ones, but we cannot leave those two behind, because they have some history to tell us, some of their ingredients are still important, and TPC-C and TPC-H need to be carried forward for understanding these newer things.

So, the newer categories. Flexible latency analytics: low latency is desirable but not essential. You may need a system with low latency, but nobody can promise zero latency or some fixed fraction; it can vary across a range for the full data set that you have in the system. This matters if you want to expand, if you want to do capacity planning for your system: how much more hardware to put in, how much memory, probably. Then interactive analytics: this generally has a daily cycle. There are patterns, like what people do on Facebook, or daily stock market patterns; the workload is such that the queries follow this kind of workflow pattern, and that is what is called interactive analytics. There is nothing deep about these words; it is some sort of pattern, with low computational latency, and it is slightly broader than OLAP, as you may come to know later. Then semi-streaming analytics: recent data rather than historical data. You do not consider the old data; you just consider the recent data you have collected over, say, the last 5 or 10 days. You may also want to do some schema discovery: what is the schema of that data? For example, click-rate statistics for a product: for a particular product you will be interested in knowing how often it is being clicked, whether it is going well in the market. These are the things that are handled, and these types of patterns are all very different from one another, by the way. You will understand this only while looking at the workload: is it a read workload, a write workload, what combinations and patterns are possible.
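As a small illustration of the semi-streaming case, here is a sketch in Python with pandas that computes click-rate statistics over only the recent data; the table and column names are hypothetical:

```python
import pandas as pd

# Hypothetical click log: one row per impression, clicked is 0 or 1.
log = pd.DataFrame({
    "product":   ["p1", "p1", "p2", "p2", "p1"],
    "clicked":   [1, 0, 1, 1, 0],
    "timestamp": pd.to_datetime(["2016-01-01", "2016-01-02", "2016-01-08",
                                 "2016-01-09", "2016-01-10"]),
})

# Semi-streaming: keep only the recent data, say the last 5 days.
cutoff = log["timestamp"].max() - pd.Timedelta(days=5)
recent = log[log["timestamp"] > cutoff]

# Click rate per product over that recent window only.
print(recent.groupby("product")["clicked"].mean())
```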
Your benchmark data has to be representative: it should be possible to replicate, to demonstrate with a real-life example. It should be relevant: not science fiction, but tied to a particular domain. It should be portable, of course, because today it is Hadoop and tomorrow it may be something else, and the benchmark should still stand. And it should be scalable. Suppose I actually want to benchmark a cluster of 50,000 nodes. If you ask me today, I will not be able to give you 50,000 nodes, but your benchmark has to be such that if I put it on 5 nodes and then do the multiplication, you should be able to assure me that, after multiplying, it is going to run on that many machines. You understand? That is called scalability: if it runs on 5, it has to extrapolate to 50,000, provided such data exists and such machines exist. Then there is multiplexing: with a MapReduce workload it is not only Hive queries or structured queries that run; there are also programs written by various tools in the MapReduce format, and those also need to be exercised. Maybe we will not go into that, and a query-like framework will interest us most, but a full-blown benchmark should have all of these things.

Very important is workload generation. You need to generate what you want to test: do you want only reads, only writes, or is there a pattern you want to follow? And scaled-down workloads, which I just explained: for a 50,000-node cluster, can you scale the workload down to this much and still say it is going to perform exactly the same at that scale? Then there are empirical models built from workload traces: your workload generation should be able to take care of generating those types of workloads, so that if you find in real life that such a workload exists somewhere, your generator should be able to produce the same thing. Those traces can be broken down into small steps, and then you can generate them; see the sketch below.
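A minimal sketch of such a read/write workload generator in Python; the operation mix and the scale-down factor are hypothetical knobs, not taken from any real benchmark:

```python
import random

def generate_workload(n_ops, read_fraction=0.8, scale=1.0, seed=42):
    """Yield a synthetic stream of read/write operations.

    read_fraction -- fraction of operations that are reads (the mix)
    scale         -- scale-down factor: 0.001 keeps 0.1% of the full load
    """
    rng = random.Random(seed)              # deterministic, so runs replay
    for _ in range(int(n_ops * scale)):
        op = "read" if rng.random() < read_fraction else "write"
        key = rng.randint(0, 9999)         # hypothetical key space
        yield (op, key)

# A full load of 1,000,000 ops scaled down to 0.1% for a 5-node test run.
for op, key in generate_workload(1_000_000, read_fraction=0.7, scale=0.001):
    pass  # here the benchmark harness would issue the operation
```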
The next project is about edX and Moodle. Here you have the edX platform, and then you have Moodle, which has been used extensively. Now, there is a lot of data in Moodle; can we push it onto the edX platform? If there is a lot of data in edX, can we push it into Moodle? In some cases it may be useful to do some things in Moodle and some on the edX platform. So is it possible, if you have a scenario with a centralized system and distributed nodes located far away as remote nodes, and you want to synchronize some data, that the centralized system is just an edX installation and the remote installations run Moodle, because Moodle is very common? In many colleges you may have Moodle installations connecting to IIT Bombay, like in the T10KT program which is running; you must have heard about it, the training of 10,000 teachers. So can we have edX as a centralized system and Moodle distributed across? These are the questions that need answering.

The last project you might not have heard about; it is just an IDE which gives you the capability to run, say, C or C++ or Python programs. For a beginner, say a first-year student who does not yet know what any of this is, seeing Eclipse or NetBeans is frightening, so something very simple is good for them. Even the current version looks difficult, so now we have thought of simplifying it; it is written in C++. Among the features: it is there on Ubuntu and on Windows, and we needed something cross-platform, not just running on Windows, because there is a problem in other colleges where they only have Windows machines, while here you have Linux machines. That was the important decision point for why this one was chosen and not something else. What we need may look very simple in the beginning, and I do not think it is that simple to do, but paradoxically it is simple. There is a problem, not a big one, but I will tell you: every time you say New, it creates a project, and in that project you have to build and all those things. When you are a kid, a first-year coming to programming, that is scary: what on earth is build, build, build every time? You might have used Turbo C++: there is no build step, you just write the program, compile, run, execute, and something comes out. It should be like that. Not only that, there may be other features and extensions, and you have to find out what those could be. It has also been integrated with simplecpp, the package developed by Abhiram Ranade here at IIT Bombay. There is one M.Tech student working on it and probably two or three project staff, so you may get help from them.

And here, this is a turtle; the red thing is a turtle. The turtle just moves and creates all these figures. The turtle code is very simple: move forward, turn 30 or 60 or 90 degrees, like that. The code is not difficult at all: students do not have to write int main and all those things, they do not even have to read any value. They just say forward 90, backward something, left 90, and so on. It is as simple as that; that is the whole code.
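The simplecpp turtle itself is C++, but the same spirit can be illustrated with Python's built-in turtle module: no main function, no input, just movement commands:

```python
import turtle

# Draw a square: the entire "program" is just movement commands,
# the same beginner-friendly idea as simplecpp's forward/left.
for _ in range(4):
    turtle.forward(90)   # move 90 units ahead
    turtle.left(90)      # turn 90 degrees counter-clockwise

turtle.done()            # keep the window open until it is closed
```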
With first-year students, though, you have to be very cautious: give them too big a dose and they will slip. So this was introduced first, and then it moves slowly toward C and C++, which is a different way of teaching. In that context, for such execution, the integration has already been done, but more work remains. There are some bugs here and there; they always remain, and when you say run, it has to run. A simple one-click installation is the most important thing in this project, because the current version is not friendly for a novice. You can read all of this later; the source code and everything are available, and this is our link. That is everything. This is our wiki, where lots of projects are listed; so many things are there. It has been written by human beings, not generated by a computer, so some of them may be useful for you and some may be useless; do not think that everything will be useful. And there is another page, the summer internship page. We will be creating a lot there; maybe we will give you an account on this page, and then you will write something good and presentable like this. These are the references; they are very important. Thank you so much.