So, we are the IIT BombayX search engine team: myself, Mayankpal, Harsh Mohan, Bipu Dutta Panik Rehi, Mohanlal Kalvanya and Gyanesha. Before moving into each component of the search engine, I would like to present an overview of what exactly our problem statement is, what this search engine is basically doing, and how the different components interact with each other.

Most of you have probably already used the IIT BombayX platform. It is based on Open edX, so it already has an inbuilt search feature where you can search for anything and get a result. So what is the motivation behind building a search engine on top of that? The search that the IIT BombayX platform provides is not efficient, and it does not search everything that we put into it. The server currently has two kinds of databases, MongoDB and MySQL, and the built-in search only looks into the MongoDB database. What we have tried to build, and partially succeeded in building, is a search engine that crawls and searches both MySQL and MongoDB.

This is the basic workflow, the component interaction, of our project. As the previous group explained, there are different instances of IIT BombayX; those three or more boxes are the IIT BombayX instances, and the dotted square is what we have built. First of all, what is the input to the search engine and what is the output? Any user who uses the search engine simply types whatever he or she wants to search for, and the result comes from the database itself.

So first of all we need data for those results, and for that we have crawlers. We need two types of crawlers, one for MySQL and one for MongoDB. The specialty here is that this does not happen sequentially; it is not that one crawler first crawls everything out of MySQL, posts that data to Elasticsearch, and only then another crawler goes to MongoDB. Everything happens in parallel; this is the threading part, which my friend will explain later. Crawling of MySQL and crawling of MongoDB take place at the same time, and at the very same time as crawling we are posting that data to Elasticsearch.

Elasticsearch is the core of this project; it is what we use to search our data. While we are crawling, everything gets posted to Elasticsearch, and then we get the results in our front end. For the demo we have included only three indexes in Elasticsearch as of now, but we can easily add more. Indexes are nothing but databases in themselves: one each for content, enrollment, and interactions; these will be explained later. Then there is the searcher, from where the search itself starts: the user types a query and gets the result. We have also integrated a cache into it, for obvious reasons. So whenever a person searches for something, first of all the data is already there in Elasticsearch.
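To make that searcher-side flow concrete, here is a minimal sketch of a cached Elasticsearch lookup. The index name course_content matches one of the three indexes mentioned above, but the node URL, the field name "content", and the function name are assumptions for illustration, not the project's actual code.

```python
# Hypothetical sketch: the searcher checking a cache before querying Elasticsearch over HTTP.
import requests

ES_URL = "http://localhost:9200"   # assumed Elasticsearch node
_cache = {}                        # simple in-memory cache; the real project may cache differently

def search(query, index="course_content"):
    """Return cached hits for a repeated query, otherwise ask Elasticsearch for a full-text match."""
    key = (index, query)
    if key not in _cache:
        response = requests.post(
            "%s/%s/_search" % (ES_URL, index),
            json={"query": {"match": {"content": query}}},   # 'content' field name is an assumption
        )
        hits = response.json()["hits"]["hits"]
        _cache[key] = [hit["_source"] for hit in hits]
    return _cache[key]
```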
Elasticsearch performs its operations and gives the result back to the searcher. These are the components we have, and all of these interactions happen through Django REST APIs. Now Harsh Mohan will continue.

So, as already explained, the first thing any search engine needs is data, and IIT BombayX has data stores for it. The MongoDB database stores the course content and also the discussion forum data, the events are stored in log files, and the MySQL database stores the student information. Together these constitute a huge amount of data, which needs to be filtered so as to answer relevant queries.

To facilitate this, we have divided the searches into two categories based on the operator and the data store involved. The first category is content-based search, which gives an output based on the content or phrase you type in. For example, "give me all the courses related to pointers": here "pointers" is the content you are searching for in the courses, and the data store turns out to be the MongoDB database. The second category is query-based search. For example, "give me the course having the maximum number of enrollments": here an operator is added, which in this case is "maximum", and then there is a data store; since we want to answer about enrollments, that data store is the MySQL database.

To facilitate this we have built two components, which we can call crawlers. The first is the MySQL crawler, which crawls the MySQL database and brings out useful information, as I will explain next. The second is the MongoDB parser, which facilitates the content-based searches.

The MySQL crawler, as I said earlier, forms the backbone of the query-based searches. It filters out useful information so that it can be used to answer relevant queries. This component itself is made of two parts: the first is Apache Sqoop and the second is Apache Hadoop. The function of Sqoop is to import the required data from multiple IIT BombayX instances and transfer it to HDFS, so that a MapReduce program, written in accordance with the operator, can run on the imported data and produce an output, which is then stored in Elasticsearch and used later to answer queries.

Here is an example. Suppose the query to be answered is "give me the most interactive courses". There are two things here. The word "most" forms the operator: it means we want the course with the maximum number of interactions. The second is "courses": we want to find the data store where interactions versus courses are stored. So the first step is that Sqoop imports the course-versus-interaction data from the database we identified, which is the courseware_studentmodule MySQL table. The next step is to write a MapReduce program that performs in accordance with the operator; here it calculates course ID versus the number of interactions. The slide shows the MapReduce algorithm for this: at the very left, you can see that there are multiple instances, each with courses and the number of interactions in them.
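After Sqoop lands the courseware_studentmodule rows in HDFS (typically as comma-separated text files), a Hadoop Streaming job in Python could count interactions per course roughly as sketched below. The column position of course_id and the file layout are assumptions for illustration, and the project's actual MapReduce program may be written differently, for example in Java.

```python
# mapper.py -- hypothetical Hadoop Streaming mapper over Sqoop-imported
# courseware_studentmodule rows; assumes course_id sits in the second comma-separated column.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    if len(fields) > 1:
        course_id = fields[1]
        print("%s\t1" % course_id)        # emit (course_id, 1) for every interaction row
```

```python
# reducer.py -- sums the 1s per course_id; Hadoop sorts mapper output by key,
# so all values for one course arrive together (the shuffle stage described next).
import sys

current_course, count = None, 0
for line in sys.stdin:
    course_id, value = line.rstrip("\n").split("\t")
    if course_id != current_course:
        if current_course is not None:
            print("%s\t%d" % (current_course, count))   # course_id versus number of interactions
        current_course, count = course_id, 0
    count += int(value)
if current_course is not None:
    print("%s\t%d" % (current_course, count))
```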
So, as MapReduce works on key-value pairs, the mapper first attaches a value to each key in the map stage. In the shuffle stage, the values for similar keys are collected into a list. And in the reduce stage, all the values are added together to give the course versus the number of interactions. This was all about the MySQL crawler. Now, the remaining problem is that the data keeps changing: interactions will increase, enrollments will increase. So we need to crawl at regular intervals. For that, we have written everything into a script and scheduled it, so that it runs from time to time according to the schedule and stores the relevant data. Now he will explain the MongoDB parser.

Hello everyone, I will be explaining the MongoDB parser. As my friend has already discussed, the job of the MongoDB parser is to parse the content of every course stored in the Mongo database. In the picture you can see all the collections of the Mongo database. The content is stored in four collections: modulestore, modulestore.active_versions, modulestore.definitions, and modulestore.structures. There are two versions of courses, a new version and an old version. The old version is stored in modulestore only, and the new version is stored collectively in the other three collections. The technologies used are PyMongo, for querying the Mongo database, and the Natural Language Toolkit, for processing the sentences and words.

Moving on to the structure of the courses. Since there are two versions of courses, the structures are different. The old version of a course has a tree structure, as you can see: the course is the root node, followed by chapters, then sequentials, then verticals, and there are four kinds of leaf nodes: HTML, problems, discussions, and videos. It will be clearer with examples. This is an example of a chapter: in the children part you can see a few URLs, and every URL points to a sequential; you can see "sequential" in the category part. The next example is a vertical: in its children part there are again a few URLs, which point to leaf nodes such as HTML and problems. The next example is a leaf node, an HTML block. Inside its definition part there is a data field, and against data the hypertext is stored, which is the content of the subsection; this content is what we are interested in. There is another part called metadata, and inside metadata there is display_name, which is a short description of the subsection. So we are parsing both of these.
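As a rough illustration of parsing the old-version tree, the sketch below walks a course's children URLs with PyMongo and pulls out each node's display_name and raw data. The connection settings, database name, and the exact shape of the i4x-style child URLs are assumptions based on the slide examples, so treat this as a sketch rather than the actual parser module.

```python
# Hypothetical sketch of walking an old-version course tree in the 'modulestore' collection.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed connection settings
modulestore = client["edxapp"]["modulestore"]        # assumed database name

def find_block(url):
    """Resolve a child URL of the form i4x://org/course/category/name to its document."""
    org, course, category, name = url.replace("i4x://", "").split("/")[:4]
    return modulestore.find_one({"_id.org": org, "_id.course": course,
                                 "_id.category": category, "_id.name": name})

def collect_content(doc, out):
    """Depth-first walk: gather display_name and raw data from every node of the tree."""
    metadata = doc.get("metadata", {})
    definition = doc.get("definition", {})
    out.append({"display_name": metadata.get("display_name", ""),
                "data": definition.get("data", "")})
    for child_url in definition.get("children", []) or []:   # children location may vary by release
        child = find_block(child_url)
        if child is not None:
            collect_content(child, out)

# Usage: start from every old-version course node and collect its subsections.
for course_doc in modulestore.find({"_id.category": "course"}):
    content = []
    collect_content(course_doc, content)
```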
Moving on to the structure of the new version of courses, it has a structure like a linked list, and as I have already said, there are three collections. The first is modulestore.active_versions, which stores an overview of the course; then modulestore.structures, which stores the structure of the course; and then modulestore.definitions, which actually contains the definitions, that is, the content of the subsections. Again, it will be clearer with examples. This is an example of an object from modulestore.active_versions. As you can see, there are only a few descriptive fields; the main thing we are interested in is the published branch. Against it, an ObjectId is stored, which is actually the address of the structure of the course stored in modulestore.structures.

Moving on to modulestore.structures: the first field is the ID, which is the address I just showed you. Then, inside the blocks field, there is a list of JSON objects, and every JSON object represents a subsection of the course. As you can see, the type of this subsection is "course", and this is the same example; another object is there whose type is "about". Another field you can see is "definition", against which another ObjectId is stored; this is the address of the definition of this subsection, which is stored in modulestore.definitions. Moving on to modulestore.definitions, there is again a data part, and inside it there is some hypertext, which is the content of the subsection.

Now, coming to the working of the MongoDB parser. There are two modules, one to parse the old version and another to parse the new version. For the old version, the parser first collects all the distinct names of the courses, then collects all the nodes of each course tree, and then collects the content of every node. It then processes the content with the help of the NLTK libraries: removing the HTML tags, tokenizing, removing the punctuation, and removing the stopwords. Finally it stores the result into Elasticsearch using the store API we created. For the new version of courses, it again collects all the distinct courses from modulestore.active_versions, then looks up the published-branch ID in modulestore.structures, which gives us the structure of the course. Then, for every subsection, it collects the content from modulestore.definitions, processes it in the same way, and stores it into Elasticsearch. Now Mohan will explain the threading.

After crawling and parsing, we use threading. Threading is used to take the crawled and parsed data and push it into Elasticsearch. Here we create threads for each instance: the MongoDB data parser, the number of enrollments, and the number of interactions, so there are separate threads for each type of data.

Coming to Elasticsearch: all the parsed data is passed to Elasticsearch with the help of these threads. We have maintained two APIs to store data into Elasticsearch, and we have maintained three indexes in Elasticsearch: course content, course enrollment, and interactions. Course content comes from MongoDB, while enrollment and interactions come from the MySQL database. So whenever a thread runs, it just calls the store-data API and the data gets stored in Elasticsearch. That is the structure of the indexes we have maintained in Elasticsearch.
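As a rough sketch of this threading step for a single instance, the code below starts one thread per data type, each posting its documents to an assumed store endpoint that writes into the matching Elasticsearch index. The endpoint URL, payload shape, index names, and function names are illustrative assumptions, not the project's actual Django REST API.

```python
# Hypothetical sketch of the threading step: one thread per data type per instance,
# each posting its crawled/parsed documents to an assumed Django REST "store" API.
import threading
import requests

STORE_API = "http://localhost:8000/api/store/"   # assumed endpoint exposed by the Django app

def store(index, documents):
    """POST a batch of documents so the API can index them into the given Elasticsearch index."""
    requests.post(STORE_API, json={"index": index, "documents": documents})

def crawl_and_store(index, crawl_fn, instance):
    store(index, crawl_fn(instance))

def run_for_instance(instance, parse_course_content, crawl_enrollments, crawl_interactions):
    # One thread each for course content (MongoDB parser), enrollments and interactions (MySQL crawler).
    jobs = [("course_content", parse_course_content),
            ("course_enrollment", crawl_enrollments),
            ("interactions", crawl_interactions)]
    threads = [threading.Thread(target=crawl_and_store, args=(index, fn, instance))
               for index, fn in jobs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```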
So we have our demo ready. At first we try some absurd searches, and then I will show the full-text search that is supported by Elasticsearch. This is just the edX demonstration course; we take some random text from it and search for it. So we are taking this "learning sequence" text, which is present in that demonstration course, and we are searching for that text now. What we get is the URL of that course; we get that course in the search result. But two results.

Sir, as we are doing a full-text search on "learning sequence", the word "learning" might have occurred in some other courses or sequences as well.
Both are courses?
Yes, sir, both are courses.
This particular text can be written in any object. Are you searching for objects also, or just the courses?
Sir, currently we are just searching for the courses; objects are included in our future scope.
Okay, that's okay. So if I am searching for something and it is there in the course description, the title, any of the objects related to the course, and the discussion forum...
Yes, sir. While we are parsing, we are considering all these things and we are just...
Including the discussion forum?
Yes, sir.
So if I search for something, you will search in the discussion forums also, and all the courses in which this word is present in the discussion forum, you will bring those courses here as well?
No, sir. The discussion forum, no.
Okay. So only the course description, course title, and course objects. Then it's okay, because if it were the discussion forum, the results would not be useful.
We are using that for another purpose. For example, here we are searching for "minimum enrolments", the course in which the enrolment has been minimum, and we are getting just that; courses in which the word "enrolment" has occurred have also come. Now I am trying to show the minimum-interaction query.
What? Minimum?
Yes, that is what he explained earlier; we call these operators.
Sir, rather than typing them, do they have some buttons or...
No, not buttons, something else. Let us say somebody is searching for "minimum" plus some other word, say "minimum possibility". What will happen then?
Currently we have not implemented a "minimum possibility" query, so it will do a full-text search: it will search for the words "minimum" and "possibility" throughout the parsed data.
No, no, my concern is different. So, for example, minimum is the filter.
Yes, sir.
Is "minimum enrolment" the filter, or is "minimum" the filter?
Minimum is the filter.
So now somebody is searching for "minimum possibility", and that text can be there in many courses. Will your system assume minimum is a filter and search for "possibility", or how will it work?
Sir, it will do both.
Both, okay.
It will first treat it as a filter, but it will not find "possibility" in the mapping, so it will just leave that, and then it will do a full-text search for the entire phrase "minimum possibility".
I see. So can we replace these filters with a slash filter, like "/minimum"? By putting some symbol at the beginning, it would be very easy and convenient.
Any more questions?
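To make the operator-versus-full-text behaviour discussed in these questions concrete, here is one way the searcher could treat a leading word such as "minimum" or "most" as a filter and fall back to full-text search when the rest of the query does not match anything in the mapping. The index and field names follow the three indexes described earlier, but the routing logic itself is an illustrative sketch, not the implemented code.

```python
# Hypothetical sketch of the operator-versus-full-text routing discussed above.
OPERATORS = {"minimum": "asc", "least": "asc", "maximum": "desc", "most": "desc"}

def route_query(query_text):
    """If the query starts with an operator word and names a known metric, build a sorted
    query on the matching index; otherwise fall back to a plain full-text search."""
    words = query_text.lower().split()
    if words and words[0] in OPERATORS:
        metric = " ".join(words[1:])
        if metric in ("enrolment", "enrolments", "enrollment", "enrollments"):
            return {"index": "course_enrollment",
                    "sort": [{"count": {"order": OPERATORS[words[0]]}}], "size": 1}
        if metric in ("interaction", "interactions"):
            return {"index": "interactions",
                    "sort": [{"count": {"order": OPERATORS[words[0]]}}], "size": 1}
    # Operator absent, or the rest of the query is not in the mapping: do a full-text match.
    return {"index": "course_content",
            "query": {"match": {"content": query_text}}}
```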