Okay, now let's move on to the next team: load testing and benchmarking for big data, managed by Mr. Nagesh Karmali and mentored by Mrs. Shukla Nag, Ms. Feroza Aibra and Mr. Nagesh Karmali. I may not be very explanatory, but I'm sure the team is going to prove the worth of their work. Team, please come up.

Hi everyone. We are Aishagmar, Jai Modi and Sunil. Our aim was load testing and benchmarking for big data. The major task was to set up a cluster, run queries on it, and use the BigBench benchmark as the standard tool.

The important terms are load testing, benchmarking and big data. Load testing means steadily increasing the load up to the system's threshold limit. Benchmarking means comparing our own benchmark results with the standard benchmark. Big data refers to data so large that it cannot be processed by traditional processing tools.

The tools we have used are Hadoop, which maintains HDFS, the distributed file system; it stores huge amounts of data and provides a MapReduce engine to perform the processing of that data. Hive is a data warehouse infrastructure built on Hadoop that provides data analysis and querying features. Ganglia is used as the monitoring tool; it monitors our Hadoop cluster. BigBench is used as the industry benchmark; starting from it, we customized our own benchmark.

This is the block diagram. First we set up the cluster, then we generated the data locally, pushed the data onto the Hadoop cluster, populated the Hive tables, ran the benchmark that we had customized, and then analyzed the results.

Moving ahead in the block diagram, the first step was the experimental setup. We obtained a number of PCs from the lab. One of them was used as the master, or the name node, of the Hadoop cluster: an Intel Xeon 2650 server provided by the ASL lab, with the specifications mentioned here. In short, it had eight 16 GB RAM modules, 128 GB of RAM in total; a powerful server. For the data nodes we had simple commodity desktop machines, Core 2 Duo PCs with just 2 GB of RAM each; nothing great. The network was the Ethernet LAN already existing in IIT Bombay.

Once the system was set up, the first task was to generate the data, because to test a benchmark we always need to generate data. We can't pick up data from just anywhere; we need data in the format the benchmark requires. So the PDGF generator, the Parallel Data Generation Framework provided by BigBench itself, is used to generate data in its format. We used that to generate the data and then pushed it onto the Hadoop cluster that was set up using the PCs mentioned, the name node and the data nodes. The data on the cluster has to be organized as directories and files within those directories, so that Hive can straightaway pick them up from there and put them into its tables. Hive is basically a parallel counterpart to an RDBMS, but for big data rather than for small data sets. So we pushed the data onto the cluster, populated the Hive tables in the same format, and then we needed to run the benchmark.
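As a rough illustration of this generate-push-populate flow, here is a minimal Python sketch. The local and HDFS paths, the table name and the column layout are all illustrative assumptions rather than the actual BigBench schema; only the general "hadoop fs -put followed by a Hive external table" pattern is what we described above.

```python
#!/usr/bin/env python3
"""Minimal sketch of the generate -> push -> populate pipeline.
Paths, table and column names are illustrative assumptions, not
the actual BigBench schema."""
import subprocess

LOCAL_DATA_DIR = "/tmp/pdgf_output"           # where PDGF wrote its files (assumed)
HDFS_DATA_DIR = "/user/hadoop/bigbench/data"  # target directory on HDFS (assumed)

def run(cmd):
    """Run a shell command and fail loudly if it breaks."""
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

# 1. Push the locally generated data onto the Hadoop cluster.
run(["hadoop", "fs", "-mkdir", "-p", HDFS_DATA_DIR])
run(["hadoop", "fs", "-put", LOCAL_DATA_DIR, HDFS_DATA_DIR])

# 2. Populate a Hive table directly from the HDFS directory, so Hive
#    picks the files up straight away, as described above.
hiveql = """
CREATE EXTERNAL TABLE IF NOT EXISTS store_sales (
    item_id BIGINT,
    price   DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '{dir}/pdgf_output';
""".format(dir=HDFS_DATA_DIR)
run(["hive", "-e", hiveql])
```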
Out of the 30 queries provided by BigBench we used just nine, because many of those 30 produce empty sets as results, so there was no point in testing them all. Instead we chose nine queries that produce proper, visible results, and ran the benchmark with those.

The nine queries are divided into three categories. Declarative queries are pure SQL-type queries. Procedural queries involve intensive MapReduce tasks rather than SQL tasks. Mixed queries are a mixture of both. These nine queries operate on three types of data: structured (as in tables), semi-structured, or unstructured. Not every query can operate on every type of data: generally a declarative query will operate only on structured data, while a mixed query can operate on either structured or semi-structured data. That kind of relation is always there, because you can't simply run an SQL task on unstructured data, and you can't just run MapReduce on structured data straight away.

After running the queries we recorded their running times and compared them to the standard benchmark. These are the average response times obtained. We ran three experiments. First, the Hadoop cluster had one name node and two data nodes, and we tested its limit. The data size comes from the PDGF generator: you have to specify how much data you want to generate, so we specified 1 GB, then 5 GB, 10 GB, 25 GB, increasing progressively. The two-data-node system failed at 25 GB; it could not process it and could not give an answer. So we moved to three data nodes, which ran up to 100 GB and failed beyond that when we tried testing it. With four data nodes we have yet to test beyond 100 GB; that is included in future work, because if three data nodes run up to 100 GB, four data nodes will surely go beyond it.

This graph compares the response times for three data nodes and four data nodes; for two data nodes we had just three data points, so we left them out of the graph. Red is for three data nodes and green is for four. We can observe some peculiarities here: in most parts green is less than red, but in this middle part green is more than red, and then it is less again. We have tried to explain these peculiarities in the further slides. This line graph represents the same data as the bar graph, and we can see it crossing at two points again; the dotted line represents four data nodes, the solid line three data nodes. The observation is the one I just described: initially three data nodes take more time, then in the middle range three data nodes take less time than four, and then the time increases again. The monitoring tool gives several graphs, among them the memory usage graph and the network usage graph.
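To make the measurement step concrete, here is a minimal sketch of how the nine queries could be timed, assuming each query ships as a HiveQL script. The file names and their grouping into the three categories are illustrative assumptions, not the actual BigBench query numbers.

```python
#!/usr/bin/env python3
"""Minimal sketch of timing the nine selected queries.
Query file names and category assignments are illustrative."""
import subprocess
import time

# Nine of the thirty BigBench queries, grouped as in the talk (assumed names).
QUERIES = {
    "declarative": ["q06.sql", "q07.sql", "q09.sql"],
    "procedural":  ["q01.sql", "q02.sql", "q28.sql"],
    "mixed":       ["q04.sql", "q05.sql", "q30.sql"],
}

results = {}
for category, files in QUERIES.items():
    for qfile in files:
        start = time.time()
        subprocess.check_call(["hive", "-f", qfile])  # run one query end to end
        results[qfile] = time.time() - start
        print("%s (%s): %.1f s" % (qfile, category, results[qfile]))

# Average response time per category, as reported in the graphs.
for category, files in QUERIES.items():
    avg = sum(results[q] for q in files) / len(files)
    print("average %s response time: %.1f s" % (category, avg))
```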
We tried to derive some results from those graphs. One factor the response time depends on is the division of the task across the data nodes: when we run a query, the work is distributed to the nodes, the task is divided, and then the partial results are combined to produce the final output. Another factor is the speed of the network; the IIT Bombay network is a shared network, and that might have increased the response time of the queries. A third is the degree of swapping that takes place because of the memory constraint.

Initially, when the data size is small, the network factor does not matter much, because the intermediate results that have to be transferred do not make much difference. So initially the time for four data nodes is less than for three data nodes, because the division of the task is the more dominating factor. But where the peculiarity appears, we observed a high level of network transfer happening, which causes the lag, and along with that the memory swapping is increasing, almost reaching its maximum limit. That causes the peculiarity. After around 50 GB, once the maximum limit is reached, the effect of swapping becomes constant. At that point, when the data size is larger, the more the task is divided the faster the results are computed, so division of the task again becomes the dominating factor, and we see the graph return to normal behavior. That is our analysis of the data from the graphs obtained from the monitoring tool.

We have also plotted the experimental results against a predicted line. For each data set we took its increase relative to the 1 GB data set, and plotted that scale factor as the solid line; the mean of the slopes of all the intervals gives the predicted line. Then we checked how much deviation we observe from the experimental data. Comparing the experimental and predicted values, we see that within limits of error we can somewhat predict the response time of a query (a small sketch of this calculation follows below). The value 188 corresponds to 1 GB; that 1 GB figure is the baseline we need for scaling up.

This is how our project proceeded. We first studied the existing TPC benchmarks, which are based on relational database systems, and read some research papers. Then we got some machines from the lab and set up a Hadoop cluster. Then we started working with BigBench and modified it a bit to suit our needs: we took those nine queries, generated the data locally, and so on, and then conducted the load testing experiments.

Future scope: we are considering running the experiment on clusters of larger size and on greater data sets, so that we can confirm the result analysis we have obtained and see whether there are any other parameters that affect the response time of the system. We are also thinking of making an open source release of our customized version of BigBench.
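Here is a minimal sketch of the mean-slope prediction described above. Apart from the 188-second figure for the 1 GB run mentioned in the talk, the measured times below are illustrative placeholders, not our actual experimental numbers.

```python
#!/usr/bin/env python3
"""Minimal sketch of the scale-factor prediction. Only the 188 s
baseline for 1 GB comes from the talk; other times are placeholders."""

# (data size in GB, measured response time in seconds)
measured = [(1, 188.0), (5, 900.0), (10, 1750.0), (25, 4300.0), (50, 8500.0)]

baseline_size, baseline_time = measured[0]

# Slope of each interval on the (scale factor, time) line.
slopes = []
for (s0, t0), (s1, t1) in zip(measured, measured[1:]):
    f0, f1 = s0 / baseline_size, s1 / baseline_size  # scale factors vs 1 GB
    slopes.append((t1 - t0) / (f1 - f0))

mean_slope = sum(slopes) / len(slopes)

def predict(size_gb):
    """Predicted response time from the mean-slope line through the 1 GB point."""
    return baseline_time + mean_slope * (size_gb / baseline_size - 1)

for size, actual in measured:
    dev = 100 * (predict(size) - actual) / actual
    print("%3d GB: measured %7.1f s, predicted %7.1f s, deviation %5.1f%%"
          % (size, actual, predict(size), dev))
```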
Good work. Any learning challenges?

The main challenge we faced was that BigBench's 30 queries were too much to run. When we first ran them on a small data set, even that took a long time, so we had to think about how we would manage more experiments given such long response times. As I said, the nine queries take 12 hours to run on 100 GB, so all 30 would take far longer; we had to plan carefully to get good experiments done in the given time period, and we were still short of time.

Another thing is that although BigBench is offered as an industry standard, it is not yet as concrete a benchmark as those available for relational databases, like TPC-C for transaction processing, which is a solid benchmark provided by the TPC council. BigBench is not such a concrete benchmark because big data is evolving and you cannot fix the node size: change the number of nodes and the benchmark performance changes. That standardization is still an open research problem across the world right now, and people are trying to evolve new things to stabilize it, so this field is not yet as stabilized as the relational database field. Those were the main challenges.